[ADD] Add column transformer #305

ravinkohli · 2021-10-26T12:26:48Z

Addresses #304

Specifically, this PR does the following:

- Adds a simple imputer to impute categorical columns.
- Combines the imputer and the encoder into a column transformer.
- Moves the comparator to a class method in tabular_feature_validator.py.
- Adapts the tests to these changes.
- Adds a test for the comparator.
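
For readers unfamiliar with the comparator mentioned above, here is a minimal sketch of what such a method could look like. It assumes the comparator orders feature types so that categorical columns come before numerical ones, which matches how a `ColumnTransformer` with `remainder='passthrough'` lays out its output (transformed columns first, passed-through columns last). The class name `FeatureValidatorSketch` and the exact signature are hypothetical and not taken from this PR.

```python
import functools
from typing import List


class FeatureValidatorSketch:
    """Hypothetical stand-in, not the actual TabularFeatureValidator."""

    @staticmethod
    def _comparator(cmp1: str, cmp2: str) -> int:
        # Assumed ordering: categorical feature types sort before numerical ones.
        choices = ["categorical", "numerical"]
        if cmp1 not in choices or cmp2 not in choices:
            raise ValueError(
                f"Expected one of {choices}, got {cmp1!r} and {cmp2!r}"
            )
        return choices.index(cmp1) - choices.index(cmp2)


feat_types: List[str] = ["numerical", "categorical", "numerical", "categorical"]

# cmp_to_key adapts the two-argument comparator to the key-based sorted() API.
sorted_types = sorted(
    feat_types,
    key=functools.cmp_to_key(FeatureValidatorSketch._comparator),
)
```

Keeping the comparator on the class makes it straightforward to test in isolation, which is presumably what the new comparator test does.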


codecov · 2021-10-26T12:51:10Z

Codecov Report

Merging #305 (a04ba08) into development (9002937) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           development     #305      +/-   ##
===============================================
+ Coverage        81.74%   81.82%   +0.08%     
===============================================
  Files              151      151              
  Lines             8655     8646       -9     
  Branches          1330     1321       -9     
===============================================
  Hits              7075     7075              
+ Misses            1109     1105       -4     
+ Partials           471      466       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| autoPyTorch/api/base_task.py | `84.46% <ø> (ø)` | |
| autoPyTorch/data/base_feature_validator.py | `100.00% <100.00%> (ø)` | |
| autoPyTorch/data/tabular_feature_validator.py | `91.19% <100.00%> (+4.88%)` | ⬆️ |
| ...ts/setup/network_backbone/InceptionTimeBackbone.py | `98.92% <100.00%> (ø)` | |
| autoPyTorch/ensemble/ensemble_builder.py | `72.32% <0.00%> (-0.84%)` | ⬇️ |
| ...peline/components/training/trainer/base_trainer.py | `96.82% <0.00%> (+1.05%)` | ⬆️ |
| ...ipeline/components/setup/network_backbone/utils.py | `88.72% <0.00%> (+1.50%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


nabenabe0928

Thanks for the PR.
I checked everything except autoPyTorch/data/tabular_feature_validator.py.

test/test_data/test_feature_validator.py

nabenabe0928 · 2021-10-29T17:31:12Z

autoPyTorch/data/base_feature_validator.py

@@ -51,7 +51,7 @@ def __init__(self,
        self.dtypes = []  # type: typing.List[str]
        self.column_order = []  # type: typing.List[str]

-        self.encoder = None  # type: typing.Optional[BaseEstimator]
+        self.column_transformer = None  # type: typing.Optional[BaseEstimator]


Suggested change

-        self.column_transformer = None  # type: typing.Optional[BaseEstimator]
+        self.column_transformer: typing.Optional[BaseEstimator] = None

I wanted to leave these changes for your PR. Once my PR is merged, you can rebase and make the edit. That way we avoid making the same change twice and the merge stays easy.
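
For context on the suggestion in this thread: the two spellings declare the same type for static checkers such as mypy and behave identically at runtime. The sketch below illustrates this with two hypothetical classes (`WithTypeComment` and `WithAnnotation` are illustrative names, not part of autoPyTorch).

```python
import typing

from sklearn.base import BaseEstimator


class WithTypeComment:
    def __init__(self) -> None:
        # Older style: the type lives in a comment that only type checkers read.
        self.column_transformer = None  # type: typing.Optional[BaseEstimator]


class WithAnnotation:
    def __init__(self) -> None:
        # Variable annotation syntax (PEP 526, Python 3.6+).
        self.column_transformer: typing.Optional[BaseEstimator] = None


old_style = WithTypeComment()
new_style = WithAnnotation()
```

Both attributes start out as `None`; the difference is purely syntactic.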

nabenabe0928 · 2021-10-29T19:52:57Z

autoPyTorch/data/tabular_feature_validator.py

+def _create_column_transformer(
+    preprocessors: Dict[str, List[BaseEstimator]],
+    categorical_columns: List[str],
+) -> ColumnTransformer:
+    """
+    Given a dictionary of preprocessors, this function
+    creates a sklearn column transformer with appropriate
+    columns associated with their preprocessors.
+
+    Args:
+        preprocessors (Dict[str, List[BaseEstimator]]):
+            Dictionary containing list of numerical and categorical preprocessors.
+        categorical_columns (List[str]):
+            List of names of categorical columns
+
+    Returns:
+        ColumnTransformer
+    """
+
+    categorical_pipeline = make_pipeline(*preprocessors['categorical'])
+
+    return ColumnTransformer([
+        ('categorical_pipeline', categorical_pipeline, categorical_columns)],
+        remainder='passthrough'
+    )
+
+
+def get_tabular_preprocessors() -> Dict[str, List[BaseEstimator]]:
+    """
+    This function creates a Dictionary containing a list
+    of numerical and categorical preprocessors
+
+    Returns:
+        Dict[str, List[BaseEstimator]]
+    """
+    preprocessors: Dict[str, List[BaseEstimator]] = dict()
+
+    # Categorical Preprocessors
+    onehot_encoder = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value',
+                                                  unknown_value=-1)
+    categorical_imputer = SimpleImputer(strategy='constant', copy=False)
+
+    preprocessors['categorical'] = [categorical_imputer, onehot_encoder]
+
+    return preprocessors


Is there a reason you have not made this part the same as in refactor_development_regularization_cocktail?

This PR is intended to replace the previous _impute_nan_in_categoricals with a scikit-learn imputer, and therefore to add a column transformer. If we want to make it the same as refactor_development_regularization_cocktail, I think that is better done by merging its PR #161.
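
As a rough illustration of the imputer this reply refers to, the sketch below runs scikit-learn's `SimpleImputer` with `strategy='constant'` on a single categorical column. The toy data is made up, and the sketch omits `copy=False` from the diff above for simplicity. With no explicit `fill_value`, the constant strategy fills non-numeric data with the string `"missing_value"`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One categorical column with a missing entry (made-up data).
X = np.array([["red"], [np.nan], ["blue"]], dtype=object)

# strategy='constant' with the default fill_value replaces NaN in
# string/object data with the placeholder "missing_value".
imputer = SimpleImputer(strategy="constant")
X_imputed = imputer.fit_transform(X)
```

Because the placeholder is an ordinary string, a downstream encoder treats it as one more category.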

nabenabe0928 · 2021-10-29T19:56:42Z

autoPyTorch/data/tabular_feature_validator.py

-            X = self.impute_nan_in_categories(X)
-
-            X = self.encoder.transform(X)
+            X = self.column_transformer.transform(X)


Just to make sure:
We are imputing in the transform, right?
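
To make the question concrete, here is a minimal sketch of how a fitted `ColumnTransformer` built like the one in this PR behaves at transform time. It assumes the same two steps as the diff (a constant `SimpleImputer` followed by an `OrdinalEncoder` with `unknown_value=-1`). The toy arrays and the use of column index `0` instead of column names are illustrative only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Imputation and encoding are chained, so both run inside transform().
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="constant"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)
column_transformer = ColumnTransformer(
    [("categorical_pipeline", categorical_pipeline, [0])],
    remainder="passthrough",
)

# Column 0 is categorical, column 1 is numerical (made-up data).
X_train = np.array([["a", 1.0], [np.nan, 2.0], ["b", 3.0]], dtype=object)
X_test = np.array([[np.nan, 4.0], ["c", 5.0]], dtype=object)

column_transformer.fit(X_train)
X_out = column_transformer.transform(X_test)

# Row 0: NaN is imputed with a placeholder category and then encoded.
# Row 1: "c" was never seen during fit, so it is encoded as -1.
```

In this sketch no separate imputation call is needed before `transform`, since the pipeline applies the imputer first and the encoder second on every call.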

nabenabe0928 · 2021-10-29T20:03:23Z

autoPyTorch/data/tabular_feature_validator.py

+
+    return preprocessors
+
+
 class TabularFeatureValidator(BaseFeatureValidator):


Could you add a docstring describing the attributes, such as categories, enc_columns, column_transformer, numerical_columns, categorical_columns, all_nan_columns, and column_order?

Definitely.

franchuterivera and others added 9 commits March 1, 2021 18:07

Match paper libraries-versions

a0059db

Merge pull request automl#120 from automl/fix_pip_version

7c95381

Match paper libraries-versions

Update README.md

6bf49c6

Update README.md

70ce56b

Update README.md

865feb3

[FIX] master branch README (automl#209)

a3a3257

Enable github actions (automl#273)

295c538

Update README.md

251f475

Create CITATION.cff

96d9372

ravinkohli requested a review from nabenabe0928 October 26, 2021 12:29

ravinkohli added the enhancement New feature or request label Oct 26, 2021

ravinkohli added 2 commits October 26, 2021 14:34

Added column transformer, changed requirements and added tests

761170c

pull from upstream

4da0f38

remove redundant lines

4b72887

ravinkohli force-pushed the add-col_tfr branch from f1e837d to 4b72887 Compare October 26, 2021 15:01

ravinkohli added 6 commits October 26, 2021 17:04

Remove unwanted change made

657f765

Fix bug in test api and dummy forward pass

7f38bb1

Fix silly bugs

7c4beaf

increase time to pass test

30992e8

remove parallel capabilities of traditional learners to resolve bug in docs building

fae0025

almost fixed

510e598

nabenabe0928 reviewed Oct 29, 2021


ravinkohli added 4 commits November 2, 2021 17:21

Add documentation for tabularfeaturevalidator

e9cb0db

fix flake

50e5154

fix silly bug

755e725

address comment from shuhei

4468a49

ravinkohli added 4 commits November 3, 2021 11:30

rename enc_columns to transformed_columns in the remaining places

8cfae5d

fix bug in test

686f182

fix mypy

efde1d5

add shuhei's suggestion

a04ba08

nabenabe0928 merged commit a11caf4 into automl:development Nov 4, 2021

github-actions bot pushed a commit that referenced this pull request Nov 4, 2021

Ravin Kohli: [ADD] Add column transformer (#305)

50ec10a
