-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADD] Add column transformer #305
Conversation
Match paper libraries-versions
Codecov Report
@@ Coverage Diff @@
## development #305 +/- ##
===============================================
+ Coverage 81.74% 81.82% +0.08%
===============================================
Files 151 151
Lines 8655 8646 -9
Branches 1330 1321 -9
===============================================
Hits 7075 7075
+ Misses 1109 1105 -4
+ Partials 471 466 -5
Continue to review full report at Codecov.
|
f1e837d
to
4b72887
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR
I checked everything except autoPyTorch/data/tabular_feature_validator.py.
@@ -51,7 +51,7 @@ def __init__(self, | |||
self.dtypes = [] # type: typing.List[str] | |||
self.column_order = [] # type: typing.List[str] | |||
|
|||
self.encoder = None # type: typing.Optional[BaseEstimator] | |||
self.column_transformer = None # type: typing.Optional[BaseEstimator] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.column_transformer = None # type: typing.Optional[BaseEstimator] | |
self.column_transformer: typing.Optional[BaseEstimator] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I wanted to leave these changes for your PR. Once my PR is merged, you can rebase and edit it. This will ensure we don't make the same changes twice and an easy merge.
def _create_column_transformer( | ||
preprocessors: Dict[str, List[BaseEstimator]], | ||
categorical_columns: List[str], | ||
) -> ColumnTransformer: | ||
""" | ||
Given a dictionary of preprocessors, this function | ||
creates a sklearn column transformer with appropriate | ||
columns associated with their preprocessors. | ||
|
||
Args: | ||
preprocessors (Dict[str, List[BaseEstimator]]): | ||
Dictionary containing list of numerical and categorical preprocessors. | ||
categorical_columns (List[str]): | ||
List of names of categorical columns | ||
|
||
Returns: | ||
ColumnTransformer | ||
""" | ||
|
||
categorical_pipeline = make_pipeline(*preprocessors['categorical']) | ||
|
||
return ColumnTransformer([ | ||
('categorical_pipeline', categorical_pipeline, categorical_columns)], | ||
remainder='passthrough' | ||
) | ||
|
||
|
||
def get_tabular_preprocessors() -> Dict[str, List[BaseEstimator]]: | ||
""" | ||
This function creates a Dictionary containing a list | ||
of numerical and categorical preprocessors | ||
|
||
Returns: | ||
Dict[str, List[BaseEstimator]] | ||
""" | ||
preprocessors: Dict[str, List[BaseEstimator]] = dict() | ||
|
||
# Categorical Preprocessors | ||
onehot_encoder = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value', | ||
unknown_value=-1) | ||
categorical_imputer = SimpleImputer(strategy='constant', copy=False) | ||
|
||
preprocessors['categorical'] = [categorical_imputer, onehot_encoder] | ||
|
||
return preprocessors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any reasons why you have not made this part same as the refactor_development_regularization_cocktail
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is intentioned to replace the previous _impute_nan_in_categoricals
with a Sklearn Imputer and thus add a column transformer. If we would like to make it same as refactor_development_regularization_cocktail
I think that can be better done by merging its PR #161 .
X = self.impute_nan_in_categories(X) | ||
|
||
X = self.encoder.transform(X) | ||
X = self.column_transformer.transform(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure:
We are imputing in the transform, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
|
||
return preprocessors | ||
|
||
|
||
class TabularFeatureValidator(BaseFeatureValidator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add doc-string about the attributes such as categories
, enc_columns
, column_transformer
, numerical_columns
, categorical_columns
, all_nan_columns
, column_order
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely.
Addresses #304
Specifically, this PR adds the following