Commit
Ravin Kohli: [ADD] Subsampling Dataset (automl#398)
GitHub Actions committed Mar 9, 2022
1 parent f002c93 commit 742f1f1
Showing 34 changed files with 490 additions and 352 deletions.
Binary file modified development/_images/sphx_glr_example_plot_over_time_001.png
Binary file modified development/_images/sphx_glr_example_plot_over_time_thumb.png
Binary file modified development/_images/sphx_glr_example_visualization_001.png
Binary file modified development/_images/sphx_glr_example_visualization_thumb.png
34 changes: 24 additions & 10 deletions development/_modules/autoPyTorch/api/tabular_classification.html
@@ -125,7 +125,8 @@ <h1>Source code for autoPyTorch.api.tabular_classification</h1><div class="highl
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.data.tabular_validator</span> <span class="kn">import</span> <span class="n">TabularInputValidator</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.data.utils</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">get_dataset_compression_mapping</span>
<span class="n">DatasetCompressionSpec</span><span class="p">,</span>
<span class="n">get_dataset_compression_mapping</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.datasets.base_dataset</span> <span class="kn">import</span> <span class="n">BaseDatasetPropertiesType</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.datasets.resampling_strategy</span> <span class="kn">import</span> <span class="p">(</span>
@@ -279,7 +280,7 @@ <h1>Source code for autoPyTorch.api.tabular_classification</h1><div class="highl
<span class="n">resampling_strategy</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">ResamplingStrategies</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">resampling_strategy_args</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_name</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_compression</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Mapping</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_compression</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">DatasetCompressionSpec</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">TabularDataset</span><span class="p">,</span> <span class="n">TabularInputValidator</span><span class="p">]:</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Returns an object of `TabularDataset` and an object of</span>
@@ -303,6 +304,10 @@ <h1>Source code for autoPyTorch.api.tabular_classification</h1><div class="highl
<span class="sd"> in ```datasets/resampling_strategy.py```.</span>
<span class="sd"> dataset_name (Optional[str]):</span>
<span class="sd"> name of the dataset, used as experiment name.</span>
<span class="sd"> dataset_compression (Optional[DatasetCompressionSpec]):</span>
<span class="sd"> specifications for dataset compression. For more info check</span>
<span class="sd"> documentation for `BaseTask.get_dataset`.</span>

<span class="sd"> Returns:</span>
<span class="sd"> TabularDataset:</span>
<span class="sd"> the dataset object.</span>
@@ -509,14 +514,23 @@ <h1>Source code for autoPyTorch.api.tabular_classification</h1><div class="highl
<span class="sd"> listed in ``&quot;methods&quot;`` will not be performed.</span>

<span class="sd"> **methods**</span>
<span class="sd"> We currently provide the following methods for reducing the dataset size.</span>
<span class="sd"> These can be provided in a list and are performed in the order as given.</span>
<span class="sd"> * ``&quot;precision&quot;`` - We reduce floating point precision as follows:</span>
<span class="sd"> * ``np.float128 -&gt; np.float64``</span>
<span class="sd"> * ``np.float96 -&gt; np.float64``</span>
<span class="sd"> * ``np.float64 -&gt; np.float32``</span>
<span class="sd"> * pandas dataframes are reduced using the downcast option of `pd.to_numeric`</span>
<span class="sd"> to the lowest possible precision.</span>
<span class="sd"> We currently provide the following methods for reducing the dataset size.</span>
<span class="sd"> These can be provided in a list and are performed in the order as given.</span>
<span class="sd"> * ``&quot;precision&quot;`` -</span>
<span class="sd"> We reduce floating point precision as follows:</span>
<span class="sd"> * ``np.float128 -&gt; np.float64``</span>
<span class="sd"> * ``np.float96 -&gt; np.float64``</span>
<span class="sd"> * ``np.float64 -&gt; np.float32``</span>
<span class="sd"> * pandas dataframes are reduced using the downcast option of `pd.to_numeric`</span>
<span class="sd"> to the lowest possible precision.</span>
<span class="sd"> * ``subsample`` -</span>
<span class="sd"> We subsample data such that it **fits directly into</span>
<span class="sd"> the memory allocation** ``memory_allocation * memory_limit``.</span>
<span class="sd"> Therefore, this should likely be the last method listed in</span>
<span class="sd"> ``&quot;methods&quot;``.</span>
<span class="sd"> Subsampling takes into account classification labels and stratifies</span>
<span class="sd"> accordingly. We guarantee that at least one occurrence of each</span>
<span class="sd"> label is included in the sampled set.</span>

<span class="sd"> Returns:</span>
<span class="sd"> self</span>
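The new ``"subsample"`` method documented above promises that classification labels are taken into account and that every label keeps at least one occurrence in the sampled set. A minimal sketch of that guarantee (not autoPyTorch's actual stratified implementation; the function name and uniform fill-in draw are illustrative assumptions):

```python
import numpy as np

def stratified_subsample(X: np.ndarray, y: np.ndarray, n_samples: int,
                         seed: int = 0):
    """Subsample rows so every class label keeps at least one occurrence.

    Hypothetical sketch of the documented guarantee; autoPyTorch's real
    subsampler additionally stratifies proportionally by label.
    """
    rng = np.random.default_rng(seed)
    # Reserve one guaranteed row per label, as the docstring promises.
    _, first_per_label = np.unique(y, return_index=True)
    keep = set(first_per_label.tolist())
    # Fill the remaining budget with a uniform random draw over the rest.
    rest = np.array([i for i in range(len(y)) if i not in keep])
    budget = max(0, n_samples - len(keep))
    if budget and len(rest):
        extra = rng.choice(rest, size=min(budget, len(rest)), replace=False)
        keep.update(extra.tolist())
    idx = np.array(sorted(keep))
    return X[idx], y[idx]

# Heavily imbalanced toy data: labels 1 and 2 occur exactly once each,
# yet both survive a subsample down to 4 rows.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
Xs, ys = stratified_subsample(X, y, n_samples=4)
```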
33 changes: 23 additions & 10 deletions development/_modules/autoPyTorch/api/tabular_regression.html
@@ -125,7 +125,8 @@ <h1>Source code for autoPyTorch.api.tabular_regression</h1><div class="highlight
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.data.tabular_validator</span> <span class="kn">import</span> <span class="n">TabularInputValidator</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.data.utils</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">get_dataset_compression_mapping</span>
<span class="n">DatasetCompressionSpec</span><span class="p">,</span>
<span class="n">get_dataset_compression_mapping</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.datasets.base_dataset</span> <span class="kn">import</span> <span class="n">BaseDatasetPropertiesType</span>
<span class="kn">from</span> <span class="nn">autoPyTorch.datasets.resampling_strategy</span> <span class="kn">import</span> <span class="p">(</span>
@@ -280,7 +281,7 @@ <h1>Source code for autoPyTorch.api.tabular_regression</h1><div class="highlight
<span class="n">resampling_strategy</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">ResamplingStrategies</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">resampling_strategy_args</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_name</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_compression</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Mapping</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">dataset_compression</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">DatasetCompressionSpec</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">TabularDataset</span><span class="p">,</span> <span class="n">TabularInputValidator</span><span class="p">]:</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Returns an object of `TabularDataset` and an object of</span>
@@ -304,6 +305,9 @@ <h1>Source code for autoPyTorch.api.tabular_regression</h1><div class="highlight
<span class="sd"> in ```datasets/resampling_strategy.py```.</span>
<span class="sd"> dataset_name (Optional[str]):</span>
<span class="sd"> name of the dataset, used as experiment name.</span>
<span class="sd"> dataset_compression (Optional[DatasetCompressionSpec]):</span>
<span class="sd"> specifications for dataset compression. For more info check</span>
<span class="sd"> documentation for `BaseTask.get_dataset`.</span>
<span class="sd"> Returns:</span>
<span class="sd"> TabularDataset:</span>
<span class="sd"> the dataset object.</span>
@@ -510,14 +514,23 @@ <h1>Source code for autoPyTorch.api.tabular_regression</h1><div class="highlight
<span class="sd"> listed in ``&quot;methods&quot;`` will not be performed.</span>

<span class="sd"> **methods**</span>
<span class="sd"> We currently provide the following methods for reducing the dataset size.</span>
<span class="sd"> These can be provided in a list and are performed in the order as given.</span>
<span class="sd"> * ``&quot;precision&quot;`` - We reduce floating point precision as follows:</span>
<span class="sd"> * ``np.float128 -&gt; np.float64``</span>
<span class="sd"> * ``np.float96 -&gt; np.float64``</span>
<span class="sd"> * ``np.float64 -&gt; np.float32``</span>
<span class="sd"> * pandas dataframes are reduced using the downcast option of `pd.to_numeric`</span>
<span class="sd"> to the lowest possible precision.</span>
<span class="sd"> We currently provide the following methods for reducing the dataset size.</span>
<span class="sd"> These can be provided in a list and are performed in the order as given.</span>
<span class="sd"> * ``&quot;precision&quot;`` -</span>
<span class="sd"> We reduce floating point precision as follows:</span>
<span class="sd"> * ``np.float128 -&gt; np.float64``</span>
<span class="sd"> * ``np.float96 -&gt; np.float64``</span>
<span class="sd"> * ``np.float64 -&gt; np.float32``</span>
<span class="sd"> * pandas dataframes are reduced using the downcast option of `pd.to_numeric`</span>
<span class="sd"> to the lowest possible precision.</span>
<span class="sd"> * ``subsample`` -</span>
<span class="sd"> We subsample data such that it **fits directly into</span>
<span class="sd"> the memory allocation** ``memory_allocation * memory_limit``.</span>
<span class="sd"> Therefore, this should likely be the last method listed in</span>
<span class="sd"> ``&quot;methods&quot;``.</span>
<span class="sd"> Subsampling takes into account classification labels and stratifies</span>
<span class="sd"> accordingly. We guarantee that at least one occurrence of each</span>
<span class="sd"> label is included in the sampled set.</span>

<span class="sd"> Returns:</span>
<span class="sd"> self</span>
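Both APIs in this commit change `dataset_compression` from a plain `Mapping[str, Any]` to the new `DatasetCompressionSpec`. A hypothetical sketch of such a spec, with key names taken from the docstrings above (`"memory_allocation"`, `"methods"`); the full set of accepted values is documented in `BaseTask.get_dataset`:

```python
# Hypothetical compression spec; pass it to the task as
# dataset_compression=spec (the task construction itself is not run here).
spec = {
    # Fraction multiplied by memory_limit to get the dataset's budget,
    # per the docstring's "memory_allocation * memory_limit".
    "memory_allocation": 0.1,
    # Applied in the listed order; "subsample" fits the data directly
    # into the allocation, so the docs advise listing it last.
    "methods": ["precision", "subsample"],
}
```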
@@ -85,26 +85,23 @@ Image Classification
Pipeline Random Config:
________________________________________
Configuration(values={
'image_augmenter:GaussianBlur:use_augmenter': False,
'image_augmenter:GaussianBlur:sigma_min': 1.2329755725391824,
'image_augmenter:GaussianBlur:sigma_offset': 2.17995589356565,
'image_augmenter:GaussianBlur:use_augmenter': True,
'image_augmenter:GaussianNoise:use_augmenter': False,
'image_augmenter:RandomAffine:rotate': 237,
'image_augmenter:RandomAffine:scale_offset': 0.32734086551151986,
'image_augmenter:RandomAffine:shear': 43,
'image_augmenter:RandomAffine:translate_percent_offset': 0.2150833219469362,
'image_augmenter:RandomAffine:use_augmenter': True,
'image_augmenter:RandomCutout:p': 0.6425251463645631,
'image_augmenter:RandomCutout:use_augmenter': True,
'image_augmenter:Resize:use_augmenter': False,
'image_augmenter:ZeroPadAndCrop:percent': 0.2638607299100123,
'normalizer:__choice__': 'ImageNormalizer',
'image_augmenter:RandomAffine:use_augmenter': False,
'image_augmenter:RandomCutout:use_augmenter': False,
'image_augmenter:Resize:use_augmenter': True,
'image_augmenter:ZeroPadAndCrop:percent': 0.33852145254374955,
'normalizer:__choice__': 'NoNormalizer',
})

Fitting the pipeline...
________________________________________
ImageClassificationPipeline
________________________________________
0-) normalizer:
ImageNormalizer
NoNormalizer

1-) preprocessing:
EarlyPreprocessing
@@ -176,7 +173,7 @@ Image Classification
.. rst-class:: sphx-glr-timing

**Total running time of the script:** ( 0 minutes 8.801 seconds)
**Total running time of the script:** ( 0 minutes 6.608 seconds)


.. _sphx_glr_download_examples_20_basics_example_image_classification.py:
@@ -134,7 +134,7 @@ Search for an ensemble of machine learning algorithms
.. code-block:: none
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7fcb49398130>
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f2712518fd0>
@@ -165,26 +165,21 @@ Print the final ensemble performance

.. code-block:: none
{'accuracy': 0.861271676300578}
{'accuracy': 0.8670520231213873}
| | Preprocessing | Estimator | Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.24 |
| 1 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.16 |
| 2 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.14 |
| 3 | None | CBLearner | 0.1 |
| 4 | None | SVMLearner | 0.08 |
| 5 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
| 6 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
| 7 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,SRC | embedding,MLPBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
| 8 | None | LGBMLearner | 0.04 |
| 9 | None | RFLearner | 0.04 |
| 10 | None | KNNLearner | 0.04 |
| 0 | None | CBLearner | 0.32 |
| 1 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.2 |
| 2 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,SRC | embedding,MLPBackbone,FullyConnectedHead,nn.Sequential | 0.2 |
| 3 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.12 |
| 4 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.08 |
| 5 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.08 |
autoPyTorch results:
Dataset name: Australian
Optimisation Metric: accuracy
Best validation score: 0.8713450292397661
Number of target algorithm runs: 20
Number of successful target algorithm runs: 18
Number of target algorithm runs: 21
Number of successful target algorithm runs: 19
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 2
Number of target algorithms that exceeded the memory limit: 0
@@ -196,7 +191,7 @@ Print the final ensemble performance
.. rst-class:: sphx-glr-timing

**Total running time of the script:** ( 5 minutes 32.091 seconds)
**Total running time of the script:** ( 5 minutes 20.869 seconds)


.. _sphx_glr_download_examples_20_basics_example_tabular_classification.py: