
[dask] Add type hints in Dask package #3866

Merged (25 commits merged into master, Jan 29, 2021)
Conversation

jameslamb (Collaborator)

As part of #3756 , this PR proposes type hints for the Dask package.

This PR attempts to document the current state of the package, and isn't a proposal about the final state for release 3.2.0. So, for example, right now X and y in .predict() have to be Dask collections, but I could see it being useful to allow other data types in the future. I'd like to defer that discussion to other issues / PRs.

Notes for Reviewers

  • I'm pushing this on the docs/jlamb branch because that branch already has the readthedocs build enabled, and I want to see what effect, if any, this change has on the documentation.
  • I want to add isinstance() checks on the types of arguments to fit() and predict() that should be Dask collections, as mentioned in Support DataTable in Dask #3830 (comment), but I'm proposing that be done in a separate PR.
  • It might be weird to see that the hint for **kwargs is not Dict[str, Any]. I've learned that, per PEP 484, the hint on * and ** arguments describes the set of allowable types for the VALUES, not the whole container. https://www.python.org/dev/peps/pep-0484/#arbitrary-argument-lists-and-default-argument-values
  • The annotation Type[] in a hint says "this should be a class, not an instance". So df_class: Type[pd.DataFrame] means you pass pd.DataFrame itself or a subclass, whereas df: pd.DataFrame means a constructed data frame is expected (see the sketch after this list).
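A minimal sketch of the two conventions above, using hypothetical function names (not the actual dask.py signatures):

```python
from typing import Any, Type

import pandas as pd


def run_training(num_boost_round: int = 100, **kwargs: Any) -> None:
    # Per PEP 484, the hint on **kwargs describes each VALUE, so
    # "**kwargs: Any" means kwargs behaves like Dict[str, Any] inside the body.
    print(num_boost_round, kwargs)


def build_frame(df_class: Type[pd.DataFrame]) -> pd.DataFrame:
    # Type[pd.DataFrame] expects the class itself (or a subclass),
    # while the return annotation expects a constructed instance.
    return df_class({"x": [1, 2, 3]})


run_training(num_boost_round=10, learning_rate=0.1)  # value types can be anything
build_frame(pd.DataFrame)  # pass the class, not an instance
```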

I ran mypy python-package/lightgbm/dask.py. There are still some mypy issues, but I don't think any are a direct result of this PR, and some come from LightGBM's dependencies. mypy can also get easily confused by mixin classes, which we use a lot in the Dask package and in the sklearn classes it imports. So I'm not worrying too much about the mypy output below right now (other than the one issue I fixed in #3865).

mypy results for the curious
python-package/lightgbm/dask.py:15: error: Skipping analyzing 'numpy': found module but no type hints or library stubs
python-package/lightgbm/dask.py:15: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
python-package/lightgbm/dask.py:16: error: Skipping analyzing 'scipy.sparse': found module but no type hints or library stubs
python-package/lightgbm/dask.py:16: error: Skipping analyzing 'scipy': found module but no type hints or library stubs
python-package/lightgbm/dask.py:71: error: Incompatible return value type (got "Optional[int]", expected "int")
python-package/lightgbm/dask.py:94: error: Need type annotation for 'lightgbm_ports' (hint: "lightgbm_ports: Set[<type>] = ...")
python-package/lightgbm/dask.py:111: error: Value of type "Iterable[Union[Any, Any, Any, Any]]" is not indexable
python-package/lightgbm/dask.py:113: error: Value of type "Iterable[Union[Any, Any, Any, Any]]" is not indexable
python-package/lightgbm/dask.py:115: error: Value of type "Iterable[Union[Any, Any, Any, Any]]" is not indexable
python-package/lightgbm/dask.py:118: error: Value of type "Iterable[Union[Any, Any, Any, Any]]" is not indexable
python-package/lightgbm/dask.py:246: error: Name 'tree_learner' is not defined
python-package/lightgbm/dask.py:285: error: "Dict[str, Any]" has no attribute "status"
python-package/lightgbm/dask.py:286: error: Incompatible return value type (got "Dict[str, Any]", expected "LGBMModel")
python-package/lightgbm/dask.py:289: error: "Dict[str, Any]" has no attribute "key"; maybe "keys"?
python-package/lightgbm/dask.py:455: error: "_DaskLGBMModel" has no attribute "get_params"
python-package/lightgbm/dask.py:468: error: "_DaskLGBMModel" has no attribute "set_params"
python-package/lightgbm/dask.py:474: error: "_DaskLGBMModel" has no attribute "get_params"
python-package/lightgbm/dask.py:480: error: "_DaskLGBMModel" has no attribute "get_params"
python-package/lightgbm/dask.py:490: error: Signature of "fit" incompatible with supertype "LGBMClassifier"
python-package/lightgbm/dask.py:490: error: Signature of "fit" incompatible with supertype "LGBMModel"
python-package/lightgbm/dask.py:499: error: Incompatible return value type (got "_DaskLGBMModel", expected "DaskLGBMClassifier")
python-package/lightgbm/dask.py:509: error: Item "None" of "Optional[str]" has no attribute "partition"
python-package/lightgbm/dask.py:515: error: Signature of "predict" incompatible with supertype "LGBMClassifier"
python-package/lightgbm/dask.py:515: error: Signature of "predict" incompatible with supertype "LGBMModel"
python-package/lightgbm/dask.py:526: error: Signature of "predict_proba" incompatible with supertype "LGBMClassifier"
python-package/lightgbm/dask.py:550: error: Signature of "fit" incompatible with supertype "LGBMRegressor"
python-package/lightgbm/dask.py:550: error: Signature of "fit" incompatible with supertype "LGBMModel"
python-package/lightgbm/dask.py:559: error: Incompatible return value type (got "_DaskLGBMModel", expected "DaskLGBMRegressor")
python-package/lightgbm/dask.py:569: error: Item "None" of "Optional[str]" has no attribute "partition"
python-package/lightgbm/dask.py:575: error: Signature of "predict" incompatible with supertype "LGBMModel"
python-package/lightgbm/dask.py:598: error: Signature of "fit" incompatible with supertype "LGBMRanker"
python-package/lightgbm/dask.py:598: error: Signature of "fit" incompatible with supertype "LGBMModel"
python-package/lightgbm/dask.py:612: error: Incompatible return value type (got "_DaskLGBMModel", expected "DaskLGBMRanker")
python-package/lightgbm/dask.py:623: error: Item "None" of "Optional[str]" has no attribute "partition"
python-package/lightgbm/dask.py:629: error: Signature of "predict" incompatible with supertype "LGBMModel"
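A hedged aside, not a change made in this PR: the "found module but no type hints or library stubs" errors above come from untyped dependencies and can be silenced per-import with a type: ignore comment (a project-wide alternative is mypy's ignore_missing_imports setting). The import aliases below are illustrative only.

```python
# Silence missing-stub errors for untyped dependencies on the import line itself.
import numpy as np          # type: ignore
import scipy.sparse as ss   # type: ignore
```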

jameslamb (Collaborator, Author)

@ffineis if you have time, could you take a look at this diff too, since you've recently worked with the internals of the Dask module?

ffineis (Contributor) commented Jan 27, 2021

@ffineis if you have time, could you take a look at this diff too, since you've recently worked with the internals of the Dask module?

Yep on it tonight

y: _DaskCollection,
sample_weight: Optional[_DaskCollection] = None,
init_score: Optional[_DaskCollection] = None,
group: Optional[_1DArrayLike] = None,
ffineis (Contributor) commented Jan 27, 2021:

Think this might be group: Optional[_DaskCollection]; the group array is still distributed at this point. Since group can be so much smaller than both X and y, I think supporting a locally-defined group input list or array is a noble cause, but that would need to be its own PR, one that defines which parts of group get sent where to accompany the X and y parts.

If this comment stands, then _1DArrayLike can be removed at the top of this file.

jameslamb (Collaborator, Author):

oooooo ok! I misunderstood what was happening. Awesome, I'll look again and then change that hint.

I'll also write up a feature request for this. I think it's something that's non-breaking and additive, that could be done after 3.2.0. But like you said, since group is often fairly small, I think it could be a nice thing to be able to specify it as a list or lil numpy array.
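A hedged sketch of the distinction being discussed: today the hint should be Optional[_DaskCollection] because group arrives as a distributed object, while accepting a small local array is the possible future feature. The chunk size below is illustrative only.

```python
import dask.array as da
import numpy as np

# A local 1-D group vector: the kind of input proposed as a future enhancement.
group_local = np.array([10, 10, 10, 10])

# The distributed equivalent: what the current hint Optional[_DaskCollection] expects.
group_dask = da.from_array(group_local, chunks=2)

print(type(group_local))  # <class 'numpy.ndarray'>
print(type(group_dask))   # <class 'dask.array.core.Array'>
```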

params: Dict[str, Any],
model_factory: Type[LGBMModel],
sample_weight: Optional[_DaskCollection] = None,
group: Optional[_1DArrayLike] = None,
ffineis (Contributor):

group: Optional[_DaskCollection] = None,

It would be great if the docs also reflected that sample_weight and group are, up to this point, still distributed vectors.

jameslamb (Collaborator, Author):

I'll update the docs in this PR too, I think that's directly related to this scope. Thanks!

jameslamb (Collaborator, Author):

Thinking through this, I realized that with the way we do doc inheritance, it will actually be a bit tricky to override the docs for group.

I wrote up my thoughts in #3871, but I won't update the docs (except those on internal functions) in this PR.


model = model_factory(**self.get_params())
self._copy_extra_params(self, model)
return model

@staticmethod
def _copy_extra_params(source, dest):
def _copy_extra_params(source: "_DaskLGBMModel", dest: "_DaskLGBMModel") -> None:
ffineis (Contributor):

I think that dest could refer to either an LGBMModel or a _DaskLGBMModel, referring to line 473 (in _to_local).

def _copy_extra_params(source: "_DaskLGBMModel", dest: Union["_DaskLGBMModel", LGBMModel]) -> None:

jameslamb (Collaborator, Author):

shoot, you're totally right. good eye

jameslamb (Collaborator, Author):

fixed in bfd9dc0
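For reference, a hedged sketch of a helper carrying the corrected Union hint. The body is illustrative only (it copies attributes of source that are not sklearn params onto dest) and is not the actual dask.py implementation; in the real module, get_params() comes from the sklearn mixin classes, which is also why mypy complains about it in the log above.

```python
from typing import Union

from lightgbm import LGBMModel


class _DaskLGBMModel:
    # In lightgbm.dask this class is mixed in with the sklearn estimators, which
    # provide get_params(); this standalone copy is a sketch only.

    @staticmethod
    def _copy_extra_params(source: "_DaskLGBMModel",
                           dest: Union["_DaskLGBMModel", LGBMModel]) -> None:
        # Copy any instance attributes of ``source`` that are not sklearn params.
        params = source.get_params()  # provided by the sklearn mixin in the real class
        attributes = source.__dict__
        extra_param_names = set(attributes.keys()) - set(params.keys())
        for name in extra_param_names:
            setattr(dest, name, attributes[name])
```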

@@ -102,7 +107,7 @@ def _find_ports_for_workers(client: Client, worker_addresses: Iterable[str], loc
return worker_ip_to_port


def _concat(seq):
def _concat(seq: Iterable[_DaskPart]) -> _DaskPart:
ffineis (Contributor):

I get that technically any Iterable (tuple/list/np.array/...) could work for seq here, but what is the benefit of using Iterable over List? I guess you would never use the List type hint for an input parameter (i.e. always prefer Iterable over List) unless the function used some list-specific feature like .sort() or mutability...?

I only raise this because I came across this S/O answer: https://stackoverflow.com/a/52827511/14480058; not sure what to think.

jameslamb (Collaborator, Author):

I didn't have any special reason for choosing Iterable instead of List. I was just looking at the code and didn't see any list-specific stuff so I thought it made sense to use the more general thing.

But since this is a totally internal function where we control the input, and since Iterable does weird stuff as that S/O post points out, I'll change this to List[].
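A minimal sketch of the trade-off (hypothetical helpers, not the real _concat): mypy rejects indexing on an Iterable, which is exactly the "not indexable" error in the log above, while List (or Sequence) supports it.

```python
from typing import Iterable, List

import numpy as np


def first_part_iterable(seq: Iterable[np.ndarray]) -> np.ndarray:
    # mypy flags this: Value of type "Iterable[ndarray]" is not indexable
    return seq[0]


def first_part_list(seq: List[np.ndarray]) -> np.ndarray:
    # fine: List supports __getitem__
    return seq[0]
```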

jameslamb (Collaborator, Author):

fixed in bfd9dc0

dtype=np.float32, **kwargs):
def _predict(
model: LGBMModel,
data: _DaskCollection,
ffineis (Contributor) commented Jan 27, 2021:

The corresponding docstr for the data parameter reads dask array of shape = [n_samples, n_features] - consider changing it to data : dask array or dask DataFrame of shape = [n_samples, n_features]?

jameslamb (Collaborator, Author):

yup will do in this PR, thanks!
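A hedged sketch of the kind of numpydoc wording being requested here; the signature is simplified and the exact docstring in dask.py may differ.

```python
def _predict(model, data, **kwargs):
    """Inner predict routine.

    Parameters
    ----------
    model : lightgbm.LGBMModel
        Fitted underlying model.
    data : dask array or dask DataFrame of shape = [n_samples, n_features]
        Input feature matrix.
    **kwargs
        Other parameters passed to the underlying ``predict()`` call.
    """
    ...
```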

jameslamb (Collaborator, Author):

fixed in bfd9dc0

StrikerRUS (Collaborator) commented Jan 27, 2021

@jameslamb
Could you please replace all occurrences of dask array with dask Array, for consistency with dask DataFrame and dask Series?

jameslamb (Collaborator, Author)

@jameslamb
Could you please replace all occurrences of dask array with dask Array, for consistency with dask DataFrame and dask Series?

oh sure, no problem. done in b20ac37

jameslamb (Collaborator, Author)

The type hints are currently shown in method signatures on the docs site:

[screenshot: DaskLGBMClassifier.fit() signature rendered with the full type hints]

from https://lightgbm.readthedocs.io/en/docs-jlamb/pythonapi/lightgbm.DaskLGBMClassifier.html#lightgbm.DaskLGBMClassifier.fit

I think this makes the method signature really hard to read. I'm going to try disabling this by setting autodoc_typehints = "none" in conf.py (https://stackoverflow.com/a/63560720/3986677).

StrikerRUS (Collaborator)

I'm going to try disabling this by setting autodoc_typehints = "none" in conf.py

Don't forget to bump the minimum required Sphinx version then.

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#confval-autodoc_typehints

needs_sphinx = '1.3' # Due to sphinx.ext.napoleon

jameslamb (Collaborator, Author)

I'm going to try disabling this by setting autodoc_typehints = "none" in conf.py

Don't forget to bump the minimum required Sphinx version then.

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#confval-autodoc_typehints

needs_sphinx = '1.3' # Due to sphinx.ext.napoleon

Oh thanks! Didn't know about that. I thought the version was only managed in

sphinx >= 3.2.2

I'll update that. But happy to say the builds on RTD did pass, and the type hints were successfully hidden: https://lightgbm.readthedocs.io/en/docs-jlamb/pythonapi/lightgbm.DaskLGBMClassifier.html#lightgbm.DaskLGBMClassifier

[screenshot: DaskLGBMClassifier docs rendered with type hints hidden from the signature]
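A hedged sketch of the kind of docs/conf.py change discussed in this exchange; the exact version floor and settings in the final commit may differ (assumption: the autodoc_typehints option was introduced in Sphinx 2.1).

```python
# docs/conf.py (sketch)

# autodoc_typehints needs a newer Sphinx than the old '1.3' floor
# (assumption: the option was introduced in Sphinx 2.1).
needs_sphinx = '2.1'

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon',
]

# Keep type hints out of the rendered signatures; parameter types remain
# documented in the numpydoc-style docstrings.
autodoc_typehints = 'none'
```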

@jameslamb jameslamb mentioned this pull request Jan 28, 2021
StrikerRUS (Collaborator) left a comment:

Not sure my comments will be useful, but anyway: 🙂

(10 inline review threads on python-package/lightgbm/dask.py, now resolved)
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
StrikerRUS (Collaborator) left a comment:

LGTM except two comments below and one unresolved question from the previous review: #3866 (comment).

Thanks a lot!

(2 inline review threads on python-package/lightgbm/dask.py, now resolved)
StrikerRUS (Collaborator)

@jameslamb
Well, I see this branch is in sync with master and there is a Dask teardown error in the regular job:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8861&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf

[screenshot of the Dask teardown error in the Azure Pipelines log]

I'm not re-running the failed job, so that you can collect any details from the log that you may want to get familiar with.

According to #3829 (comment), I'm going to increase the timeout value for the Dask tests.

jameslamb (Collaborator, Author)

Well, I see this branch is in sync with master and there is a Dask teardown error in the regular job. [...] According to #3829 (comment), I'm going to increase the timeout value for the Dask tests.

Thanks. I moved the discussion over to #3829 and re-opened it.

I pushed an empty commit here to re-run CI, since I don't have rights to re-run jobs in Azure.

jameslamb (Collaborator, Author)

OK, since #3866 (comment) has been moved to #3881, I think this PR can be merged. Thanks for the reviews!

@jameslamb jameslamb merged commit ea8e47e into master Jan 29, 2021
github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023