
Allow to set a device when loading a model #154

Merged

Conversation

@leor-c (Contributor) commented Sep 1, 2020

Added a 'device' keyword argument to BaseAlgorithm.load(), so users can load the model onto their device of choice.
Edited test_save_load to also test the load method with all possible devices.
Added the changes to the changelog (I'm not completely confident about the wording, though).

Description

Added a 'device' keyword argument with a default value of 'auto', matching the hard-coded value previously used in BaseAlgorithm.load(), and forwarded it to the constructor call inside the load method (replacing the hard-coded string).
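The change described above can be sketched as follows. This is a minimal illustration of the signature change only, not the actual stable-baselines3 implementation; the constructor body is a simplified stand-in.

```python
import torch as th

class BaseAlgorithm:
    def __init__(self, device="auto"):
        # Simplified device resolution: "auto" picks CUDA when available.
        if device == "auto":
            device = "cuda" if th.cuda.is_available() else "cpu"
        self.device = th.device(device)

    @classmethod
    def load(cls, path, device="auto"):
        # Before the PR, the constructor call here used a hard-coded
        # "auto"; now the caller's choice is forwarded instead.
        return cls(device=device)

# A user who wants a small model on the CPU can now force it:
model = BaseAlgorithm.load("model.zip", device="cpu")
assert model.device.type == "cpu"
```

The path argument is illustrative; in the real method it points at the saved zip archive.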

Motivation and Context

I'm using this repo for my research, and when I work with simpler models (a tiny MLP, for example) it is actually faster to use the CPU, even though my machine has a powerful GPU, since the overhead of the GPU calls outweighs the benefits.
Thus, when I load the models I'm training, I need an easy way to force them onto the CPU.
Since the code is already there, but the device is currently chosen via a hard-coded string inside the load method, I'm suggesting this small but significant change (:

Closes #153

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

Edited the save and load test to also test the load method with all possible devices.
Added the changes to the changelog
# check if params are still the same after load
new_params = model.policy.state_dict()
# Check if the model loads as expected for every possible choice of device:
for device in ["auto", "cpu", "cuda"]:
@leor-c (Contributor, Author) commented Sep 2, 2020

I noticed that the git diff looks quite messy, so I'm elaborating on my changes here to ease the review process for you:
The actual change I made is the added 'for' loop that iterates over all possible devices; at each iteration, the device parameter is passed to the call of 'load' (line 76). At the end of each iteration I delete the model (line 92) so it can be loaded cleanly in the next iteration.
Everything else is the same as before, i.e., I reused the exact same test (inside the new 'for' loop) to ensure proper loading, now run with every possible value of the new 'device' argument.

Member commented:

It seems that you are actually not testing that the device parameter was successfully used.
Also, you should skip the cuda device if no GPU is available.
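The reviewer's second point, skipping the CUDA case on machines without a GPU, can be sketched like this. In a real pytest suite one would call pytest.skip for the excluded case; the DEVICES list and the cuda_unavailable helper are illustrative names, not the actual test code.

```python
import torch as th

DEVICES = ["auto", "cpu", "cuda"]

def cuda_unavailable(device: str) -> bool:
    # True when the test asks for CUDA but the machine has no GPU.
    return device == "cuda" and not th.cuda.is_available()

# Only exercise the devices that can actually run on this machine;
# in pytest, the excluded case would be reported via pytest.skip().
testable = [d for d in DEVICES if not cuda_unavailable(d)]
assert "cpu" in testable and "auto" in testable
```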

@leor-c (Contributor, Author) commented Sep 2, 2020

  1. You're right. I will work on improving the test.
  2. What should be the expected behavior when a user passes device='cuda' on a machine with no GPU?
    I noticed that the constructor defaults to the CPU in that case without notifying the user.
    In any case, I think the test should cover all possible inputs while verifying that the outcome matches your expectations. Do you agree?

@leor-c (Contributor, Author) commented Sep 2, 2020

In my test I used the utils.get_device() function (which is also used inside the constructor) to determine the expected device. This way, if the behavior of get_device changes in the future, the test won't break.

@leor-c leor-c requested a review from araffin September 2, 2020 18:37
tests/test_save_load.py (review thread, outdated/resolved)
@araffin's suggestion during the PR process

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
@araffin (Member) commented Sep 2, 2020

Running on GPU, it yields this error:

            # Check that all params are the same as before save load procedure now
            for key in params:
>               assert th.allclose(params[key], new_params[key]), "Model parameters not the same after save and load."
E               RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

tests/test_save_load.py:87: RuntimeError

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
@leor-c (Contributor, Author) commented Sep 2, 2020

Thanks for all your help! I'll look into it.

…et_device() doesn't provide device index.

Now the code loads all of the model parameters from the saved state dict directly onto the required device (fixed load_from_zip_file).
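Loading saved tensors directly onto the requested device is what torch.load's map_location argument does; a minimal sketch of the idea, where an in-memory buffer stands in for the entry inside the saved zip archive:

```python
import io
import torch as th

device = th.device("cpu")  # the device requested by the user

# Save a small state dict, then load it straight onto the target
# device, so no parameter ends up somewhere else by accident.
buffer = io.BytesIO()
th.save({"weight": th.ones(2, 2)}, buffer)
buffer.seek(0)
state_dict = th.load(buffer, map_location=device)

assert state_dict["weight"].device.type == "cpu"
```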
@leor-c (Contributor, Author) commented Sep 4, 2020

When comparing the devices in the test, I restored the comparison of device types only, since the get_device() function doesn't fill in the device index, which causes problems.

@@ -352,6 +352,7 @@ def load_from_pkl(path: Union[str, pathlib.Path, io.BufferedIOBase], verbose=0)
def load_from_zip_file(
load_path: Union[str, pathlib.Path, io.BufferedIOBase],
load_data: bool = True,
device: Union[th.device, str] = "auto",
@leor-c (Contributor, Author) commented:

Does the order of the arguments here make sense? I'm not sure whether I should have added the new argument last, for cases where users didn't use explicit keyword arguments.
On the other hand, I think it makes more sense for it to come before 'verbose'...

@leor-c leor-c requested a review from araffin September 4, 2020 13:20
@leor-c (Contributor, Author) commented Sep 4, 2020

When comparing the devices in the test, I restored the comparison of device types only, since the get_device() function doesn't fill in the device index, which causes problems.

Now I'm observing another concerning issue related to this: on my GPU-capable machine, test_predict.test_predict fails on the same assertion (assert get_device(device) == model.policy.device) for the same reason: the get_device function doesn't fill in the index of the device. This behavior appears on the master branch as well...

        # Test detection of different shapes by the predict method
        model = model_class("MlpPolicy", env_id, device=device)
        # Check that the policy is on the right device
>       assert get_device(device) == model.policy.device
E       AssertionError: assert device(type='cuda') == device(type='cuda', index=0)

test_predict.py:49: AssertionError
FAILED               [ 43%]  tests\test_predict.py:32 (test_predict[cuda-Pendulum-v0-TD3])
device(type='cuda') != device(type='cuda', index=0)

@araffin (Member) commented Sep 15, 2020

doesn't fill the index of the device. This behavior appears on the master branch as well...

I see... but you could easily fix that by passing device="cuda:0" in the tests, no?

@leor-c (Contributor, Author) commented Sep 16, 2020

It doesn't fix the issue, unfortunately: get_device() still doesn't fill in the index. If you look at the implementation of this function, you'll see that in some cases, like "auto", get_device() uses a hard-coded string with no index.
Also, it's a different test case ("cuda:0" vs "cuda"); I think the behavior of get_device() should be tested in both cases, but that is not directly related to my issue...
The test that fails on the master branch has nothing to do with my changes (obviously...), so I'm not sure it's right to fix it in this branch. I'd rather open a bug for this issue, dive into it, and probably fix get_device() as well.

@araffin (Member) commented Sep 16, 2020

It doesn't fix the issue, unfortunately: get_device() still doesn't fill in the index. If you look at the implementation of this function, you'll see that in some cases, like "auto", get_device() uses a hard-coded string with no index.
Also, it's a different test case ("cuda:0" vs "cuda"); I think the behavior of get_device() should be tested in both cases, but that is not directly related to my issue...
The test that fails on the master branch has nothing to do with my changes (obviously...), so I'm not sure it's right to fix it in this branch. I'd rather open a bug for this issue, dive into it, and probably fix get_device() as well.

get_device("cuda:0") does fill the index, but get_device("auto") doesn't (which is normal); however, L148 of utils.py needs to be updated (to compare using .type).
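The index mismatch under discussion comes from how torch compares device objects; comparing only the .type attribute, as suggested here for utils.py, sidesteps it. A minimal illustration (this runs without a GPU, since constructing a CUDA device object requires no CUDA):

```python
import torch as th

# These two describe the same physical device, yet compare unequal,
# which is exactly what broke the assertion in the tests.
plain = th.device("cuda")
indexed = th.device("cuda", 0)
assert plain != indexed
assert plain.index is None and indexed.index == 0

# Comparing only the device *type* treats them as equivalent.
assert plain.type == indexed.type == "cuda"
```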

@araffin (Member) commented Sep 16, 2020

Since it is fast and easy to fix, please update the tests to use .type in the assertion too (even though it's not directly related to the code you added), and also update L148 of utils.py ;).

leor-c and others added 2 commits September 20, 2020 13:47
…dated the assertion to consider only types of devices. Also corrected a related bug in 'get_device()' method.
@araffin (Member) left a comment:

LGTM, thanks =)

@araffin araffin merged commit f5104a5 into DLR-RM:master Sep 20, 2020
Successfully merging this pull request may close these issues.

[feature request] Allow to set the device when loading a model