Reset PPO policy.log_std when loading previously saved model #155

Closed
kyrwilliams opened this issue Sep 3, 2020 · 7 comments
Labels
question (Further information is requested)

Comments

@kyrwilliams

When performing curriculum learning, being able to reset the ppo policy.log_std between training cycles would be nice. The following code will produce an error:

# Define RL model: randomized init
policy_kwargs = dict(log_std_init=0)
model = PPO(MlpPolicy, env=env1, policy_kwargs=policy_kwargs) 

# Learn and save
model.learn(total_timesteps=500000, tb_log_name='ppo')
model.save("ppo_model")

# Define RL model: preload network parameters
policy_kwargs = dict(log_std_init=-0.5)    
model = PPO.load(load_path="log_dir\ppo_1\ppo_model", env=env2, policy_kwargs=policy_kwargs)

# Learn and save again
model.learn(total_timesteps=500000, tb_log_name='ppo')
model.save("ppo_model")

Describe the bug
ValueError: The specified policy kwargs do not equal the stored policy kwargs.Stored kwargs

This error is thrown because log_std_init differs between the two training cycles.
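
For illustration, here is a minimal sketch (not SB3's actual implementation) of the kind of equality check that produces this error when policy_kwargs are passed to load:

# Hypothetical illustration of the check behind the error, not SB3's real code
stored_policy_kwargs = {"log_std_init": 0}       # saved alongside the model
provided_policy_kwargs = {"log_std_init": -0.5}  # passed to PPO.load

if provided_policy_kwargs and provided_policy_kwargs != stored_policy_kwargs:
    raise ValueError(
        "The specified policy kwargs do not equal the stored policy kwargs."
        f"Stored kwargs: {stored_policy_kwargs}, "
        f"specified kwargs: {provided_policy_kwargs}"
    )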

@Miffyli added the enhancement (New feature or request) label on Sep 3, 2020
@Miffyli
Collaborator

Miffyli commented Sep 3, 2020

@araffin
Probably just me being dumdum again, but why exactly does the code enforce that the provided policy_kwargs must match the saved ones when they are provided? I do not understand the argument's purpose if it can only be None or whatever is stored in the file. Allowing it to change parameters would allow this kind of tinkering, but it should still come with a big warning: "Using different policy_kwargs than stored! This may crash the code or result in undefined behaviour. Tread carefully!"

@kyrwilliams
#138 is working on simplifying getting/setting network parameters, but currently bit on hold as I am busy with deadlines.
Meanwhile you could try storing the network parameters with model.policy.state_dict() and then loading them as with any PyTorch model.
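
A minimal sketch of that workaround, assuming the standard deviation parameter is stored under the key "log_std" in the policy state dict (check model.policy.state_dict().keys()); dropping that key before loading keeps the fresh log_std_init value (note that this does not carry over the optimizer state):

import torch as th

# Save only the network weights from the trained model
th.save(model.policy.state_dict(), "ppo_policy_params.pth")

# Rebuild the model with the new log_std_init, then load every saved
# parameter except log_std so the fresh initialization is kept
new_model = PPO(MlpPolicy, env=env2, policy_kwargs=dict(log_std_init=-0.5))
saved_params = th.load("ppo_policy_params.pth")
saved_params.pop("log_std", None)
new_model.policy.load_state_dict(saved_params, strict=False)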

@kyrwilliams
Author

kyrwilliams commented Sep 8, 2020

Thanks @Miffyli! So, a couple things:

(1) I attempted the PyTorch save/load methods, manually resetting the log_std values with:

model.policy.log_std = th.nn.Parameter(th.tensor([-0.5, -0.5, -0.5], device='cuda:0', requires_grad=True))
model.learn(total_timesteps=500000, tb_log_name='ppo')

But unfortunately this just locked model.policy.log_std at -0.5 throughout the entire training.

(2) I found the following crude method DID work, since it uses the PPO class's .load method, which apparently updates the model in a specific way:

model = PPO.load(load_path="log_dir\ppo_1\ppo_model", env=env2) # load saved model
model.policy.log_std=th.nn.Parameter(th.tensor([-0.5, -0.5, -0.5], device='cuda:0', requires_grad=True)) # reset log_std
model.save("ppo_model_temp") # save this adjusted model
model = PPO.load(load_path="ppo_model_temp", env=env2) # load the adjusted model
model.learn(total_timesteps=500000, tb_log_name='ppo') # learn

This approach successfully reset the log_std to -0.5 and allowed the optimizer to adjust it during training.

@araffin
Member

araffin commented Sep 10, 2020

Hello,

Probably just me being dumdum again, but why exactly does the code enforce that the provided policy_kwargs must match the saved ones when they are provided?

We do that because you really need to know what is happening when you change those arguments between saving and loading.
This prevents most users from running into unexpected behavior.
Users that want to change those anyway can do so after loading, as @kyrwilliams mentioned, but it requires a good understanding of each RL algorithm.
You can find an example with SAC (when using gSDE) here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/e12a3019b57e11c876b6f875c5ff8c79a168c187/train.py#L569
Also, changing log_std in the policy kwargs won't work, as the value will be overwritten when loading the saved state dict.

(1) I attempted the PyTorch save/load methods, manually resetting the log_std values with:

You may need to register that parameter too and also check whether it is present in the optimizer (which I assume is not the case, given the result).
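
For example, a rough sketch of what that could look like (hypothetical; it assumes a Box action space and that the policy exposes its optimizer as model.policy.optimizer):

import torch as th

# Replace log_std with a fresh parameter; assigning an nn.Parameter to a
# module attribute re-registers it under the same name
new_log_std = th.nn.Parameter(
    th.full((env2.action_space.shape[0],), -0.5, device=model.device)
)
model.policy.log_std = new_log_std

# The existing optimizer still only tracks the old tensor, so hand the
# new parameter to it explicitly before continuing training
model.policy.optimizer.add_param_group({"params": [new_log_std]})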

@araffin added the question (Further information is requested) label and removed the enhancement (New feature or request) label on Sep 10, 2020
@Miffyli
Collaborator

Miffyli commented Sep 12, 2020

We do that because you really need to know what is happening when you change those arguments between saving and loading.
This prevents most users from running into unexpected behavior.

Hmm, ok. I think we could remove the parameter altogether in that case. I do not see why you would want to provide the same parameters again when they are already stored, and the only other option is to provide None. Another option would be to raise warnings when the parameters change, but then again, doing the modifications the way shown here is not that difficult either.

@araffin
Member

araffin commented Sep 15, 2020

I think we could remove the parameter altogether in that case.

Not sure I understand what you mean...
How would you do it then when you have a custom policy architecture?

@Miffyli
Collaborator

Miffyli commented Sep 15, 2020

Wouldn't that information (the custom policy pickled) be stored in the saved model as well? Or does it skip saving policy_kwargs to the file when using a custom policy?

@araffin
Member

araffin commented Sep 20, 2020

Wouldn't that information (the custom policy pickled) be stored in the saved model as well?

Yes, it is. However, if you want to continue training (with the zoo, for instance) and you have tried multiple configurations, checking the kwargs allows you to know whether the saved model has the network architecture that you expect.
