Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HER #8

Closed
araffin opened this issue May 9, 2020 · 16 comments · Fixed by #120
Closed

Implement HER #8

araffin opened this issue May 9, 2020 · 16 comments · Fixed by #120
Labels
enhancement New feature or request
Milestone

Comments

@araffin
Copy link
Member

araffin commented May 9, 2020

EDIT: we might want to support Dict obs for VecNormalize (maybe another issue)

@araffin araffin added the enhancement New feature or request label May 9, 2020
@araffin araffin added this to the v1.0 milestone May 9, 2020
@ferreirajoaouerj
Copy link

Maybe worth it to implement similar to baselines2. Create an environment wrapper to return concatenated states and goals and then wrap the buffer, so that goal relabeling is applied at the end of the episode.

@araffin
Copy link
Member Author

araffin commented May 11, 2020

Maybe worth it to implement similar to baselines2. Create an environment wrapper to return concatenated states and goals and then wrap the buffer, so that goal relabeling is applied at the end of the episode.

In fact, I would prefer a separate algorithm that does not require a wrapper, even though it means a bit of code duplication. The SB2 solution works but is quite messy and requires a lot of data transformation.

@arjun-kg
Copy link

I'm working on something similar. Can I take a shot at this? With DDPG (and maybe SAC)

@araffin
Copy link
Member Author

araffin commented May 18, 2020

Yes you can, but you just have to assume the model derives from the off policy class and therefore has a replay buffer.
it should not be specific to td3/dqn/sac.

i may also have a student on that one (cf roadmap) but her code needs to be adapted a bit. i would like also to compare performances when the transitions are added at sampling time (cf openai baselines code).

@arjun-kg
Copy link

arjun-kg commented Jun 3, 2020

I've created a PR for a HER implementation (#42). It's basically reusing elements from the existing SB implementation (with some minor edits). It works for the existing off-policy algos. Please check if this is usable. I'll take a shot at removing wrappers and minimizing data transformations next. This is my first PR, so I apologize if some things are not proper and do let me know if there is additional work to be done.

@arjun-kg
Copy link

arjun-kg commented Jun 3, 2020

Regarding adding HER transitions at sampling time, wouldn't that mean that the samples are more correlated than before, since HER samples would come from the same episode as the original transitions we sample. This, as opposed to uncorrelated samples while drawing randomly from a buffer filled with original + HER

@araffin
Copy link
Member Author

araffin commented Jun 3, 2020

Thank you for the PR.
I wanted to avoid doing an implementation as in SB2... Also it seems that you used the wrong branch as base. Maybe wait for the dqn to be merged first.

@araffin
Copy link
Member Author

araffin commented Jun 3, 2020

As mentioned before, I've got a student (@megan-klaiber ) that implemented HER (minimal implementation with an old version of SB3) as a programming test a while ago. If she agrees, we could use this implementation as a base.
Unfortunately, her contract did not start yet...

@araffin
Copy link
Member Author

araffin commented Jun 3, 2020

Regarding adding HER transitions at sampling time, wouldn't that mean that the samples are more correlated than before, since HER samples would come from the same episode as the original transitions we sample. This, as opposed to uncorrelated samples while drawing randomly from a buffer filled with original + HER

Take a look at Baselines implementation. The two should be equivalent

@megan-klaiber
Copy link
Collaborator

As mentioned before, I've got a student (@megan-klaiber ) that implemented HER (minimal implementation with an old version of SB3) as a programming test a while ago. If she agrees, we could use this implementation as a base.
Unfortunately, her contract did not start yet...

Yes, sure you can use the code. Unfortunately, I haven't had time to work on it yet.

@arjun-kg
Copy link

arjun-kg commented Jun 4, 2020

Thank you for the PR.
I wanted to avoid doing an implementation as in SB2... Also it seems that you used the wrong branch as base. Maybe wait for the dqn to be merged first.

Didn't realize this. Used master as base and updated the code and PR now

@arjun-kg
Copy link

arjun-kg commented Jun 4, 2020

As mentioned before, I've got a student (@megan-klaiber ) that implemented HER (minimal implementation with an old version of SB3) as a programming test a while ago. If she agrees, we could use this implementation as a base.
Unfortunately, her contract did not start yet...

Yes, sure you can use the code. Unfortunately, I haven't had time to work on it yet.

Thanks. I'll work on removing wrappers and reducing data transformations in the meantime, and continue updating. Will also look at baselines implementation for sampling

@araffin
Copy link
Member Author

araffin commented Jun 5, 2020

Thanks. I'll work on removing wrappers and reducing data transformations in the meantime, and continue updating. Will also look at baselines implementation for sampling

I'll try to push @megan-klaiber implementation next week, but I won't take at look at your PR before #28 and #35 are merged.
In the meantime, make sure to read the contributing guide and that all the tests passes.

@araffin
Copy link
Member Author

araffin commented Jun 8, 2020

You can find attached @megan-klaiber implementation. What I have in mind is a mix between SB2 implementation and her implementation.
To be more precise:

  • I would minimize transformations and avoid using a wrapper for the buffer, even if it means a bit of code duplication
  • I would keep the enum that was present in SB2 to define the type of goals
  • for the env wrapper, I would prefer a VecEnvWrapper instead of a gym.Wrapper, this would allow multiprocessing in the future. Also, once with have dict support for observation space (cf roadmap Roadmap to Stable-Baselines3 V1.0 #1 ), we will be able to remove that wrapper.

But again, please wait that #28 and #35 are merged before requesting review (that does not mean you cannot work on it).

her_megan_solution.zip

@arjun-kg
Copy link

arjun-kg commented Jun 9, 2020

Thank you for the code. Do you have a timeline in mind as to when you want to get HER done? I will look at improving it until then.

@araffin
Copy link
Member Author

araffin commented Jun 9, 2020

There is not fix timeline but at some point (in ~1 month and a half), we will have to use it. So, I may help you finish the implementation to get things done at that time.

@megan-klaiber megan-klaiber mentioned this issue Jul 23, 2020
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants