Vector API - design considerations #32

Closed
RedTachyon opened this issue Oct 1, 2022 · 5 comments

I started trying to implement a new, more sane vector API. And then I quickly realized that it is, indeed, as messy as I could have expected, so the code will have to wait for some time.

Here I want to dump my thoughts about how this whole thing should/could look so that we have a discussion going.

Main desired outcome: we can use the vector API to easily create envs vectorized through either simple vectorization or jax vmapping (or any other fast mechanism). This can give us huge performance improvements for some envs without relying on additional external libraries. For other envs, we default to Sync/Async/EnvPool?

Current situation: vectorization is only possible via Sync/Async, which is slow af, but very general. EnvPool (not officially supported) only works with some envs, but is faster. Other existing options are generally similar to Sync/Async, with their own quirks (e.g. ray in rllib, or the custom implementation in SB3)

The main complication is wrappers. If an environment provides its own optimized vectorized version, then we can't apply single-env wrappers to it. A nice solution would be an automatic conversion from a Wrapper to a VectorWrapper, but that seems either very tricky or impossible to do in a general case. Fortunately, many actual wrappers don't need that "general case" treatment.

The hope I see for this is switching to lambda wrappers, at least for some of the existing wrappers. ActionWrappers, ObservationWrappers and RewardWrappers can in principle be stateful, which requires some workarounds to map them over vectorized envs. With lambda wrappers, we can literally just do a map.
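
To make that concrete, here is a rough sketch of the idea (the wrapper name and map helper here are illustrative, not an existing gym API):

import numpy as np
import gymnasium as gym

# Illustrative "lambda" observation wrapper: it wraps a pure, stateless
# function, so the single-env and vector-env cases differ only by a map.
class LambdaObservation(gym.ObservationWrapper):
    def __init__(self, env, func):
        super().__init__(env)
        self.func = func

    def observation(self, obs):
        return self.func(obs)

# For a vector env with batched observations of shape (num_envs, *obs_shape),
# the same pure function just gets mapped over the batch dimension:
def vector_observation(func, batched_obs):
    return np.array([func(obs) for obs in batched_obs])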

An element that I think will be crucial is different levels of optimization - existing third-party environments and wrappers should work exactly the same way, with the clunky subprocess vecenv approach, unless they do a few extra things to opt-in for the improvements.

Another rough edge might be autoreset. Currently this concept is barely present in gym: it's an optional wrapper for single envs, and in that scope it works fine. In a vectorized case, it's more important and a bit more complicated. If we don't have some sort of autoreset by default in vector envs, that makes them borderline useless for many envs (consider cartpole where the first env instance happens to take 10 steps and the second takes 100 steps - if we only reset after both are terminated, we just lost 45% of the data).

While a vectorized autoreset is trivial with a subprocess-like vector env, that's not the case with e.g. numpy/jax acceleration. While I can see some hacks that maybe would kinda work to add it in some of these cases via wrapper, we might just have to add a requirement that the environment handles autoreset itself. Note that this wouldn't be a breaking change in env design - envs that don't have built-in autoreset can still use the standard vectorization. But if you want to use vectorized wrappers and the more efficient vectorization paradigm, you need to add it.
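
For concreteness, a minimal sketch of what "the environment handles autoreset itself" could look like in a numpy-batched env (all names here are hypothetical):

import numpy as np

class BatchedEnvWithAutoreset:
    # Sketch only: `_batched_step` and `_reset_indices` stand in for the
    # env's real batched dynamics and per-index reset logic.
    def step(self, actions):
        obs, rewards, terminations, truncations, infos = self._batched_step(actions)
        done = terminations | truncations
        if done.any():
            # Reset the finished sub-envs in place, so the caller always
            # receives live observations and never has to reset manually.
            obs[done] = self._reset_indices(np.where(done)[0])
        return obs, rewards, terminations, truncations, infos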

Finally, a question is - how much can we break? I'm not aware of any significant usage of gym.vector, though I know it is used at least sometimes. Ideally I'd like to keep the outside API as similar as possible, perhaps even exactly the same (with additional capabilities). But can we change some of the internal semantics that are in principle exposed to the public, but are also just one of the few remaining relics of the past? As I recall, we want to do the vector revamp before 1.0, which is good, because after 1.0 we have to be very careful about breaking stuff.


Below I'm including a braindump of my semi-structured thoughts on this, just to have it recorded here with some additional details (most of this was mentioned above):

  1. Each environment can implement its own VectorEnv, or use built-in Sync/Async
    1. If it implements its own, we can’t use individual wrappers - there’s no instance of gym.Env we can actually apply them to
    2. If it uses the built-in, then the VectorEnv contains several instances of gym.Env, to which individual wrappers are applied
  2. Each (?) wrapper should have single and vector mode - need to convert single to multi
    1. Should be trivial for:
      1. Observation wrapper - map observation
      2. Reward wrapper - map reward
      3. Action wrapper - map action
      4. EDIT - it's actually not trivial, needs lambda wrappers
    2. Need to have selectable/automatic optimization
      1. Jax envs/wrappers → vmap
      2. Pure numpy → nothing? or np.vectorize?
      3. Generic → np.array(map) or np.array([... for o in obs])
      4. Settable in the wrapper? self.optimization: Literal["numpy", "jax"] | None (see the sketch after this list)
    3. Some wrappers can’t be vectorized
      1. Atari preprocessing - needs to reset envs asynchronously
      2. Autoreset in general?
        1. We can require optimized envs to autoreset internally. Third-party envs will default to the regular vectorization, and they can opt-in for this
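
A rough sketch of what the selectable optimization in 2.2 could look like (the attribute and dispatch here are hypothetical design notes, not a settled API):

from typing import Callable, Literal, Optional

import numpy as np

def vectorize(func: Callable, optimization: Optional[Literal["numpy", "jax"]]) -> Callable:
    # Return a batched version of `func`, chosen by the declared optimization level.
    if optimization == "jax":
        import jax
        return jax.vmap(func)  # trace once, run batched on device
    if optimization == "numpy":
        return func  # assumed to already operate on batched arrays
    # Generic fallback: a plain Python map over the batch dimension.
    return lambda batch: np.array([func(x) for x in batch])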

Issues in the meantime:

  1. OrderEnforcing (and others?) accept arguments in render
  2. Atari wrapper
  3. Several typing errors in vector API

Questions:

  1. Can we break the whole vector API? Does anyone use it?
    1. SB3 and rllib definitely have their own
  2. (check myself) do we want vector API before 1.0?

@pseudo-rnd-thoughts commented Oct 3, 2022

An additional change would be when reset is called.
Currently, the Gym vector API implements it such that if a step is terminated or truncated, reset is called within the same step call. The original terminating step observation is added to the info as "final_observation".
A proposed solution is noted in this comment openai/gym#2279 (comment), which would be necessary if we wanted EnvPool to work alongside the Gymnasium vector API.
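
For reference, from the user's side the current behaviour looks roughly like this (vec_env and actions are assumed; the key name matches the current implementation):

obs, infos = vec_env.step(actions)
# When sub-env i finishes, obs[i] is already the first observation of the
# new episode; the terminating observation is only available via the info.
if "final_observation" in infos:
    for i, final_obs in enumerate(infos["final_observation"]):
        if final_obs is not None:
            ...  # e.g. bootstrap from final_obs for truncated episodes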

@MarcCote commented Nov 3, 2022

Hi, would it make sense for VectorEnv.reset(...) to accept a list of options similar to seeds, i.e. one options dict per env?
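
i.e. something like this (hypothetical, mirroring how a list of seeds already works):

# One options dict per sub-environment, analogous to seed=[...]:
obs, info = vec_env.reset(
    seed=[1, 2, 3],
    options=[{"difficulty": 0}, {"difficulty": 1}, {"difficulty": 2}],
)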

@sven1977

Thanks for your thoughts on this @MarcCote! Sven here from the RLlib team :)

  • Some update from RLlib wrt vectorization: We are actively looking into fully deprecating all of RLlib's in-house env APIs (VectorEnv, BaseEnv, etc., except for MultiAgentEnv), in favor of just supporting gymnasium (and its own vector option).
  • We are currently a few days away from merging our gymnasium support PR.

On the autoreset issue you mentioned: I would prefer autoreset (similar to how DeepMind has always done it), but then the per-episode seeding (and option'ing) would not be accessible anymore. If per-episode seeding and option'ing is really essential for users, then we won't be able to avoid having an additional reset_at(vector_idx: int, *, seed=None, options=None) method (like we will have in RLlib after moving to gymnasium in a few days).
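
Usage of that proposed method would look something like this (hypothetical, per the signature above):

# Reset only sub-env 2, with its own seed and options:
obs_2, info_2 = vec_env.reset_at(2, seed=123, options={"difficulty": 1})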

@pseudo-rnd-thoughts commented Dec 20, 2022

Thanks for your thoughts @sven1977

I'm unfamiliar with how DeepMind has achieved environment vectorisation; could you provide a short summary of the difference between DeepMind's approach and OpenAI's approach (or EnvPool)?

For being able to set the autoreset seed, this is an interesting idea that I hadn't heard before. I don't think we will include this in the VectorEnv definition, as we are interested in users being able to create custom vector environments with minimal requirements.
However, like @MarcCote's proposal, I would be interested in adding this as a special parameter of SyncVectorEnv or AsyncVectorEnv.

The current reset type definition for those classes is reset(self, seed: Optional[Union[int, list[int]]] = None, options: Optional[dict[str, Any]] = None) -> tuple[ObsType, dict[str, Any]]. As we already have list[int] for setting the initial reset seeds of the individual environments, I'm unsure what other options we have; using a callable is the best option I can think of, other than adding another function. Any other thoughts?
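
The callable idea would look something like this (purely illustrative, not an agreed API):

# The callable receives the sub-env index and returns that env's reset options,
# which would also cover autoresets triggered later during stepping:
obs, info = vec_env.reset(
    seed=[0, 1, 2],
    options=lambda env_idx: {"level": env_idx},
)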

@pseudo-rnd-thoughts

Discussion on the step function call order, which, if adopted, is a backward-compatibility-breaking change for both the AutoResetWrapper and the new SyncVectorEnv and AsyncVectorEnv. The proposed function call order is identical to the current implementation within EnvPool. This does not necessarily change the VectorEnv implementation.

@RedTachyon @vwxyzjn @araffin Would be interested in your thoughts

Consider a very simple training loop:

import gymnasium as gym

env = gym.make("CartPole-v1")

obs, info = env.reset()

for training_step in range(1_000):
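    # `policy` and `replay_buffer` are assumed to be defined elsewhere.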
    action = policy(obs)
    next_obs, reward, terminated, truncated, next_info = env.step(action)
    
    replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))
    
    if terminated or truncated:
        obs, info = env.reset()
    else:
        obs, info = next_obs, next_info

The AutoReset wrapper (and the Vector implementations) look very similar to this implementation. While the auto resetting removes the need for one if statement, I believe it requires adding a new if statement.

import gymnasium as gym

env = gym.make("CartPole-v1")
env = gym.wrappers.AutoResetWrapper(env)

obs, info = env.reset()

for training_step in range(1_000):
    action = policy(obs)
    next_obs, reward, terminated, truncated, next_info = env.step(action)
    
    if terminated or truncated:
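        # `next_obs` is already the first observation of the new episode; the
        # episode's true final observation is in next_info["final_observation"].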
        replay_buffer.append((obs, info, action, next_info["final_observation"], reward, terminated, truncated))
    else:
        replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))
     
    obs, info = next_obs, next_info

I propose the following change to the AutoReset wrapper and the vector implementations. The difference is when reset is called: in the current implementation, if a step ends the episode, reset is called within that same step call, and the old obs and info are packaged into the new info from reset. This PR changes it such that the reset happens in the next step following the episode's end.

import gymnasium as gym

env = gym.make("CartPole-v1")
env = gym.experimental.wrappers.AutoReset(env)

obs, info = env.reset()

for training_step in range(1_000):
    action = policy(obs)
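    # With next-step autoreset, the step after an episode ends ignores this
    # action and returns the new episode's first observation.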
    next_obs, reward, terminated, truncated, next_info = env.step(action)
    
    replay_buffer.append((obs, info, action, next_obs, reward, terminated, truncated))
    obs, info = next_obs, next_info
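
A minimal sketch of the wrapper logic implied by this proposal (illustrative only; the reward/termination values returned on the "reset step" are assumptions):

import gymnasium as gym

class NextStepAutoReset(gym.Wrapper):
    # Sketch: reset on the step *after* the episode ends, not within the
    # same step call that returned terminated/truncated.
    def __init__(self, env):
        super().__init__(env)
        self._needs_reset = False

    def step(self, action):
        if self._needs_reset:
            # The incoming action is ignored on this "dead" step.
            obs, info = self.env.reset()
            self._needs_reset = False
            return obs, 0.0, False, False, info
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._needs_reset = terminated or truncated
        return obs, reward, terminated, truncated, info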

Advantages

  1. Simplifies training code. Currently, you need to check whether the sub-environment's episode has ended, then collect the last observation and info (for the truncated case) to save to the replay buffer. With a similar change to the vector implementations, users shouldn't need to use info["final_observation"][env_num].

Disadvantage

  1. This is a large breaking change for old training code. This could be mitigated by renaming the current implementation to LegacySyncVectorEnv.
  2. If an episode has terminated, there is a "dead" observation whose action has to be ignored.
  3. It is only possible to convert "old" to "new", not "new" to "old".
