Principle:Haosulab ManiSkill GPU Parallelized Rollout
| Field | Value |
|---|---|
| principle_name | Haosulab_ManiSkill_GPU_Parallelized_Rollout |
| overview | Collecting environment transitions in parallel across GPU-simulated environments for RL training data |
| domains | Simulation, Reinforcement_Learning, Robotics |
| last_updated | 2026-02-15 |
| related_pages | Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset |
Overview
Description
GPU parallelized rollout collection is the data-gathering phase of the RL training loop. In this phase, the agent interacts with hundreds or thousands of environments simultaneously, producing a large batch of transitions (observation, action, reward, next_observation, done) that are used to compute policy gradients and update the agent's parameters.
The fundamental mechanism relies on batched step/reset operations: a single call to env.step(actions) advances all parallel environments by one timestep simultaneously on the GPU. The action tensor has shape (num_envs, action_dim), and all returned tensors (observations, rewards, termination flags) are similarly batched. This contrasts with CPU-based vectorization where each environment is stepped independently in separate processes.
Key aspects of GPU-parallelized rollout collection:
Batched Stepping: The step() method accepts a batch of actions and returns batched results. Internally, the GPU physics engine processes all environments in a single kernel launch, achieving massive throughput. A typical configuration with 512 environments and 50 steps per rollout collects 25,600 transitions per iteration.
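The batched contract can be illustrated with a toy vectorized environment whose entire state lives in one tensor (a minimal stand-in for illustration, not the ManiSkill API; shapes match the description above):

```python
import torch

class ToyVecEnv:
    """Toy stand-in for a GPU-parallelized env: all state is one batched
    tensor, and a single step() call advances every environment at once."""
    def __init__(self, num_envs, obs_dim, act_dim, device="cpu"):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
        self.state = torch.zeros(num_envs, obs_dim, device=device)

    def step(self, actions):
        # actions: (num_envs, act_dim) -- one call steps all environments.
        assert actions.shape == (self.num_envs, self.act_dim)
        self.state = self.state + actions.mean(dim=1, keepdim=True)
        rewards = -self.state.norm(dim=1)        # (num_envs,)
        terminated = rewards < -10.0             # (num_envs,) bool flags
        return self.state, rewards, terminated

env = ToyVecEnv(num_envs=512, obs_dim=42, act_dim=8)
obs, rew, term = env.step(torch.randn(512, 8))
```

Every returned tensor carries the `num_envs` leading dimension, so downstream buffer writes are single batched indexing operations.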
Partial Reset Support: When individual environments terminate (due to task success, failure, or time limit), only those specific environments are reset while others continue. This is handled by the ManiSkillVectorEnv wrapper, which:
- Detects which environments have `terminated=True` or `truncated=True`
- Saves the final observation for correct value bootstrapping
- Resets only those environments via `reset(options={"env_idx": done_indices})`
- Returns the post-reset observation seamlessly
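The detect-save-reset sequence can be sketched with plain tensor operations (a sketch; the zero-fill stands in for the actual post-reset observation the wrapper would return):

```python
import torch

num_envs, obs_dim = 8, 4
obs = torch.randn(num_envs, obs_dim)
terminated = torch.tensor([0, 1, 0, 0, 1, 0, 0, 0], dtype=torch.bool)
truncated  = torch.tensor([0, 0, 0, 1, 0, 0, 0, 0], dtype=torch.bool)

# Which environments finished this step (either way)?
done = terminated | truncated
done_indices = done.nonzero(as_tuple=False).squeeze(-1)

# Save the final observations of the done envs before overwriting them,
# so the value function can still be bootstrapped from them.
final_obs = obs[done_indices].clone()

# Reset only the done envs; the other envs keep running untouched.
# (Stand-in: with ManiSkill this is reset(options={"env_idx": done_indices}).)
obs[done_indices] = 0.0
```

Only the rows in `done_indices` are touched; the remaining environments continue their episodes without interruption.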
Rollout Buffer Management: During collection, transitions are stored in pre-allocated GPU tensors:
- `obs[step, env_idx]` -- observations at each step for each environment
- `actions[step, env_idx]` -- actions taken
- `logprobs[step, env_idx]` -- log-probabilities of actions under the collection policy
- `rewards[step, env_idx]` -- rewards received (optionally scaled)
- `dones[step, env_idx]` -- done flags (logical OR of terminated and truncated)
- `values[step, env_idx]` -- critic value estimates
These buffers remain on GPU memory throughout, avoiding costly CPU-GPU data transfers.
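A minimal sketch of the pre-allocation (dimension values are the typical 512-env, 50-step configuration mentioned above; the device falls back to CPU when no GPU is present):

```python
import torch

num_steps, num_envs, obs_dim, act_dim = 50, 512, 42, 8
device = "cuda" if torch.cuda.is_available() else "cpu"

# Allocate the whole rollout once up front; writing buf[step] = ... each
# timestep then involves no per-step allocation and no CPU<->GPU copies.
obs      = torch.zeros(num_steps, num_envs, obs_dim, device=device)
actions  = torch.zeros(num_steps, num_envs, act_dim, device=device)
logprobs = torch.zeros(num_steps, num_envs, device=device)
rewards  = torch.zeros(num_steps, num_envs, device=device)
dones    = torch.zeros(num_steps, num_envs, device=device)
values   = torch.zeros(num_steps, num_envs, device=device)
```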
Final Value Bootstrapping: When an environment terminates mid-rollout due to truncation (time limit), the value estimate of the final observation must be used for correct return computation. The final_values tensor tracks these bootstrap values, which are incorporated during Generalized Advantage Estimation (GAE).
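The bootstrap logic can be sketched as a backward GAE pass in the style of CleanRL-derived PPO baselines (a sketch with toy tensors; `final_values` re-adds `V(s_final)` at truncation boundaries, while it stays zero for genuine terminations):

```python
import torch

num_steps, num_envs = 4, 2
gamma, gae_lambda = 1.0, 1.0          # toy values so the result is easy to check

rewards      = torch.ones(num_steps, num_envs)
values       = torch.zeros(num_steps, num_envs)   # critic estimates V(s_t)
dones        = torch.zeros(num_steps, num_envs)   # done flag entering step t
final_values = torch.zeros(num_steps, num_envs)   # V(s_final) where truncated, else 0
next_done    = torch.zeros(num_envs)
next_value   = torch.zeros(num_envs)              # V of the obs after the last step

advantages = torch.zeros_like(rewards)
lastgaelam = torch.zeros(num_envs)
for t in reversed(range(num_steps)):
    if t == num_steps - 1:
        next_nonterminal, nextvalues = 1.0 - next_done, next_value
    else:
        next_nonterminal, nextvalues = 1.0 - dones[t + 1], values[t + 1]
    # Where an env finished, nextvalues is masked out; final_values re-adds
    # the bootstrap V(s_final) for truncated episodes only.
    real_next_values = next_nonterminal * nextvalues + final_values[t]
    delta = rewards[t] + gamma * real_next_values - values[t]
    lastgaelam = delta + gamma * gae_lambda * next_nonterminal * lastgaelam
    advantages[t] = lastgaelam
returns = advantages + values
```

With constant reward 1 and zero values, the advantage at step `t` is simply the number of remaining steps, which makes the recursion easy to verify by hand.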
Usage
Use GPU-parallelized rollout collection during the data-gathering phase of on-policy RL algorithms (PPO, A2C, TRPO). The rollout phase alternates with the policy update phase in a training loop:
- Rollout Phase: For `num_steps` timesteps, step all environments in parallel, collecting transitions into the rollout buffer
- Advantage Computation: Compute GAE advantages and returns from the collected data
- Update Phase: Optimize the policy using the collected batch (see PPO Policy Optimization)
The rollout collection operates with the agent in evaluation mode (agent.eval()) and with gradients disabled (torch.no_grad()) for maximum throughput.
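A skeleton of the collection loop with toy stand-ins for the agent and environment (a sketch; the real loop would call `env.step` on a `ManiSkillVectorEnv` and query a PPO actor-critic instead of a bare linear layer):

```python
import torch

num_steps, num_envs, obs_dim, act_dim = 50, 16, 4, 2
agent = torch.nn.Linear(obs_dim, act_dim)   # stand-in for the policy network

obs_buf = torch.zeros(num_steps, num_envs, obs_dim)
act_buf = torch.zeros(num_steps, num_envs, act_dim)

next_obs = torch.zeros(num_envs, obs_dim)
agent.eval()                      # collection runs in evaluation mode
with torch.no_grad():             # no gradients tracked during rollout
    for step in range(num_steps):
        obs_buf[step] = next_obs
        actions = agent(next_obs)            # batched policy forward pass
        act_buf[step] = actions
        # Stand-in for env.step(actions) advancing all envs at once:
        next_obs = next_obs + 0.01 * actions.sum(-1, keepdim=True)
```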
Theoretical Basis
Vectorized Environment Stepping: The key insight behind GPU-parallelized rollout is that each environment instance is independent -- its physics state does not depend on other environments. This embarrassingly parallel structure maps naturally to GPU execution, where thousands of CUDA threads can simultaneously compute physics for different environments.
On-Policy Data Collection: PPO is an on-policy algorithm, meaning it requires fresh data collected by the current policy for each update. The rollout buffer stores exactly one rollout (of length num_steps) before it is consumed by the optimization step and discarded. This differs from off-policy methods (SAC, TD3) that maintain a replay buffer of historical data.
Partial Reset and Episode Boundary Handling: In vectorized environments, episode boundaries occur asynchronously across environments. Correct handling requires:
- Storing the final observation of completed episodes separately from the post-reset observation
- Computing the bootstrap value at truncation boundaries: `V(s_final)` for truncated episodes, `0` for terminated episodes
- Using the done mask to correctly cut advantage propagation across episode boundaries
Action Clipping: Actions sampled from the Gaussian policy can exceed the environment's action space bounds. Before stepping the environment, actions are clamped to [action_space.low, action_space.high] to prevent the physics simulator from receiving invalid inputs.
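The clamp is a single batched tensor operation (a sketch with hypothetical bounds; `torch.clamp` broadcasts per-dimension tensor limits across the batch):

```python
import torch

# Per-dimension action bounds, as exposed by the env's action space.
low  = torch.tensor([-1.0, -1.0, -0.5])
high = torch.tensor([ 1.0,  1.0,  0.5])

# Gaussian policy samples may land outside the bounds.
raw = torch.tensor([[ 1.7, -0.3, -2.0],
                    [-1.2,  0.9,  0.4]])

# Clamp every action in the batch into [low, high] before env.step().
clipped = torch.clamp(raw, low, high)
```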
| Buffer | Shape | Device | Description |
|---|---|---|---|
| obs | (num_steps, num_envs, obs_dim) | GPU | Observations at each step |
| actions | (num_steps, num_envs, act_dim) | GPU | Actions taken at each step |
| logprobs | (num_steps, num_envs) | GPU | Action log-probabilities |
| rewards | (num_steps, num_envs) | GPU | Rewards received (scaled) |
| dones | (num_steps, num_envs) | GPU | Episode done flags |
| values | (num_steps, num_envs) | GPU | Critic value estimates |
| final_values | (num_steps, num_envs) | GPU | Bootstrap values at truncation |
Related Pages
- Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset -- The underlying batched step/reset operations
- Principle:Haosulab_ManiSkill_Vectorized_Environment_Wrapping -- How environments are wrapped for auto-reset
- Principle:Haosulab_ManiSkill_PPO_Agent_Architecture -- The agent that produces actions during rollout
- Principle:Haosulab_ManiSkill_PPO_Policy_Optimization -- How collected rollout data is used for policy updates
- Heuristic:Haosulab_ManiSkill_Physics_Solver_Tuning
- Heuristic:Haosulab_ManiSkill_Num_Envs_Backend_Selection