Principle:Haosulab ManiSkill GPU Parallelized Rollout

From Leeroopedia
Field Value
principle_name Haosulab_ManiSkill_GPU_Parallelized_Rollout
overview Collecting environment transitions in parallel across GPU-simulated environments for RL training data
domains Simulation, Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset

Overview

Description

GPU parallelized rollout collection is the data-gathering phase of the RL training loop. In this phase, the agent interacts with hundreds or thousands of environments simultaneously, producing a large batch of transitions (observation, action, reward, next_observation, done) that are used to compute policy gradients and update the agent's parameters.

The fundamental mechanism relies on batched step/reset operations: a single call to env.step(actions) advances all parallel environments by one timestep simultaneously on the GPU. The action tensor has shape (num_envs, action_dim), and all returned tensors (observations, rewards, termination flags) are similarly batched. This contrasts with CPU-based vectorization where each environment is stepped independently in separate processes.
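
The batched step contract can be sketched with a toy vectorized environment. The `ToyVectorEnv` class and its dynamics are invented for illustration; only the tensor shapes mirror the batched interface described above:

```python
import torch

# Toy vectorized "environment" whose state is one (num_envs, obs_dim) tensor.
# A single step() call advances every environment at once; there is no
# Python loop over environments.
class ToyVectorEnv:
    def __init__(self, num_envs, obs_dim, act_dim, device="cpu"):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
        self.state = torch.zeros(num_envs, obs_dim, device=device)

    def step(self, actions):
        assert actions.shape == (self.num_envs, self.act_dim)
        # Dummy dynamics: shift every state by the mean action component.
        self.state = self.state + actions.mean(dim=1, keepdim=True)
        obs = self.state
        rewards = -self.state.norm(dim=1)        # shape (num_envs,)
        terminated = rewards < -10.0             # shape (num_envs,), bool
        truncated = torch.zeros_like(terminated)
        return obs, rewards, terminated, truncated, {}

env = ToyVectorEnv(num_envs=512, obs_dim=8, act_dim=2)
actions = torch.randn(512, 2)                    # (num_envs, action_dim)
obs, rew, term, trunc, info = env.step(actions)  # all outputs batched
```

All returned tensors carry the leading `num_envs` dimension, which is the property the GPU simulator exploits.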

Key aspects of GPU-parallelized rollout collection:

Batched Stepping: The step() method accepts a batch of actions and returns batched results. Internally, the GPU physics engine processes all environments in a single kernel launch, achieving massive throughput. A typical configuration with 512 environments and 50 steps per rollout collects 25,600 transitions per iteration.

Partial Reset Support: When individual environments terminate (due to task success, failure, or time limit), only those specific environments are reset while others continue. This is handled by the ManiSkillVectorEnv wrapper, which:

  • Detects which environments have terminated=True or truncated=True
  • Saves the final observation for correct value bootstrapping
  • Resets only those environments via reset(options={"env_idx": done_indices})
  • Returns the post-reset observation seamlessly
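
The four steps above can be sketched as a small wrapper function. The name `step_with_partial_reset`, the `"final_observation"` info key, and the toy `reset()` signature are assumptions modeled on the description, not ManiSkill source code:

```python
import torch

# Hedged sketch of partial-reset handling in a vectorized environment wrapper.
def step_with_partial_reset(env, actions):
    obs, rewards, terminated, truncated, info = env.step(actions)
    done = terminated | truncated
    done_idx = done.nonzero(as_tuple=False).squeeze(-1)
    if done_idx.numel() > 0:
        # Save final observations so value bootstrapping uses the true
        # terminal state, not the post-reset state.
        info["final_observation"] = obs[done_idx].clone()
        # Reset only the finished environments; the others keep running.
        reset_obs, _ = env.reset(options={"env_idx": done_idx})
        obs = obs.clone()
        obs[done_idx] = reset_obs[done_idx]
    return obs, rewards, terminated, truncated, info
```

The caller sees a seamless stream of observations; episode boundaries surface only through the done flags and the saved final observation.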

Rollout Buffer Management: During collection, transitions are stored in pre-allocated GPU tensors:

  • obs[step, env_idx] -- observations at each step for each environment
  • actions[step, env_idx] -- actions taken
  • logprobs[step, env_idx] -- log-probabilities of actions under the collection policy
  • rewards[step, env_idx] -- rewards received (optionally scaled)
  • dones[step, env_idx] -- done flags (logical OR of terminated and truncated)
  • values[step, env_idx] -- critic value estimates

These buffers remain on GPU memory throughout, avoiding costly CPU-GPU data transfers.
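
A minimal sketch of this pre-allocation, using illustrative dimensions (the device falls back to CPU when no GPU is present):

```python
import torch

# Example rollout-buffer allocation; dimensions are illustrative values.
num_steps, num_envs, obs_dim, act_dim = 50, 512, 42, 8
device = "cuda" if torch.cuda.is_available() else "cpu"

obs      = torch.zeros(num_steps, num_envs, obs_dim, device=device)
actions  = torch.zeros(num_steps, num_envs, act_dim, device=device)
logprobs = torch.zeros(num_steps, num_envs, device=device)
rewards  = torch.zeros(num_steps, num_envs, device=device)
dones    = torch.zeros(num_steps, num_envs, device=device)
values   = torch.zeros(num_steps, num_envs, device=device)
```

Writing `obs[step] = next_obs` during collection is then an on-device copy with no host round-trip.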

Final Value Bootstrapping: When an episode ends mid-rollout due to truncation (hitting the time limit) rather than true termination, the value estimate of the final observation must be used to bootstrap the return computation. The final_values tensor tracks these bootstrap values, which are incorporated during Generalized Advantage Estimation (GAE).
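
Assuming the convention that final_values holds V(s_final) at truncation boundaries and zero elsewhere, truncation-aware GAE can be sketched as follows (the function name, indexing convention, and default hyperparameters are assumptions, not quoted from ManiSkill's code):

```python
import torch

# Hedged sketch of GAE with truncation-aware value bootstrapping.
# dones[t] marks episodes that ended before step t's observation was produced.
def compute_gae(rewards, values, dones, final_values, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = torch.zeros_like(rewards[0])
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            not_done = 1.0 - next_done
            next_values = next_value
        else:
            not_done = 1.0 - dones[t + 1]
            next_values = values[t + 1]
        # final_values[t] is V(s_final) at truncations and 0 elsewhere, so
        # the bootstrap term survives even though not_done is 0 there.
        real_next_values = not_done * next_values + final_values[t]
        delta = rewards[t] + gamma * real_next_values - values[t]
        advantages[t] = lastgaelam = delta + gamma * gae_lambda * not_done * lastgaelam
    returns = advantages + values
    return advantages, returns
```

The done mask both zeroes the bootstrap at true terminations and cuts advantage propagation across episode boundaries.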

Usage

Use GPU-parallelized rollout collection during the data-gathering phase of on-policy RL algorithms (PPO, A2C, TRPO). The rollout phase alternates with the policy update phase in a training loop:

  1. Rollout Phase: For num_steps timesteps, step all environments in parallel, collecting transitions into the rollout buffer
  2. Advantage Computation: Compute GAE advantages and returns from the collected data
  3. Update Phase: Optimize the policy using the collected batch (see PPO Policy Optimization)

The rollout collection operates with the agent in evaluation mode (agent.eval()) and with gradients disabled (torch.no_grad()) for maximum throughput.
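
The alternating structure can be sketched with a stand-in agent; `TinyAgent` and `get_action_and_value` are illustrative names for this sketch, not ManiSkill's API:

```python
import torch
import torch.nn as nn

# Minimal Gaussian-policy agent used only to illustrate the loop structure.
class TinyAgent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor_mean = nn.Linear(obs_dim, act_dim)
        self.critic = nn.Linear(obs_dim, 1)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def get_action_and_value(self, obs):
        mean = self.actor_mean(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), self.critic(obs).squeeze(-1)

num_envs, num_steps, obs_dim, act_dim = 8, 4, 3, 2
agent = TinyAgent(obs_dim, act_dim)
next_obs = torch.zeros(num_envs, obs_dim)

agent.eval()                     # rollout runs in evaluation mode...
with torch.no_grad():            # ...with gradients disabled for throughput
    for step in range(num_steps):
        action, logprob, value = agent.get_action_and_value(next_obs)
        # env.step(action) would go here; we fake the next observation.
        next_obs = torch.randn(num_envs, obs_dim)
# (advantage computation and the policy update phase would follow here)
```

Gradients are only re-enabled in the update phase, where the stored transitions are replayed through the policy.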

Theoretical Basis

Vectorized Environment Stepping: The key insight behind GPU-parallelized rollout is that each environment instance is independent -- its physics state does not depend on other environments. This embarrassingly parallel structure maps naturally to GPU execution, where thousands of CUDA threads can simultaneously compute physics for different environments.

On-Policy Data Collection: PPO is an on-policy algorithm, meaning it requires fresh data collected by the current policy for each update. The rollout buffer stores exactly one rollout (of length num_steps) before it is consumed by the optimization step and discarded. This differs from off-policy methods (SAC, TD3) that maintain a replay buffer of historical data.

Partial Reset and Episode Boundary Handling: In vectorized environments, episode boundaries occur asynchronously across environments. Correct handling requires:

  • Storing the final observation of completed episodes separately from the post-reset observation
  • Computing the bootstrap value at truncation boundaries: V(s_final) for truncated episodes, 0 for terminated episodes
  • Using the done mask to correctly cut advantage propagation across episode boundaries
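
The truncated-versus-terminated rule can be shown in isolation; `boundary_bootstrap` is a hypothetical helper written for this sketch:

```python
import torch

# Bootstrap with V(s_final) only when the episode was truncated (time limit),
# never when it truly terminated.
def boundary_bootstrap(terminated, truncated, v_final):
    # terminated/truncated: (num_envs,) bool flags at an episode boundary;
    # v_final: (num_envs,) critic values of the saved final observations.
    return torch.where(truncated & ~terminated, v_final,
                       torch.zeros_like(v_final))

term = torch.tensor([True, False, False])
trunc = torch.tensor([False, True, False])
v_final = torch.tensor([5.0, 3.0, 1.0])
print(boundary_bootstrap(term, trunc, v_final))  # tensor([0., 3., 0.])
```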

Action Clipping: Actions sampled from the Gaussian policy can exceed the environment's action space bounds. Before stepping the environment, actions are clamped to [action_space.low, action_space.high] to prevent the physics simulator from receiving invalid inputs.
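
A minimal clamping example (the bounds here are illustrative; real environments expose them as action_space.low and action_space.high):

```python
import torch

# Clamp sampled actions to the action-space bounds before env.step().
low = torch.tensor([-1.0, -1.0])
high = torch.tensor([1.0, 1.0])
sampled = torch.tensor([[1.7, -0.3],
                        [-2.2, 0.9]])          # (num_envs, action_dim)
clipped = torch.clamp(sampled, low, high)       # elementwise, broadcast bounds
```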

Rollout Buffer Dimensions
Buffer        Shape                           Device  Description
obs           (num_steps, num_envs, obs_dim)  GPU     Observations at each step
actions       (num_steps, num_envs, act_dim)  GPU     Actions taken at each step
logprobs      (num_steps, num_envs)           GPU     Action log-probabilities
rewards       (num_steps, num_envs)           GPU     Rewards received (scaled)
dones         (num_steps, num_envs)           GPU     Episode done flags
values        (num_steps, num_envs)           GPU     Critic value estimates
final_values  (num_steps, num_envs)           GPU     Bootstrap values at truncation

Related Pages

  • Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset