Principle:Haosulab ManiSkill GPU Parallelized Rollout

From Leeroopedia
Field Value
principle_name Haosulab_ManiSkill_GPU_Parallelized_Rollout
overview Collecting environment transitions in parallel across GPU-simulated environments for RL training data
domains Simulation, Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset

Overview

Description

GPU parallelized rollout collection is the data-gathering phase of the RL training loop. In this phase, the agent interacts with hundreds or thousands of environments simultaneously, producing a large batch of transitions (observation, action, reward, next_observation, done) that are used to compute policy gradients and update the agent's parameters.

The fundamental mechanism relies on batched step/reset operations: a single call to env.step(actions) advances all parallel environments by one timestep simultaneously on the GPU. The action tensor has shape (num_envs, action_dim), and all returned tensors (observations, rewards, termination flags) are similarly batched. This contrasts with CPU-based vectorization where each environment is stepped independently in separate processes.
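
The batched step contract can be sketched with a toy vectorized environment. The `ToyVectorEnv` class and its dynamics are invented for illustration; only the tensor shapes mirror the batched interface described above:

```python
import torch

# Toy vectorized "environment" whose state is one (num_envs, obs_dim) tensor.
# A single step() call advances every environment at once; there is no
# Python loop over environments.
class ToyVectorEnv:
    def __init__(self, num_envs, obs_dim, act_dim, device="cpu"):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
        self.state = torch.zeros(num_envs, obs_dim, device=device)

    def step(self, actions):
        assert actions.shape == (self.num_envs, self.act_dim)
        # Dummy dynamics: shift every state by the mean action component.
        self.state = self.state + actions.mean(dim=1, keepdim=True)
        obs = self.state
        rewards = -self.state.norm(dim=1)        # shape (num_envs,)
        terminated = rewards < -10.0             # shape (num_envs,), bool
        truncated = torch.zeros_like(terminated)
        return obs, rewards, terminated, truncated, {}

env = ToyVectorEnv(num_envs=512, obs_dim=8, act_dim=2)
actions = torch.randn(512, 2)                    # (num_envs, action_dim)
obs, rew, term, trunc, info = env.step(actions)  # all outputs batched
```

All returned tensors carry the leading `num_envs` dimension, which is the property the GPU simulator exploits.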

Key aspects of GPU-parallelized rollout collection:

Batched Stepping: The step() method accepts a batch of actions and returns batched results. Internally, the GPU physics engine processes all environments in a single kernel launch, achieving massive throughput. A typical configuration with 512 environments and 50 steps per rollout collects 25,600 transitions per iteration.

Partial Reset Support: When individual environments terminate (due to task success, failure, or time limit), only those specific environments are reset while others continue. This is handled by the ManiSkillVectorEnv wrapper, which:

  • Detects which environments have terminated=True or truncated=True
  • Saves the final observation for correct value bootstrapping
  • Resets only those environments via reset(options={"env_idx": done_indices})
  • Returns the post-reset observation seamlessly
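
The four steps above can be sketched as a small wrapper function. The name `step_with_partial_reset`, the `"final_observation"` info key, and the toy `reset()` signature are assumptions modeled on the description, not ManiSkill source code:

```python
import torch

# Hedged sketch of partial-reset handling in a vectorized environment wrapper.
def step_with_partial_reset(env, actions):
    obs, rewards, terminated, truncated, info = env.step(actions)
    done = terminated | truncated
    done_idx = done.nonzero(as_tuple=False).squeeze(-1)
    if done_idx.numel() > 0:
        # Save final observations so value bootstrapping uses the true
        # terminal state, not the post-reset state.
        info["final_observation"] = obs[done_idx].clone()
        # Reset only the finished environments; the others keep running.
        reset_obs, _ = env.reset(options={"env_idx": done_idx})
        obs = obs.clone()
        obs[done_idx] = reset_obs[done_idx]
    return obs, rewards, terminated, truncated, info
```

The caller sees a seamless stream of observations; episode boundaries surface only through the done flags and the saved final observation.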

Rollout Buffer Management: During collection, transitions are stored in pre-allocated GPU tensors:

  • obs[step, env_idx] -- observations at each step for each environment
  • actions[step, env_idx] -- actions taken
  • logprobs[step, env_idx] -- log-probabilities of actions under the collection policy
  • rewards[step, env_idx] -- rewards received (optionally scaled)
  • dones[step, env_idx] -- done flags (logical OR of terminated and truncated)
  • values[step, env_idx] -- critic value estimates

These buffers remain on GPU memory throughout, avoiding costly CPU-GPU data transfers.
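
A minimal sketch of this pre-allocation, using illustrative dimensions (the device falls back to CPU when no GPU is present):

```python
import torch

# Example rollout-buffer allocation; dimensions are illustrative values.
num_steps, num_envs, obs_dim, act_dim = 50, 512, 42, 8
device = "cuda" if torch.cuda.is_available() else "cpu"

obs      = torch.zeros(num_steps, num_envs, obs_dim, device=device)
actions  = torch.zeros(num_steps, num_envs, act_dim, device=device)
logprobs = torch.zeros(num_steps, num_envs, device=device)
rewards  = torch.zeros(num_steps, num_envs, device=device)
dones    = torch.zeros(num_steps, num_envs, device=device)
values   = torch.zeros(num_steps, num_envs, device=device)
```

Writing `obs[step] = next_obs` during collection is then an on-device copy with no host round-trip.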

Final Value Bootstrapping: When an episode ends mid-rollout due to truncation (hitting the time limit) rather than true termination, the value estimate of the final observation must be used to bootstrap the return computation. The final_values tensor tracks these bootstrap values, which are incorporated during Generalized Advantage Estimation (GAE).
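
Assuming the convention that final_values holds V(s_final) at truncation boundaries and zero elsewhere, truncation-aware GAE can be sketched as follows (the function name, indexing convention, and default hyperparameters are assumptions, not quoted from ManiSkill's code):

```python
import torch

# Hedged sketch of GAE with truncation-aware value bootstrapping.
# dones[t] marks episodes that ended before step t's observation was produced.
def compute_gae(rewards, values, dones, final_values, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = torch.zeros_like(rewards[0])
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            not_done = 1.0 - next_done
            next_values = next_value
        else:
            not_done = 1.0 - dones[t + 1]
            next_values = values[t + 1]
        # final_values[t] is V(s_final) at truncations and 0 elsewhere, so
        # the bootstrap term survives even though not_done is 0 there.
        real_next_values = not_done * next_values + final_values[t]
        delta = rewards[t] + gamma * real_next_values - values[t]
        advantages[t] = lastgaelam = delta + gamma * gae_lambda * not_done * lastgaelam
    returns = advantages + values
    return advantages, returns
```

The done mask both zeroes the bootstrap at true terminations and cuts advantage propagation across episode boundaries.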

Usage

Use GPU-parallelized rollout collection during the data-gathering phase of on-policy RL algorithms (PPO, A2C, TRPO). The rollout phase alternates with the policy update phase in a training loop:

  1. Rollout Phase: For num_steps timesteps, step all environments in parallel, collecting transitions into the rollout buffer
  2. Advantage Computation: Compute GAE advantages and returns from the collected data
  3. Update Phase: Optimize the policy using the collected batch (see PPO Policy Optimization)

The rollout collection operates with the agent in evaluation mode (agent.eval()) and with gradients disabled (torch.no_grad()) for maximum throughput.
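
The alternating structure can be sketched with a stand-in agent; `TinyAgent` and `get_action_and_value` are illustrative names for this sketch, not ManiSkill's API:

```python
import torch
import torch.nn as nn

# Minimal Gaussian-policy agent used only to illustrate the loop structure.
class TinyAgent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor_mean = nn.Linear(obs_dim, act_dim)
        self.critic = nn.Linear(obs_dim, 1)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def get_action_and_value(self, obs):
        mean = self.actor_mean(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), self.critic(obs).squeeze(-1)

num_envs, num_steps, obs_dim, act_dim = 8, 4, 3, 2
agent = TinyAgent(obs_dim, act_dim)
next_obs = torch.zeros(num_envs, obs_dim)

agent.eval()                     # rollout runs in evaluation mode...
with torch.no_grad():            # ...with gradients disabled for throughput
    for step in range(num_steps):
        action, logprob, value = agent.get_action_and_value(next_obs)
        # env.step(action) would go here; we fake the next observation.
        next_obs = torch.randn(num_envs, obs_dim)
# (advantage computation and the policy update phase would follow here)
```

Gradients are only re-enabled in the update phase, where the stored transitions are replayed through the policy.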

Theoretical Basis

Vectorized Environment Stepping: The key insight behind GPU-parallelized rollout is that each environment instance is independent -- its physics state does not depend on other environments. This embarrassingly parallel structure maps naturally to GPU execution, where thousands of CUDA threads can simultaneously compute physics for different environments.

On-Policy Data Collection: PPO is an on-policy algorithm, meaning it requires fresh data collected by the current policy for each update. The rollout buffer stores exactly one rollout (of length num_steps) before it is consumed by the optimization step and discarded. This differs from off-policy methods (SAC, TD3) that maintain a replay buffer of historical data.

Partial Reset and Episode Boundary Handling: In vectorized environments, episode boundaries occur asynchronously across environments. Correct handling requires:

  • Storing the final observation of completed episodes separately from the post-reset observation
  • Computing the bootstrap value at truncation boundaries: V(s_final) for truncated episodes, 0 for terminated episodes
  • Using the done mask to correctly cut advantage propagation across episode boundaries
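
The truncated-versus-terminated rule can be shown in isolation; `boundary_bootstrap` is a hypothetical helper written for this sketch:

```python
import torch

# Bootstrap with V(s_final) only when the episode was truncated (time limit),
# never when it truly terminated.
def boundary_bootstrap(terminated, truncated, v_final):
    # terminated/truncated: (num_envs,) bool flags at an episode boundary;
    # v_final: (num_envs,) critic values of the saved final observations.
    return torch.where(truncated & ~terminated, v_final,
                       torch.zeros_like(v_final))

term = torch.tensor([True, False, False])
trunc = torch.tensor([False, True, False])
v_final = torch.tensor([5.0, 3.0, 1.0])
print(boundary_bootstrap(term, trunc, v_final))  # tensor([0., 3., 0.])
```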

Action Clipping: Actions sampled from the Gaussian policy can exceed the environment's action space bounds. Before stepping the environment, actions are clamped to [action_space.low, action_space.high] to prevent the physics simulator from receiving invalid inputs.
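
A minimal clamping example (the bounds here are illustrative; real environments expose them as action_space.low and action_space.high):

```python
import torch

# Clamp sampled actions to the action-space bounds before env.step().
low = torch.tensor([-1.0, -1.0])
high = torch.tensor([1.0, 1.0])
sampled = torch.tensor([[1.7, -0.3],
                        [-2.2, 0.9]])          # (num_envs, action_dim)
clipped = torch.clamp(sampled, low, high)       # elementwise, broadcast bounds
```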

Rollout Buffer Dimensions
Buffer        Shape                           Device  Description
obs           (num_steps, num_envs, obs_dim)  GPU     Observations at each step
actions       (num_steps, num_envs, act_dim)  GPU     Actions taken at each step
logprobs      (num_steps, num_envs)           GPU     Action log-probabilities
rewards       (num_steps, num_envs)           GPU     Rewards received (scaled)
dones         (num_steps, num_envs)           GPU     Episode done flags
values        (num_steps, num_envs)           GPU     Critic value estimates
final_values  (num_steps, num_envs)           GPU     Bootstrap values at truncation

Related Pages

  • Implementation:Haosulab_ManiSkill_BaseEnv_Step_Reset