Workflow:Haosulab ManiSkill RL Training with PPO
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Robotics_Simulation, GPU_Parallelization |
| Last Updated | 2026-02-15 12:00 GMT |
Overview
End-to-end process for training reinforcement learning agents on ManiSkill GPU-parallelized robotics environments using Proximal Policy Optimization (PPO).
Description
This workflow covers the complete pipeline for training RL policies in ManiSkill's GPU-parallelized simulation framework. It leverages ManiSkill's ability to run hundreds or thousands of environments simultaneously on a single GPU via PhysX CUDA, enabling high-throughput training. The process covers environment configuration, vectorized environment wrapping, PPO agent setup with actor-critic networks, GPU-parallelized rollout collection, policy optimization, periodic evaluation, and model checkpointing. The trained policy is a neural network that maps state observations to continuous robot actions and is optimized with PPO's clipped surrogate objective.
Usage
Execute this workflow when you need to train a manipulation or locomotion policy from scratch using reinforcement learning, have access to an NVIDIA GPU with sufficient VRAM, and want to leverage GPU parallelism for fast training. This is appropriate when no demonstration data is available and the task has a well-defined reward function.
Execution Steps
Step 1: Environment Configuration
Select and configure a ManiSkill task environment with the appropriate observation mode, control mode, and simulation backend. The environment is created via the Gymnasium API with ManiSkill-specific parameters, including the number of parallel environments, the robot control mode (e.g., pd_joint_delta_pos), and rendering options. Select the GPU simulation backend for maximum training throughput.
Key considerations:
- Choose the appropriate control mode for your robot and task (e.g., pd_joint_delta_pos for manipulation)
- Set num_envs to a high value (e.g., 512 or 1024) for GPU-parallelized training
- Configure observation mode based on whether you need state-only or visual observations
- Set reconfiguration_freq if the task must rebuild the scene between episodes, for example to swap in different object models
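The configuration above can be sketched as follows. This is a minimal sketch, not the reference setup: the helper names `build_env_kwargs` and `make_train_env` are hypothetical, `PickCube-v1` is only an example task, and the keyword names (`num_envs`, `obs_mode`, `control_mode`, `sim_backend`, `render_mode`, `reconfiguration_freq`) are assumed to match the installed ManiSkill version.

```python
def build_env_kwargs(num_envs=512, obs_mode="state",
                     control_mode="pd_joint_delta_pos",
                     reconfiguration_freq=None):
    """Collect the keyword arguments passed to gym.make for a ManiSkill task."""
    kwargs = dict(
        num_envs=num_envs,          # number of parallel environments on the GPU
        obs_mode=obs_mode,          # "state" for state-only observations
        control_mode=control_mode,  # robot controller
        sim_backend="gpu",          # GPU simulation backend
        render_mode="rgb_array",    # only needed if videos are recorded later
    )
    # Only pass reconfiguration_freq when the caller explicitly asks for it.
    if reconfiguration_freq is not None:
        kwargs["reconfiguration_freq"] = reconfiguration_freq
    return kwargs


def make_train_env(env_id="PickCube-v1", **overrides):
    """Create the environment; requires gymnasium and mani_skill to be installed."""
    # Imported lazily so build_env_kwargs can be used without ManiSkill.
    import gymnasium as gym
    import mani_skill.envs  # noqa: F401  (registers ManiSkill tasks with Gymnasium)

    return gym.make(env_id, **build_env_kwargs(**overrides))
```

Keeping the keyword arguments in one helper makes it straightforward to create the evaluation environments later with the same settings but a smaller `num_envs`.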
Step 2: Vectorized Environment Wrapping
Wrap the environment with ManiSkill's GPU-optimized vector environment wrapper and any additional Gymnasium wrappers needed for training. This includes the ManiSkillVectorEnv wrapper for GPU batching, FlattenActionSpaceWrapper for flattening dictionary action spaces into a single action vector, and RecordEpisode for capturing evaluation videos.
Key considerations:
- Use ManiSkillVectorEnv for GPU-parallelized environments
- Apply FlattenActionSpaceWrapper if the action space is a dictionary rather than a single vector
- Set up separate evaluation environments with fewer parallel instances
- Configure video recording for evaluation episodes
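The wrapping order described above might look like the sketch below. It assumes ManiSkill is installed and that the import paths and argument names (`ignore_terminations`, `record_metrics`, `output_dir`, `save_trajectory`, `max_steps_per_video`, `video_fps`) match the installed version; these can differ between releases. The function name `wrap_for_training`, the default video directory, and the default step count are hypothetical.

```python
def wrap_for_training(envs, eval_envs, num_envs, num_eval_envs,
                      video_dir="runs/example/videos", num_eval_steps=50):
    """Apply the wrappers from this step to already-created ManiSkill envs."""
    # Lazy imports: these require gymnasium and mani_skill to be installed.
    import gymnasium as gym
    from mani_skill.utils.wrappers.flatten import FlattenActionSpaceWrapper
    from mani_skill.utils.wrappers.record import RecordEpisode
    from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv

    # Dictionary action spaces are flattened into one action vector.
    if isinstance(envs.action_space, gym.spaces.Dict):
        envs = FlattenActionSpaceWrapper(envs)
        eval_envs = FlattenActionSpaceWrapper(eval_envs)

    # Record videos for the evaluation environments only.
    eval_envs = RecordEpisode(
        eval_envs, output_dir=video_dir, save_trajectory=False,
        max_steps_per_video=num_eval_steps, video_fps=30,
    )

    # Training: terminations are respected, so finished envs reset independently.
    envs = ManiSkillVectorEnv(envs, num_envs,
                              ignore_terminations=False, record_metrics=True)
    # Evaluation: terminations are ignored, so every episode runs to full length.
    eval_envs = ManiSkillVectorEnv(eval_envs, num_eval_envs,
                                   ignore_terminations=True, record_metrics=True)
    return envs, eval_envs
```

The vector wrapper is applied last so that the other wrappers see the raw batched environment.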
Step 3: PPO Agent Initialization
Initialize the actor-critic neural network architecture and PPO hyperparameters. The agent consists of a shared feature extractor feeding into separate policy (actor) and value (critic) heads. Networks are initialized with orthogonal initialization and moved to the appropriate device.
Key considerations:
- Use orthogonal weight initialization for stable training
- Configure separate learning rates for actor and critic if needed
- Set appropriate values for gamma (discount factor), GAE lambda, and clip range
- Ensure network output dimensions match the environment action space
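A minimal PyTorch sketch of such an agent is shown below. The layer sizes, the Tanh activations, the initialization scales, the state-independent log standard deviation, and the names `Agent`, `layer_init`, and `make_optimizer` are illustrative assumptions rather than the exact architecture used by any particular baseline.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


def layer_init(layer, std=2 ** 0.5, bias_const=0.0):
    """Orthogonal weight initialization with a constant bias."""
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    """Shared feature extractor with separate actor and critic heads."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.features = nn.Sequential(
            layer_init(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, hidden)), nn.Tanh(),
        )
        # A small output scale keeps the initial action means close to zero.
        self.actor_mean = layer_init(nn.Linear(hidden, act_dim), std=0.01)
        self.critic = layer_init(nn.Linear(hidden, 1), std=1.0)
        # State-independent log standard deviation, one entry per action dim.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_value(self, obs):
        return self.critic(self.features(obs))

    def get_action_and_value(self, obs, action=None, deterministic=False):
        feats = self.features(obs)
        mean = self.actor_mean(feats)
        dist = Normal(mean, self.actor_logstd.expand_as(mean).exp())
        if action is None:
            # Deterministic mode returns the mean of the policy distribution.
            action = mean if deterministic else dist.sample()
        log_prob = dist.log_prob(action).sum(dim=1)
        entropy = dist.entropy().sum(dim=1)
        return action, log_prob, entropy, self.critic(feats)


def make_optimizer(agent, actor_lr=3e-4, critic_lr=3e-4):
    """Adam with separate parameter groups for the actor and critic sides."""
    actor_params = (list(agent.features.parameters())
                    + list(agent.actor_mean.parameters())
                    + [agent.actor_logstd])
    return torch.optim.Adam(
        [{"params": actor_params, "lr": actor_lr},
         {"params": agent.critic.parameters(), "lr": critic_lr}],
        eps=1e-5,
    )
```

In use, `obs_dim` and `act_dim` would be read from the wrapped environment's observation and action spaces, and the agent moved to the training device with `agent.to(device)`. Placing the shared features in the actor parameter group is one possible choice; a third group would work equally well.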
Step 4: GPU Parallelized Rollout Collection
Collect experience by running the current policy across all parallel environments simultaneously. Each rollout gathers a fixed number of steps (e.g., 50) from all environments, producing a large batch of transitions (observation, action, log-probability, reward, done flag, and value estimate). ManiSkill's partial reset mechanism allows finished environments to reset without waiting for all environments to complete.
Key considerations:
- Partial reset is critical for GPU training efficiency, allowing environments to reset independently
- Advantage estimation uses Generalized Advantage Estimation (GAE) across the rollout
- All data remains on GPU as torch tensors to avoid CPU-GPU transfer bottlenecks
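- Finite horizon correction should be applied for truncated episodes: when an episode ends because of the time limit rather than a true termination, bootstrap from the value estimate of its final observation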
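The advantage computation, including the correction for truncated episodes, can be sketched as below. This is an illustrative implementation under stated assumptions: all tensors have shape `[num_steps, num_envs]`, `dones[t]` is 1 where the step taken at time `t` ended the episode, and `bootstrap_values[t]` holds the critic's value of the final observation for truncated episodes (0 elsewhere). The function name, the argument names, and the default `gamma` and `gae_lambda` are hypothetical.

```python
import torch


def compute_gae(rewards, values, dones, bootstrap_values, next_value,
                gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones, bootstrap_values: [num_steps, num_envs]
    next_value: [num_envs], value of the observation after the last step
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        not_done = 1.0 - dones[t]
        v_next = next_value if t == num_steps - 1 else values[t + 1]
        # Inside an episode use V(s_{t+1}); at an episode boundary use the
        # bootstrap value (V of the final observation if truncated, else 0).
        target_next = not_done * v_next + dones[t] * bootstrap_values[t]
        delta = rewards[t] + gamma * target_next - values[t]
        # The running sum is cut at episode boundaries.
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```

Because every input is a tensor, the whole computation runs on the same device as the rollout buffers without copying data to the CPU.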
Step 5: Policy Optimization
Update the policy and value networks using the PPO clipped surrogate objective. The collected rollout data is split into minibatches and used for multiple epochs of gradient updates. The clipped objective prevents destructively large policy updates, and the value function loss can optionally be clipped as well for stability.
Key considerations:
- Use multiple update epochs (e.g., 8) per rollout for sample efficiency
- Apply gradient clipping (e.g., max_grad_norm=0.5) for training stability
- Monitor the KL divergence and clip fraction to detect training issues
- An entropy bonus, if enabled, encourages exploration in the early stages of training
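The loss for one minibatch can be sketched as follows. This is a minimal sketch of the standard clipped objective; the function name `ppo_loss`, the argument names, and the default coefficients are illustrative assumptions. It also returns the approximate KL divergence and clip fraction mentioned above so they can be logged.

```python
import torch


def ppo_loss(new_logprob, old_logprob, advantages, new_value, old_value,
             returns, entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.0,
             clip_vloss=True, norm_adv=True):
    """Clipped PPO loss plus diagnostics for one minibatch of 1-D tensors."""
    logratio = new_logprob - old_logprob
    ratio = logratio.exp()

    with torch.no_grad():
        # Diagnostics only; they do not contribute gradients.
        approx_kl = ((ratio - 1) - logratio).mean()
        clipfrac = ((ratio - 1.0).abs() > clip_coef).float().mean()

    if norm_adv:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Policy loss: pessimistic maximum of the unclipped and clipped terms.
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss, optionally clipped around the old value prediction.
    v_loss_unclipped = (new_value - returns) ** 2
    if clip_vloss:
        v_clipped = old_value + torch.clamp(new_value - old_value,
                                            -clip_coef, clip_coef)
        v_loss_clipped = (v_clipped - returns) ** 2
        v_loss = 0.5 * torch.max(v_loss_unclipped, v_loss_clipped).mean()
    else:
        v_loss = 0.5 * v_loss_unclipped.mean()

    loss = pg_loss - ent_coef * entropy.mean() + vf_coef * v_loss
    return loss, approx_kl, clipfrac
```

In a training loop, `loss.backward()` would typically be followed by `torch.nn.utils.clip_grad_norm_(agent.parameters(), max_grad_norm)` and an optimizer step, repeated over every minibatch for each update epoch.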
Step 6: Evaluation and Checkpointing
Periodically evaluate the current policy on a separate set of environments and save model checkpoints. Evaluation uses deterministic action selection (mean of the policy distribution) and records success rates, episode returns, and episode lengths. Videos of evaluation episodes are captured for qualitative assessment.
Key considerations:
- Evaluate at regular intervals (e.g., every 25 rollout updates)
- Use separate evaluation environments with full episode resets
- Track success rate as the primary metric for manipulation tasks
- Save the best-performing checkpoint based on evaluation success rate
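The scheduling and best-checkpoint logic can be sketched in plain Python as below. The names `should_evaluate`, `summarize_episodes`, and `BestCheckpointTracker` are hypothetical helpers, and the save callback stands in for whatever checkpointing call the training script uses (for example, `torch.save(agent.state_dict(), path)`).

```python
def should_evaluate(iteration, eval_freq=25):
    """Evaluate every eval_freq rollout updates."""
    return iteration % eval_freq == 0


def summarize_episodes(successes, returns, lengths):
    """Average the per-episode evaluation metrics."""
    n = len(successes)
    return {
        "success_rate": sum(successes) / n,
        "return": sum(returns) / n,
        "episode_len": sum(lengths) / n,
    }


class BestCheckpointTracker:
    """Calls save_fn(iteration) whenever the success rate improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_success = float("-inf")

    def update(self, iteration, success_rate):
        if success_rate > self.best_success:
            self.best_success = success_rate
            self.save_fn(iteration)
            return True
        return False
```

During evaluation the policy would be queried in deterministic mode, so the recorded success rate reflects the mean action rather than a sampled one.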