Workflow:Haosulab ManiSkill RL Training with PPO

Knowledge Sources	ManiSkill, ManiSkill RL Setup, ManiSkill RL Baselines
Domains	Reinforcement_Learning, Robotics_Simulation, GPU_Parallelization
Last Updated	2026-02-15 12:00 GMT

Overview

End-to-end process for training reinforcement learning agents on ManiSkill GPU-parallelized robotics environments using Proximal Policy Optimization (PPO).

Description

This workflow covers the complete pipeline for training RL policies in ManiSkill's GPU-parallelized simulation framework. It leverages ManiSkill's ability to run hundreds of environments simultaneously on GPU via PhysX CUDA, enabling high-throughput training. The process covers environment configuration, vectorized environment wrapping, PPO agent setup with actor-critic networks, GPU-parallelized rollout collection, policy optimization, periodic evaluation, and model checkpointing. The trained policy maps state observations to continuous robot actions using a neural network trained with clipped surrogate objectives.

Usage

Use this workflow when you need to train a manipulation or locomotion policy from scratch with reinforcement learning, have access to an NVIDIA GPU with sufficient VRAM, and want to use GPU parallelism for fast training. It is appropriate when no demonstration data is available and the task has a well-defined reward function.

Execution Steps

Step 1: Environment Configuration

Select and configure a ManiSkill task environment with the appropriate observation mode, control mode, and simulation backend. The environment is created through the Gymnasium API with ManiSkill-specific parameters, including the number of parallel environments, the robot control mode (e.g., pd_joint_delta_pos), and rendering options. Select the GPU simulation backend for maximum training throughput.

Key considerations:

Choose the appropriate control mode for your robot and task (e.g., pd_joint_delta_pos for manipulation)
Set num_envs to a high value (e.g., 512 or 1024) for GPU-parallelized training
Configure observation mode based on whether you need state-only or visual observations
Set reconfiguration_freq if the task must rebuild the scene between episodes (for example, to load different object geometries); pose randomization alone is applied on every reset
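
The configuration choices above can be collected in one place before the environment is built. The following is a minimal sketch, not the workflow's actual code: EnvConfig and make_env are hypothetical names, the task ID PickCube-v1 is only an example, and the keyword arguments are assumed to match those accepted by the ManiSkill environment constructor in your installed version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvConfig:
    """Hypothetical container for the Step 1 settings."""
    env_id: str = "PickCube-v1"                 # example task ID (assumption)
    num_envs: int = 512                         # parallel environments on the GPU
    obs_mode: str = "state"                     # state-only observations
    control_mode: str = "pd_joint_delta_pos"
    sim_backend: str = "gpu"                    # assumed name of the GPU backend option
    render_mode: str = "rgb_array"              # needed only when videos are recorded
    reconfiguration_freq: Optional[int] = None  # None keeps the task default

    def make_kwargs(self) -> dict:
        """Keyword arguments to forward to gym.make (env_id is passed separately)."""
        kwargs = {
            "num_envs": self.num_envs,
            "obs_mode": self.obs_mode,
            "control_mode": self.control_mode,
            "sim_backend": self.sim_backend,
            "render_mode": self.render_mode,
        }
        if self.reconfiguration_freq is not None:
            kwargs["reconfiguration_freq"] = self.reconfiguration_freq
        return kwargs


def make_env(cfg: EnvConfig):
    """Build the environment. Imports are deferred so the config can be
    inspected on machines without ManiSkill installed."""
    import gymnasium as gym
    import mani_skill.envs  # noqa: F401  (importing registers ManiSkill tasks)
    return gym.make(cfg.env_id, **cfg.make_kwargs())
```

Keeping the settings in a dataclass makes it easy to derive an evaluation configuration from the training one, for example by copying it with a smaller num_envs.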

Step 2: Vectorized Environment Wrapping

Wrap the environment with ManiSkill's GPU-optimized vector environment wrapper and any additional Gymnasium wrappers needed for training. These include ManiSkillVectorEnv for GPU batching, FlattenActionSpaceWrapper for flattening dictionary action spaces into a single vector, and RecordEpisode for capturing evaluation videos.

Key considerations:

Use ManiSkillVectorEnv for GPU-parallelized environments
Apply FlattenActionSpaceWrapper if the action space is a dictionary rather than a single vector
Set up separate evaluation environments with fewer parallel instances
Configure video recording for evaluation episodes
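
One possible wrapper chain for the considerations above is sketched here. wrap_for_training is a hypothetical helper; the import paths and keyword arguments are assumed from ManiSkill 3 and should be checked against the installed version. The imports sit inside the function so that the module loads even where ManiSkill is absent.

```python
def wrap_for_training(env, num_envs, partial_reset=True, video_dir=None):
    """Apply the Step 2 wrappers to an environment created with gym.make.

    Import paths and keyword arguments are assumptions based on ManiSkill 3.
    """
    import gymnasium as gym
    from mani_skill.utils.wrappers.flatten import FlattenActionSpaceWrapper
    from mani_skill.utils.wrappers.record import RecordEpisode
    from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv

    # Flatten only when the action space is a dictionary.
    if isinstance(env.action_space, gym.spaces.Dict):
        env = FlattenActionSpaceWrapper(env)

    # Record videos only when an output directory is given (typically for evaluation).
    if video_dir is not None:
        env = RecordEpisode(env, output_dir=video_dir, save_trajectory=False, video_fps=30)

    # ignore_terminations=True runs every episode to its time limit,
    # which corresponds to turning partial resets off.
    return ManiSkillVectorEnv(
        env, num_envs, ignore_terminations=not partial_reset, record_metrics=True
    )
```

Under these assumptions, training environments would be wrapped with the defaults, while evaluation environments would pass partial_reset=False and a video_dir so that every recorded episode has the same length.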

Step 3: PPO Agent Initialization

Initialize the actor-critic neural network architecture and PPO hyperparameters. The agent consists of a shared feature extractor feeding into separate policy (actor) and value (critic) heads. Networks are initialized with orthogonal initialization and moved to the appropriate device.

Key considerations:

Use orthogonal weight initialization for stable training
Configure separate learning rates for actor and critic if needed
Set appropriate values for gamma (discount factor), GAE lambda, and clip range
Ensure network output dimensions match the environment action space
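
A minimal sketch of the initialization described in this step follows, assuming PyTorch. PPOConfig and build_actor_critic are hypothetical names, the hyperparameter values are illustrative rather than tuned, and the layer sizes and gains are assumptions. The learnable log standard deviation that a Gaussian policy also needs is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """Hypothetical hyperparameter container; values are illustrative."""
    gamma: float = 0.8           # discount factor
    gae_lambda: float = 0.9      # GAE lambda
    clip_coef: float = 0.2       # clip range for the surrogate objective
    learning_rate: float = 3e-4


def build_actor_critic(obs_dim: int, act_dim: int, hidden: int = 256):
    """Return a shared feature extractor with separate actor and critic heads.

    PyTorch is imported lazily so the config above can be used without it.
    """
    import torch
    import torch.nn as nn

    def layer_init(layer, std=math.sqrt(2), bias_const=0.0):
        # Orthogonal weights with the given gain, constant bias.
        nn.init.orthogonal_(layer.weight, std)
        nn.init.constant_(layer.bias, bias_const)
        return layer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    agent = nn.ModuleDict({
        "features": nn.Sequential(
            layer_init(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, hidden)), nn.Tanh(),
        ),
        # A small gain on the policy head keeps initial action means near zero.
        "actor_mean": layer_init(nn.Linear(hidden, act_dim), std=0.01),
        "critic": layer_init(nn.Linear(hidden, 1), std=1.0),
    })
    return agent.to(device)
```

Here obs_dim and act_dim would be read from the wrapped environment's single-environment observation and action spaces, which keeps the output dimension tied to the action space as the last consideration requires.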

Step 4: GPU Parallelized Rollout Collection

Collect experience by running the current policy across all parallel environments simultaneously. Each rollout gathers a fixed number of steps (e.g., 50) from all environments, producing a large batch of (state, action, reward, next_state) transitions. ManiSkill's partial reset mechanism allows finished environments to reset without waiting for all environments to complete.

Key considerations:

Partial reset is critical for GPU training efficiency, allowing environments to reset independently
Advantage estimation uses Generalized Advantage Estimation (GAE) across the rollout
All data remains on GPU as torch tensors to avoid CPU-GPU transfer bottlenecks
Finite horizon correction should be applied for truncated episodes
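
The advantage computation can be illustrated with a small sketch for a single environment, written with plain Python lists so the recursion is easy to follow. In actual training the same recursion would run over batched GPU tensors. compute_gae and its final_values argument are hypothetical, and bootstrapping from the final observation is shown as one common way to handle truncated episodes, not necessarily the exact correction used by this workflow.

```python
def compute_gae(rewards, values, dones, last_value, final_values=None,
                gamma=0.8, gae_lambda=0.9):
    """GAE for one environment's rollout, given as equal-length lists of floats.

    dones[t] is 1.0 if the episode ended at step t (step t+1 starts a new episode).
    final_values[t] is the critic's value of the final observation when the
    episode was truncated at step t, and 0.0 otherwise.
    last_value is the critic's value of the observation after the last step.
    """
    num_steps = len(rewards)
    if final_values is None:
        final_values = [0.0] * num_steps
    advantages = [0.0] * num_steps
    last_advantage = 0.0
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # Do not bootstrap across an episode boundary, except from the
        # final observation of a truncated episode.
        bootstrap = not_done * next_value + final_values[t]
        delta = rewards[t] + gamma * bootstrap - values[t]
        last_advantage = delta + gamma * gae_lambda * not_done * last_advantage
        advantages[t] = last_advantage
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns
```

For example, with rewards [1.0, 1.0], zero value estimates, no episode ends, gamma=0.5, and gae_lambda=1.0, the advantages are [1.5, 1.0]: the first step receives its own reward plus the discounted reward of the second.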

Step 5: Policy Optimization

Update the policy and value networks using the PPO clipped surrogate objective. The collected rollout data is split into minibatches and used for multiple epochs of gradient updates. The clipped objective prevents destructively large policy updates while the value function loss is also clipped for stability.
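
The minibatch schedule can be sketched as follows. minibatch_indices is a hypothetical helper, and the sketch assumes the batch size (number of environments times steps per rollout) divides evenly by the number of minibatches. With 512 environments, 50 steps, and 32 minibatches, for instance, each minibatch would hold 512 * 50 / 32 = 800 transitions.

```python
import random


def minibatch_indices(batch_size, num_minibatches, update_epochs, seed=0):
    """Yield (epoch, index_list) pairs.

    Each epoch reshuffles the flattened rollout and visits every
    transition exactly once, split into equally sized minibatches.
    """
    rng = random.Random(seed)
    minibatch_size = batch_size // num_minibatches
    indices = list(range(batch_size))
    for epoch in range(update_epochs):
        rng.shuffle(indices)
        for start in range(0, batch_size, minibatch_size):
            # Slicing copies the indices, so later shuffles do not alter them.
            yield epoch, indices[start:start + minibatch_size]
```

Each yielded index list would select the matching observations, actions, log-probabilities, advantages, and returns from the flattened rollout before one gradient step.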

Key considerations:

Use multiple update epochs (e.g., 8) per rollout for sample efficiency
Apply gradient clipping (e.g., max_grad_norm=0.5) for training stability
Monitor the KL divergence and clip fraction to detect training issues
An entropy bonus encourages exploration in the early stages of training
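
The clipped objectives and the diagnostics mentioned above can be written out in a few lines. This is a plain-Python sketch over lists of floats, meant to show the arithmetic rather than a training-ready implementation. The function names are hypothetical, and the value-loss form shown is one widely used variant, assumed here rather than taken from the workflow's code.

```python
import math


def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    """Return (mean clipped surrogate loss, approximate KL, clip fraction)."""
    losses, kls, num_clipped = [], [], 0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        log_ratio = new_lp - old_lp
        ratio = math.exp(log_ratio)
        clipped_ratio = min(max(ratio, 1.0 - clip_coef), 1.0 + clip_coef)
        # Pessimistic bound: keep the larger of the two negated objectives.
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
        # Non-negative sample estimate of KL(old || new).
        kls.append((ratio - 1.0) - log_ratio)
        num_clipped += int(abs(ratio - 1.0) > clip_coef)
    n = len(losses)
    return sum(losses) / n, sum(kls) / n, num_clipped / n


def clipped_value_loss(new_value, old_value, target_return, clip_coef=0.2):
    """Value loss where the new prediction may move at most clip_coef from the old one."""
    v_clipped = old_value + min(max(new_value - old_value, -clip_coef), clip_coef)
    return 0.5 * max((new_value - target_return) ** 2,
                     (v_clipped - target_return) ** 2)
```

A clip fraction that stays near 1.0, or an approximate KL that keeps growing across epochs, suggests the policy is moving too far per rollout; lowering the learning rate or the number of update epochs is a typical response.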

Step 6: Evaluation and Checkpointing

Periodically evaluate the current policy on a separate set of environments and save model checkpoints. Evaluation uses deterministic action selection (mean of the policy distribution) and records success rates, episode returns, and episode lengths. Videos of evaluation episodes are captured for qualitative assessment.

Key considerations:

Evaluate at regular intervals (e.g., every 25 rollout updates)
Use separate evaluation environments with full episode resets
Track success rate as the primary metric for manipulation tasks
Save the best-performing checkpoint based on evaluation success rate
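
The evaluation schedule and best-checkpoint rule can be sketched with a small helper. BestCheckpointTracker is a hypothetical class, and evaluating whenever the iteration count is a multiple of the interval is an assumption about the schedule.

```python
class BestCheckpointTracker:
    """Hypothetical helper that decides when to evaluate and when to save."""

    def __init__(self, eval_freq=25):
        self.eval_freq = eval_freq
        self.best_success_rate = float("-inf")

    def should_evaluate(self, iteration):
        """True on every eval_freq-th rollout update."""
        return iteration % self.eval_freq == 0

    def update(self, successes):
        """Record one evaluation run.

        successes holds one boolean per finished evaluation episode.
        Returns True when the success rate is a new best, meaning the
        caller should save a checkpoint.
        """
        success_rate = sum(successes) / len(successes)
        if success_rate > self.best_success_rate:
            self.best_success_rate = success_rate
            return True
        return False
```

In a PyTorch training loop, a True result from update would typically be followed by torch.save(agent.state_dict(), path), where agent and path are whatever the surrounding script defines.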
