# Principle:Haosulab ManiSkill Sim2Real Policy Training
| Field | Value |
|---|---|
| Principle Name | Sim2Real Policy Training |
| Domain | Sim2Real |
| Overview | Training RL/IL policies on digital twin environments with domain randomization for real deployment |
| Date | 2026-02-15 |
| Repository | Haosulab/ManiSkill |
## Overview
The Sim2Real Policy Training principle describes how ManiSkill combines digital twin environments with domain randomization to train policies in simulation that can be deployed directly on real robots. Rather than introducing a new training algorithm, this principle composes existing training workflows (PPO for reinforcement learning, behavioral cloning, Diffusion Policy for imitation learning) with simulation environments specifically designed for real-world transfer.
## Description
The sim2real policy training workflow follows a three-stage pipeline:
### Stage 1: Digital Twin Environment Setup
A digital twin environment is created that replicates the real-world workspace (see Principle:Haosulab_ManiSkill_Digital_Twin_Construction). This environment includes:
- Accurate robot model (URDF matching the physical robot)
- Workspace geometry (table, fixtures, backgrounds via greenscreening)
- Camera placement matching the real camera setup
- Task-specific objects with approximate physical properties
ManiSkill provides ready-made digital twin environments such as GraspCubeSO100Digital-v1 for the SO-100 robot arm.
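The elements listed above amount to a nominal scene description that later stages perturb. A schematic sketch of such a description (the field names and values here are illustrative, not ManiSkill's actual configuration schema):

```python
# Nominal digital twin description: the fixed reference point that the
# randomization stage later perturbs. All names and values are illustrative.
NOMINAL_SCENE = {
    "robot": {"urdf": "so100.urdf", "base_pose": (0.0, 0.0, 0.0)},
    "workspace": {"table_height": 0.0, "greenscreen": True},
    "camera": {
        "position": (0.3, 0.0, 0.4),   # meters, matching the real camera mount
        "look_at": (0.0, 0.0, 0.05),
        "width": 128, "height": 128,
    },
    "objects": [
        # approximate physical properties are sufficient for transfer
        {"name": "cube", "size": 0.025, "mass": 0.02},
    ],
}
```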
### Stage 2: Domain Randomization During Training
During training episodes, the environment randomizes parameters (see Principle:Haosulab_ManiSkill_Domain_Randomization) to broaden the training distribution:
- Camera poses within a plausible region around the nominal position
- Lighting conditions (direction, intensity)
- Object initial poses and orientations
- Visual properties (textures, colors)
This randomization is applied at each episode reset, ensuring the policy sees diverse conditions during training.
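As an illustration of per-reset randomization, the stdlib-only sketch below draws one set of episode parameters covering the categories above; the parameter names and ranges are illustrative, not ManiSkill's actual randomization API:

```python
import random

# Illustrative nominal camera position (meters); in practice this comes from
# the digital twin's calibrated camera mount.
NOMINAL_CAMERA_POS = (0.3, 0.0, 0.4)

def sample_episode_params(rng: random.Random) -> dict:
    """Draw one set of randomized parameters, as would happen at episode reset."""
    return {
        # camera pose: jitter within a plausible region around the nominal mount
        "camera_pos": tuple(
            c + rng.uniform(-0.02, 0.02) for c in NOMINAL_CAMERA_POS
        ),
        # lighting: direction (azimuth, radians) and intensity scale
        "light_azimuth": rng.uniform(0.0, 6.28318),
        "light_intensity": rng.uniform(0.5, 1.5),
        # object initial pose on the table
        "cube_xy": (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
        "cube_yaw": rng.uniform(-3.14159, 3.14159),
        # visual properties: RGB tint of the cube
        "cube_color": tuple(rng.random() for _ in range(3)),
    }

rng = random.Random(0)
params = sample_episode_params(rng)  # re-drawn at every episode reset
```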
### Stage 3: Policy Training with Standard Algorithms
The randomized digital twin environment is then used with standard training pipelines:
- RL training with PPO: The environment is vectorized across GPU-parallel instances and trained with Proximal Policy Optimization, using visual observations (RGB images from the camera) as input.
- Imitation learning with BC/Diffusion Policy: Demonstrations are collected (either via motion planning or teleoperation) in the digital twin environment, and policies are trained via behavioral cloning or more sophisticated methods like Diffusion Policy.
The key difference from standard training is that the environment ID is a digital twin variant (e.g., GraspCubeSO100Digital-v1 instead of PickCube-v1), and the observation mode includes visual data (rgb) that will be available from the real camera.
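Because the algorithms themselves are standard, their machinery is unchanged; for example, PPO typically computes its advantage targets with generalized advantage estimation (GAE) over each rollout from the parallel environments. A minimal, framework-independent sketch of that computation:

```python
from typing import List

def gae_advantages(
    rewards: List[float],
    values: List[float],      # V(s_t) for t = 0..T-1
    last_value: float,        # bootstrap value V(s_T) for the final state
    dones: List[bool],
    gamma: float = 0.99,      # discount factor
    lam: float = 0.95,        # GAE smoothing parameter
) -> List[float]:
    """Generalized advantage estimation, computed backwards over a rollout."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0   # zero out bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma = lam = 1` and no terminations, each advantage reduces to the sum of future rewards plus the bootstrap value, minus the current value estimate.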
## Usage
```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401 -- registers ManiSkill environments with gymnasium

# Create a digital twin environment for sim2real training
env = gym.make(
    "GraspCubeSO100Digital-v1",
    obs_mode="rgb",              # visual observations, matching the real camera feed
    control_mode="pd_joint_pos",
    num_envs=256,                # GPU-parallel instances for vectorized training
    sim_backend="gpu",
)
# Apply standard training wrappers and train with PPO, BC, etc.
```
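The imitation-learning path follows the same pattern: the algorithm is standard, and only the data source (demonstrations collected in the digital twin) is sim2real-specific. As a toy illustration of the behavioral cloning objective, the stdlib-only sketch below fits a linear policy to scalar demonstration pairs by gradient descent on mean squared action error; real pipelines operate on images with neural network policies such as Diffusion Policy:

```python
# Behavioral cloning in miniature: fit a linear policy a = w*s + b to
# demonstration (state, action) pairs by minimizing mean squared action
# error with plain gradient descent.

def bc_train(demos, lr=0.1, epochs=200):
    """demos: list of (state, action) pairs with scalar states and actions."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for s, a in demos:
            err = (w * s + b) - a          # prediction error on the demo action
            gw += 2 * err * s / len(demos)  # gradient of the MSE w.r.t. w
            gb += 2 * err / len(demos)      # gradient of the MSE w.r.t. b
        w -= lr * gw
        b -= lr * gb
    return w, b

# Demonstrations generated by the "expert" policy a = 2*s + 1
demos = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = bc_train(demos)  # converges toward w = 2, b = 1
```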
## Theoretical Basis
- Sim-to-real transfer: Training in simulation and deploying on real hardware is a well-established paradigm in robotics. The primary challenges are the reality gap (differences between simulation and real physics/visuals) and distribution shift (the policy encountering states not seen during training).
- Visual policy learning: Policies that operate directly on camera images (rather than privileged state information) are necessary for real-world deployment where ground-truth object poses are unavailable. Digital twin environments ensure the visual observations approximate real camera feeds.
- Domain adaptation via randomization: Rather than explicitly adapting the policy or the simulation to match reality (which requires real-world data), domain randomization makes the policy robust to a range of conditions that includes reality. This is sometimes called zero-shot sim-to-real transfer because no real-world fine-tuning is needed.
- Composability: A key design principle is that sim2real training does not require a new training algorithm. Instead, it composes existing algorithms (PPO, BC, Diffusion Policy) with a sim2real-aware environment, keeping the training pipeline modular and maintainable.