Principle:Haosulab ManiSkill Imitation Policy Training
| Field | Value |
|---|---|
| Source Repository | haosulab/ManiSkill |
| Domains | Imitation_Learning, Robotics, Machine_Learning, Deep_Learning |
| Last Updated | 2026-02-15 |
Overview
Description
Imitation Policy Training is the core learning step in the imitation learning pipeline, where a neural network policy is trained to replicate expert behavior from demonstration data. In the ManiSkill ecosystem, two primary approaches are supported: Behavioral Cloning (BC) and Diffusion Policy.
Behavioral Cloning frames imitation learning as a direct supervised regression problem. Given a dataset of (observation, action) pairs from expert demonstrations, a neural network (typically an MLP for state-based inputs) is trained to minimize the mean squared error (MSE) between its predicted actions and the expert's recorded actions. It is the simplest approach to imitation learning, requiring only standard supervised learning machinery.
Diffusion Policy is a more sophisticated approach that models the action distribution as a denoising diffusion process. Instead of directly predicting actions, a conditional U-Net (ConditionalUnet1D) is trained to iteratively denoise random Gaussian noise into action sequences, conditioned on observation context. The training process follows the Denoising Diffusion Probabilistic Model (DDPM) framework: noise is added to expert actions at random timesteps according to a noise schedule, and the network learns to predict the added noise. At inference time, the model generates actions by starting from pure noise and iteratively denoising through the learned reverse process. Diffusion Policy operates on action sequences (horizons) rather than single actions, enabling temporally coherent behavior generation.
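The inference-time denoising loop described above can be sketched with a standard DDPM scheduler. The snippet below is a minimal illustration, assuming a trained noise-prediction network with a ConditionalUnet1D-style call signature (sample, timestep, global_cond) and using the diffusers DDPMScheduler; function names and dimensions are illustrative, not ManiSkill's exact API.

```python
import torch
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=100,
                          beta_schedule="squaredcos_cap_v2",
                          prediction_type="epsilon")

@torch.no_grad()
def sample_action_sequence(noise_pred_net, obs_cond, pred_horizon, act_dim):
    # Start from pure Gaussian noise over the full prediction horizon.
    actions = torch.randn(1, pred_horizon, act_dim)
    scheduler.set_timesteps(scheduler.config.num_train_timesteps)
    for t in scheduler.timesteps:
        # Predict the noise present in the current sample, conditioned on observations.
        noise_pred = noise_pred_net(actions, t, global_cond=obs_cond)
        # One reverse-diffusion step toward the clean action sequence.
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    # Only act_horizon of these actions are executed before re-planning.
    return actions
```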
Usage
Policy training is used after dataset preparation (downloading demos, replaying/converting trajectories, loading into PyTorch datasets). It is the central step where the learning algorithm processes the demonstration data to produce a policy that can be deployed in the simulation environment.
Typical workflow:
- Prepare dataset (download -> replay/convert -> load)
- Configure training hyperparameters (learning rate, batch size, architecture, horizons)
- Run training loop with periodic evaluation (a minimal control-flow sketch follows this list)
- Select best checkpoint based on evaluation metrics (success rate)
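The sketch below only illustrates the control flow of this workflow: periodic evaluation and checkpoint selection by success rate. The helpers `train_step` and `evaluate` are hypothetical placeholders, not ManiSkill function names.

```python
import torch
from itertools import cycle

def run_training(policy, dataloader, eval_envs, train_step, evaluate,
                 total_iters=30_000, eval_freq=1_000):
    best_success = 0.0
    data_iter = cycle(dataloader)                        # stream (observation, action) batches
    for it in range(total_iters):
        batch = next(data_iter)
        loss = train_step(policy, batch)                 # BC or diffusion loss (see below)
        if (it + 1) % eval_freq == 0:
            success_rate = evaluate(policy, eval_envs)   # rollouts in the simulation environment
            if success_rate >= best_success:             # keep the best checkpoint by success rate
                best_success = success_rate
                torch.save(policy.state_dict(), "best_checkpoint.pt")
```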
Theoretical Basis
Behavioral Cloning (Pomerleau, 1989)
Behavioral Cloning treats imitation learning as supervised regression. The policy is a deterministic mapping a = f_theta(s), where f_theta is a neural network from observations to actions. The training objective minimizes:
L(theta) = E_{(s,a) ~ D_expert} [ ||f_theta(s) - a||^2 ]
where D_expert is the demonstration dataset.
Strengths:
- Simple to implement -- standard supervised learning
- Fast training -- no environment interaction needed
- Stable optimization -- MSE loss is well-behaved
Weaknesses:
- Distribution shift (compounding errors): Small prediction errors accumulate over time as the policy visits states not in the training distribution
- Unimodal assumption: MSE regression produces the mean of multi-modal action distributions, which can result in averaging over distinct strategies
In ManiSkill's BC implementation, the policy network is a 3-layer MLP with ReLU activations (256 hidden units per layer), trained with the Adam optimizer.
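A minimal state-based BC sketch along these lines follows. It reads the "3-layer, 256-unit" description as three hidden layers; the learning rate and dimensions are placeholder assumptions, not ManiSkill's exact defaults.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """MLP mapping a state observation directly to an action (one reading of the text above)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

policy = BCPolicy(obs_dim=42, act_dim=8)                     # placeholder dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)   # lr is an assumption

def bc_loss(obs: torch.Tensor, expert_actions: torch.Tensor) -> torch.Tensor:
    # L(theta) = E[ ||f_theta(s) - a||^2 ]
    return nn.functional.mse_loss(policy(obs), expert_actions)
```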
Diffusion Policy (Chi et al., 2023)
Diffusion Policy addresses the limitations of behavioral cloning by modeling the full action distribution using denoising diffusion probabilistic models. The key ideas are:
- Action sequence prediction: Rather than predicting a single action, the model predicts a sequence of future actions (the prediction horizon), of which only a subset (the action horizon) is executed before re-planning. This temporal consistency improves performance on tasks requiring smooth, coordinated motions.
- Observation horizon: The model conditions on a window of recent observations (the observation horizon), enabling it to reason about velocity and temporal context.
- Denoising process: During training, Gaussian noise is added to expert action sequences at random diffusion timesteps, and a ConditionalUnet1D is trained to predict the added noise. During inference, action sequences are generated by iterative denoising from pure Gaussian noise through the learned reverse diffusion process.
The training objective is:
L(theta) = E_{t, epsilon, a_0} [ ||epsilon_theta(a_t, t, o) - epsilon||^2 ]
where a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * epsilon is the noised action, t is the diffusion timestep, o is the observation conditioning, and epsilon is the true noise.
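One training step under this objective can be sketched with the diffusers DDPMScheduler, whose add_noise call implements the forward-noising formula above. Here noise_pred_net stands in for the ConditionalUnet1D; its (sample, timestep, global_cond) signature mirrors the common Diffusion Policy interface and is an assumption rather than a confirmed ManiSkill API.

```python
import torch
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler

scheduler = DDPMScheduler(
    num_train_timesteps=100,            # num_diffusion_iters
    beta_schedule="squaredcos_cap_v2",  # noise_schedule
    prediction_type="epsilon",          # the network predicts the added noise
)

def diffusion_loss(noise_pred_net, obs_cond, expert_action_seq):
    # expert_action_seq: (B, pred_horizon, act_dim); obs_cond: flattened observation window
    B = expert_action_seq.shape[0]
    noise = torch.randn_like(expert_action_seq)                      # epsilon
    timesteps = torch.randint(                                       # t ~ Uniform{0, ..., T-1}
        0, scheduler.config.num_train_timesteps, (B,),
        device=expert_action_seq.device,
    )
    # a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * epsilon
    noisy_actions = scheduler.add_noise(expert_action_seq, noise, timesteps)
    noise_pred = noise_pred_net(noisy_actions, timesteps, global_cond=obs_cond)
    return torch.nn.functional.mse_loss(noise_pred, noise)
```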
Key hyperparameters:
| Parameter | Typical Value | Description |
|---|---|---|
| obs_horizon | 2 | Number of past observations to condition on |
| act_horizon | 8 | Number of actions to execute before re-planning |
| pred_horizon | 16 | Total number of actions predicted per denoising pass |
| num_diffusion_iters | 100 | Number of diffusion timesteps (training) |
| noise_schedule | squaredcos_cap_v2 | Beta schedule for the forward diffusion process |
| prediction_type | epsilon | Predict noise (vs. denoised sample) |
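These values can be grouped into a small config object. The field names below simply mirror the table and are not necessarily the exact argument names of ManiSkill's training scripts.

```python
from dataclasses import dataclass

@dataclass
class DiffusionPolicyConfig:
    obs_horizon: int = 2
    act_horizon: int = 8
    pred_horizon: int = 16
    num_diffusion_iters: int = 100
    noise_schedule: str = "squaredcos_cap_v2"
    prediction_type: str = "epsilon"
```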
Additional training techniques used in ManiSkill's diffusion policy:
- Exponential Moving Average (EMA): Maintains a smoothed copy of model weights for more stable evaluation
- Cosine learning rate schedule with linear warmup
- AdamW optimizer with weight decay
The constraint obs_horizon + act_horizon - 1 <= pred_horizon must hold so that the executed action window fits within the predicted sequence.
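A sketch of this optimization setup follows, assuming the diffusers helpers get_scheduler and EMAModel; the learning rate, weight decay, and warmup length are placeholder assumptions, not ManiSkill's confirmed defaults.

```python
import torch
from diffusers.optimization import get_scheduler
from diffusers.training_utils import EMAModel

def build_optimization(noise_pred_net: torch.nn.Module, cfg, total_iters: int):
    # cfg: e.g. the DiffusionPolicyConfig sketched above.
    # Horizon constraint from the text above.
    assert cfg.obs_horizon + cfg.act_horizon - 1 <= cfg.pred_horizon

    # AdamW with weight decay (values assumed).
    optimizer = torch.optim.AdamW(noise_pred_net.parameters(), lr=1e-4, weight_decay=1e-6)
    # Cosine learning rate decay with linear warmup (warmup length assumed).
    lr_scheduler = get_scheduler(
        name="cosine",
        optimizer=optimizer,
        num_warmup_steps=500,
        num_training_steps=total_iters,
    )
    # Exponential moving average of weights for more stable evaluation.
    ema = EMAModel(parameters=noise_pred_net.parameters())
    # Per step: loss.backward(); optimizer.step(); lr_scheduler.step();
    #           ema.step(noise_pred_net.parameters()); optimizer.zero_grad()
    return optimizer, lr_scheduler, ema
```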
Related Pages
- Implementation:Haosulab_ManiSkill_BC_Diffusion_Training -- The concrete training scripts for BC and diffusion policy.
- Principle:Haosulab_ManiSkill_Trajectory_Dataset_Loading -- The preceding step: loading trajectory data into datasets.
- Principle:Haosulab_ManiSkill_IL_Policy_Evaluation -- The next step: evaluating trained policies on simulation environments.
- Principle:Haosulab_ManiSkill_Demonstration_Data_Acquisition -- The first step: acquiring expert demonstrations.