
Principle:Haosulab ManiSkill Imitation Policy Training

From Leeroopedia
Source Repository: haosulab/ManiSkill
Domains: Imitation_Learning, Robotics, Machine_Learning, Deep_Learning
Last Updated: 2026-02-15

Overview

Description

Imitation Policy Training is the core learning step in the imitation learning pipeline, where a neural network policy is trained to replicate expert behavior from demonstration data. In the ManiSkill ecosystem, two primary approaches are supported: Behavioral Cloning (BC) and Diffusion Policy.

Behavioral Cloning frames imitation learning as a direct supervised regression problem. Given a dataset of (observation, action) pairs from expert demonstrations, a neural network (typically an MLP for state-based inputs) is trained to minimize the mean squared error (MSE) between its predicted actions and the expert's recorded actions. This is the simplest approach to imitation learning, requiring only standard supervised learning machinery.

Diffusion Policy is a more sophisticated approach that models the action distribution as a denoising diffusion process. Instead of directly predicting actions, a conditional U-Net (ConditionalUnet1D) is trained to iteratively denoise random Gaussian noise into action sequences, conditioned on observation context. The training process follows the Denoising Diffusion Probabilistic Model (DDPM) framework: noise is added to expert actions at random timesteps according to a noise schedule, and the network learns to predict the added noise. At inference time, the model generates actions by starting from pure noise and iteratively denoising through the learned reverse process. Diffusion Policy operates on action sequences (horizons) rather than single actions, enabling temporally coherent behavior generation.

Usage

Policy training is used after dataset preparation (downloading demos, replaying/converting trajectories, loading into PyTorch datasets). It is the central step where the learning algorithm processes the demonstration data to produce a policy that can be deployed in the simulation environment.

Typical workflow:

  1. Prepare dataset (download -> replay/convert -> load)
  2. Configure training hyperparameters (learning rate, batch size, architecture, horizons)
  3. Run training loop with periodic evaluation
  4. Select best checkpoint based on evaluation metrics (success rate)
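Steps 3 and 4 of the workflow above can be sketched as a schematic loop that interleaves gradient updates with periodic evaluation and keeps the best-performing checkpoint. The `train_step` and `evaluate` callbacks here are hypothetical stand-ins, not ManiSkill APIs:

```python
def train(num_epochs, eval_interval, train_step, evaluate):
    """Schematic loop: train, periodically evaluate, keep the best checkpoint."""
    best = {"success_rate": -1.0, "epoch": None}
    for epoch in range(num_epochs):
        train_step(epoch)  # one pass of gradient updates on demo data
        if (epoch + 1) % eval_interval == 0:
            rate = evaluate()  # rollout in the env, measure success rate
            if rate > best["success_rate"]:
                best = {"success_rate": rate, "epoch": epoch}
    return best

# Toy usage with stub callbacks (illustration only).
rates = iter([0.2, 0.5, 0.4])
best = train(6, 2, train_step=lambda e: None, evaluate=lambda: next(rates))
print(best)  # {'success_rate': 0.5, 'epoch': 3}
```

In practice the evaluation rollouts are the expensive part, which is why they run only every `eval_interval` epochs rather than every step.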

Theoretical Basis

Behavioral Cloning (Pomerleau, 1989)

Behavioral Cloning treats imitation learning as supervised regression. The policy is parameterized as a deterministic mapping a = f_theta(s), where f_theta is a neural network from observations to actions. The training objective minimizes:

L(theta) = E_{(s,a) ~ D_expert} [ ||f_theta(s) - a||^2 ]

where D_expert is the demonstration dataset.
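The objective can be made concrete with a few lines of numpy. The linear policy here is a hypothetical stand-in for f_theta, and the dataset is synthetic, purely to show the shape of the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for f_theta: a linear policy s -> W s + b.
obs_dim, act_dim, n = 4, 2, 128
W = rng.normal(size=(act_dim, obs_dim))
b = np.zeros(act_dim)

# Synthetic stand-in for D_expert: (s, a) pairs.
states = rng.normal(size=(n, obs_dim))
actions = rng.normal(size=(n, act_dim))

def bc_loss(W, b, states, actions):
    """L(theta) = E_{(s,a) ~ D_expert} [ ||f_theta(s) - a||^2 ]."""
    pred = states @ W.T + b
    return np.mean(np.sum((pred - actions) ** 2, axis=1))

loss = bc_loss(W, b, states, actions)
```

The loss is zero exactly when the policy reproduces every expert action, and gradient descent on it is ordinary regression.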

Strengths:

  • Simple to implement -- standard supervised learning
  • Fast training -- no environment interaction needed
  • Stable optimization -- MSE loss is well-behaved

Weaknesses:

  • Distribution shift (compounding errors): Small prediction errors accumulate over time as the policy visits states not in the training distribution
  • Unimodal assumption: MSE regression produces the mean of multi-modal action distributions, which can result in averaging over distinct strategies

In ManiSkill's BC implementation, the policy network is a 3-layer MLP with ReLU activations (256 hidden units per layer), trained with the Adam optimizer.
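The forward pass of such a network can be sketched in numpy (the actual implementation uses PyTorch and Adam; interpreting "3-layer" as three hidden layers of 256 units is an assumption here, made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, hidden = 10, 4, 256

# Three hidden layers of 256 units with ReLU, plus a linear action head.
sizes = [obs_dim, hidden, hidden, hidden, act_dim]
weights = [rng.normal(scale=0.05, size=(i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

def mlp_policy(s):
    x = s
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)  # ReLU activation
    return x @ weights[-1] + biases[-1]  # no activation on the action output

batch = rng.normal(size=(32, obs_dim))
out = mlp_policy(batch)  # shape (32, 4): one action per observation
```

Leaving the output layer linear lets the policy regress unbounded continuous actions, which is what the MSE objective expects.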

Diffusion Policy (Chi et al., 2023)

Diffusion Policy addresses the limitations of behavioral cloning by modeling the full action distribution using denoising diffusion probabilistic models. The key ideas are:

  • Action sequence prediction: Rather than predicting a single action, the model predicts a sequence of future actions (the prediction horizon), of which only a subset (the action horizon) is executed before re-planning. This temporal consistency improves performance on tasks requiring smooth, coordinated motions.
  • Observation horizon: The model conditions on a window of recent observations (the observation horizon), enabling it to reason about velocity and temporal context.
  • Denoising process: During training, Gaussian noise is added to expert action sequences at random diffusion timesteps, and a ConditionalUnet1D is trained to predict the added noise. During inference, action sequences are generated by iterative denoising from pure Gaussian noise through the learned reverse diffusion process.

The training objective is:

L(theta) = E_{t, epsilon, a_0} [ ||epsilon_theta(a_t, t, o) - epsilon||^2 ]

where a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * epsilon is the noised action, t is the diffusion timestep, o is the observation conditioning, and epsilon is the true noise.
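The forward (noising) step and the epsilon-prediction loss follow directly from the formula above. This sketch uses an illustrative linear beta schedule rather than the squaredcos_cap_v2 schedule ManiSkill configures, and random arrays in place of real action sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                              # num_diffusion_iters
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear schedule
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def noise_actions(a0, t, eps):
    """Forward diffusion: a_t = sqrt(ab_t) * a_0 + sqrt(1 - ab_t) * eps."""
    ab = alpha_bar[t]
    return np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps

a0 = rng.normal(size=(16, 7))        # stand-in action sequence (pred_horizon x act_dim)
eps = rng.normal(size=a0.shape)      # the true noise the network must recover
a_t = noise_actions(a0, t=50, eps=eps)

def ddpm_loss(eps_pred, eps):
    """||epsilon_theta(a_t, t, o) - epsilon||^2, averaged over the batch."""
    return np.mean((eps_pred - eps) ** 2)
```

Because alpha_bar_t shrinks toward zero as t grows, late timesteps are almost pure noise, which is exactly the regime the reverse process starts from at inference time.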

Key hyperparameters:

Parameter            Typical Value        Description
obs_horizon          2                    Number of past observations to condition on
act_horizon          8                    Number of actions to execute before re-planning
pred_horizon         16                   Total number of actions predicted per denoising pass
num_diffusion_iters  100                  Number of diffusion timesteps (training)
noise_schedule       squaredcos_cap_v2    Beta schedule for the forward diffusion process
prediction_type      epsilon              Predict noise (vs. denoised sample)
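The squaredcos_cap_v2 schedule named above can be sketched as follows; this follows the capped cosine schedule of Nichol & Dhariwal (2021), which is what the diffusers library implements under that name (the cap of 0.999 is part of that definition):

```python
import math

def squaredcos_cap_v2(num_timesteps, max_beta=0.999):
    """Cosine schedule: alpha_bar(u) = cos((u + 0.008)/1.008 * pi/2)^2,
    with each beta_t capped at max_beta."""
    def alpha_bar(u):
        return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_timesteps):
        t1, t2 = i / num_timesteps, (i + 1) / num_timesteps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

betas = squaredcos_cap_v2(100)  # one beta per diffusion timestep
```

Compared with a linear schedule, the cosine shape destroys information more gradually at early timesteps, which tends to make the denoising task better conditioned.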

Additional training techniques used in ManiSkill's diffusion policy:

  • Exponential Moving Average (EMA): Maintains a smoothed copy of model weights for more stable evaluation
  • Cosine learning rate schedule with linear warmup
  • AdamW optimizer with weight decay
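The EMA technique from the list above amounts to a one-line update per parameter. The decay value here is a typical choice for illustration, not necessarily ManiSkill's setting:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# Toy usage: the EMA copy trails the raw weights smoothly.
params = [np.ones(3)]  # pretend the trained weights jumped to 1.0
ema = [np.zeros(3)]    # EMA copy starts at the old value 0.0
for _ in range(10):
    ema_update(ema, params)
# After n steps the EMA has moved to 1 - decay**n of the way there.
```

Evaluating with the EMA copy filters out step-to-step optimizer noise, which is why checkpoints are usually selected from the EMA weights rather than the raw ones.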

The constraint obs_horizon + act_horizon - 1 <= pred_horizon must hold.
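This constraint is easy to check programmatically; with the typical values from the table, 2 + 8 - 1 = 9 <= 16, so they are a valid combination:

```python
def horizons_valid(obs_horizon, act_horizon, pred_horizon):
    """The executed window (act_horizon) plus the observation context
    (obs_horizon - 1 preceding steps) must fit inside pred_horizon."""
    return obs_horizon + act_horizon - 1 <= pred_horizon

print(horizons_valid(2, 8, 16))   # True: 2 + 8 - 1 = 9 <= 16
print(horizons_valid(8, 16, 16))  # False: 8 + 16 - 1 = 23 > 16
```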
