
Principle:Haosulab ManiSkill Imitation Policy Training

From Leeroopedia
Source Repository: haosulab/ManiSkill
Domains: Imitation_Learning, Robotics, Machine_Learning, Deep_Learning
Last Updated: 2026-02-15

Overview

Description

Imitation Policy Training is the core learning step in the imitation learning pipeline, where a neural network policy is trained to replicate expert behavior from demonstration data. In the ManiSkill ecosystem, two primary approaches are supported: Behavioral Cloning (BC) and Diffusion Policy.

Behavioral Cloning frames imitation learning as a direct supervised regression problem. Given a dataset of (observation, action) pairs from expert demonstrations, a neural network (typically an MLP for state-based inputs) is trained to minimize the mean squared error (MSE) between its predicted actions and the expert's recorded actions. This is the simplest approach to imitation learning, requiring only standard supervised learning machinery.

Diffusion Policy is a more sophisticated approach that models the action distribution as a denoising diffusion process. Instead of directly predicting actions, a conditional U-Net (ConditionalUnet1D) is trained to iteratively denoise random Gaussian noise into action sequences, conditioned on observation context. The training process follows the Denoising Diffusion Probabilistic Model (DDPM) framework: noise is added to expert actions at random timesteps according to a noise schedule, and the network learns to predict the added noise. At inference time, the model generates actions by starting from pure noise and iteratively denoising through the learned reverse process. Diffusion Policy operates on action sequences (horizons) rather than single actions, enabling temporally coherent behavior generation.

Usage

Policy training is used after dataset preparation (downloading demos, replaying/converting trajectories, loading into PyTorch datasets). It is the central step where the learning algorithm processes the demonstration data to produce a policy that can be deployed in the simulation environment.

Typical workflow:

  1. Prepare dataset (download -> replay/convert -> load)
  2. Configure training hyperparameters (learning rate, batch size, architecture, horizons)
  3. Run training loop with periodic evaluation
  4. Select best checkpoint based on evaluation metrics (success rate)
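Steps 3 and 4 of the workflow above can be sketched as a schematic loop that interleaves gradient updates with periodic evaluation and keeps the best-performing checkpoint. The `train_step` and `evaluate` callbacks here are hypothetical stand-ins, not ManiSkill APIs:

```python
def train(num_epochs, eval_interval, train_step, evaluate):
    """Schematic loop: train, periodically evaluate, keep the best checkpoint."""
    best = {"success_rate": -1.0, "epoch": None}
    for epoch in range(num_epochs):
        train_step(epoch)  # one pass of gradient updates on demo data
        if (epoch + 1) % eval_interval == 0:
            rate = evaluate()  # rollout in the env, measure success rate
            if rate > best["success_rate"]:
                best = {"success_rate": rate, "epoch": epoch}
    return best

# Toy usage with stub callbacks (illustration only).
rates = iter([0.2, 0.5, 0.4])
best = train(6, 2, train_step=lambda e: None, evaluate=lambda: next(rates))
print(best)  # {'success_rate': 0.5, 'epoch': 3}
```

In practice the evaluation rollouts are the expensive part, which is why they run only every `eval_interval` epochs rather than every step.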

Theoretical Basis

Behavioral Cloning (Pomerleau, 1989)

Behavioral Cloning treats imitation learning as supervised regression. The policy is parameterized as a deterministic mapping a = f_theta(s), where f_theta is a neural network from observations to actions. The training objective minimizes:

L(theta) = E_{(s,a) ~ D_expert} [ ||f_theta(s) - a||^2 ]

where D_expert is the demonstration dataset.
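The objective can be made concrete with a few lines of numpy. The linear policy here is a hypothetical stand-in for f_theta, and the dataset is synthetic, purely to show the shape of the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for f_theta: a linear policy s -> W s + b.
obs_dim, act_dim, n = 4, 2, 128
W = rng.normal(size=(act_dim, obs_dim))
b = np.zeros(act_dim)

# Synthetic stand-in for D_expert: (s, a) pairs.
states = rng.normal(size=(n, obs_dim))
actions = rng.normal(size=(n, act_dim))

def bc_loss(W, b, states, actions):
    """L(theta) = E_{(s,a) ~ D_expert} [ ||f_theta(s) - a||^2 ]."""
    pred = states @ W.T + b
    return np.mean(np.sum((pred - actions) ** 2, axis=1))

loss = bc_loss(W, b, states, actions)
```

The loss is zero exactly when the policy reproduces every expert action, and gradient descent on it is ordinary regression.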

Strengths:

  • Simple to implement -- standard supervised learning
  • Fast training -- no environment interaction needed
  • Stable optimization -- MSE loss is well-behaved

Weaknesses:

  • Distribution shift (compounding errors): Small prediction errors accumulate over time as the policy visits states not in the training distribution
  • Unimodal assumption: MSE regression produces the mean of multi-modal action distributions, which can result in averaging over distinct strategies

In ManiSkill's BC implementation, the policy network is a 3-layer MLP with ReLU activations (256 hidden units per layer), trained with the Adam optimizer.
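The forward pass of such a network can be sketched in numpy (the actual implementation uses PyTorch and Adam; interpreting "3-layer" as three hidden layers of 256 units is an assumption here, made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, hidden = 10, 4, 256

# Three hidden layers of 256 units with ReLU, plus a linear action head.
sizes = [obs_dim, hidden, hidden, hidden, act_dim]
weights = [rng.normal(scale=0.05, size=(i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

def mlp_policy(s):
    x = s
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)  # ReLU activation
    return x @ weights[-1] + biases[-1]  # no activation on the action output

batch = rng.normal(size=(32, obs_dim))
out = mlp_policy(batch)  # shape (32, 4): one action per observation
```

Leaving the output layer linear lets the policy regress unbounded continuous actions, which is what the MSE objective expects.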

Diffusion Policy (Chi et al., 2023)

Diffusion Policy addresses the limitations of behavioral cloning by modeling the full action distribution using denoising diffusion probabilistic models. The key ideas are:

  • Action sequence prediction: Rather than predicting a single action, the model predicts a sequence of future actions (the prediction horizon), of which only a subset (the action horizon) is executed before re-planning. This temporal consistency improves performance on tasks requiring smooth, coordinated motions.
  • Observation horizon: The model conditions on a window of recent observations (the observation horizon), enabling it to reason about velocity and temporal context.
  • Denoising process: During training, Gaussian noise is added to expert action sequences at random diffusion timesteps, and a ConditionalUnet1D is trained to predict the added noise. During inference, action sequences are generated by iterative denoising from pure Gaussian noise through the learned reverse diffusion process.

The training objective is:

L(theta) = E_{t, epsilon, a_0} [ ||epsilon_theta(a_t, t, o) - epsilon||^2 ]

where a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * epsilon is the noised action, t is the diffusion timestep, o is the observation conditioning, and epsilon is the true noise.
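The forward (noising) step and the epsilon-prediction loss follow directly from the formula above. This sketch uses an illustrative linear beta schedule rather than the squaredcos_cap_v2 schedule ManiSkill configures, and random arrays in place of real action sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                              # num_diffusion_iters
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear schedule
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def noise_actions(a0, t, eps):
    """Forward diffusion: a_t = sqrt(ab_t) * a_0 + sqrt(1 - ab_t) * eps."""
    ab = alpha_bar[t]
    return np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps

a0 = rng.normal(size=(16, 7))        # stand-in action sequence (pred_horizon x act_dim)
eps = rng.normal(size=a0.shape)      # the true noise the network must recover
a_t = noise_actions(a0, t=50, eps=eps)

def ddpm_loss(eps_pred, eps):
    """||epsilon_theta(a_t, t, o) - epsilon||^2, averaged over the batch."""
    return np.mean((eps_pred - eps) ** 2)
```

Because alpha_bar_t shrinks toward zero as t grows, late timesteps are almost pure noise, which is exactly the regime the reverse process starts from at inference time.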

Key hyperparameters:

Parameter            Typical Value        Description
obs_horizon          2                    Number of past observations to condition on
act_horizon          8                    Number of actions to execute before re-planning
pred_horizon         16                   Total number of actions predicted per denoising pass
num_diffusion_iters  100                  Number of diffusion timesteps (training)
noise_schedule       squaredcos_cap_v2    Beta schedule for the forward diffusion process
prediction_type      epsilon              Predict noise (vs. denoised sample)
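The squaredcos_cap_v2 schedule named above can be sketched as follows; this follows the capped cosine schedule of Nichol & Dhariwal (2021), which is what the diffusers library implements under that name (the cap of 0.999 is part of that definition):

```python
import math

def squaredcos_cap_v2(num_timesteps, max_beta=0.999):
    """Cosine schedule: alpha_bar(u) = cos((u + 0.008)/1.008 * pi/2)^2,
    with each beta_t capped at max_beta."""
    def alpha_bar(u):
        return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_timesteps):
        t1, t2 = i / num_timesteps, (i + 1) / num_timesteps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

betas = squaredcos_cap_v2(100)  # one beta per diffusion timestep
```

Compared with a linear schedule, the cosine shape destroys information more gradually at early timesteps, which tends to make the denoising task better conditioned.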

Additional training techniques used in ManiSkill's diffusion policy:

  • Exponential Moving Average (EMA): Maintains a smoothed copy of model weights for more stable evaluation
  • Cosine learning rate schedule with linear warmup
  • AdamW optimizer with weight decay
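The EMA technique from the list above amounts to a one-line update per parameter. The decay value here is a typical choice for illustration, not necessarily ManiSkill's setting:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# Toy usage: the EMA copy trails the raw weights smoothly.
params = [np.ones(3)]  # pretend the trained weights jumped to 1.0
ema = [np.zeros(3)]    # EMA copy starts at the old value 0.0
for _ in range(10):
    ema_update(ema, params)
# After n steps the EMA has moved to 1 - decay**n of the way there.
```

Evaluating with the EMA copy filters out step-to-step optimizer noise, which is why checkpoints are usually selected from the EMA weights rather than the raw ones.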

The constraint obs_horizon + act_horizon - 1 <= pred_horizon must hold.
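This constraint is easy to check programmatically; with the typical values from the table, 2 + 8 - 1 = 9 <= 16, so they are a valid combination:

```python
def horizons_valid(obs_horizon, act_horizon, pred_horizon):
    """The executed window (act_horizon) plus the observation context
    (obs_horizon - 1 preceding steps) must fit inside pred_horizon."""
    return obs_horizon + act_horizon - 1 <= pred_horizon

print(horizons_valid(2, 8, 16))   # True: 2 + 8 - 1 = 9 <= 16
print(horizons_valid(8, 16, 16))  # False: 8 + 16 - 1 = 23 > 16
```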
