
Principle:Haosulab ManiSkill IL Policy Evaluation

From Leeroopedia
Field Value
Source Repository haosulab/ManiSkill
Domains Imitation_Learning, Robotics, Evaluation, Machine_Learning
Last Updated 2026-02-15

Overview

Description

IL Policy Evaluation is the process of measuring the performance of trained imitation learning policies by rolling them out in simulation environments and collecting task-completion metrics. After training a behavioral cloning or diffusion policy from expert demonstrations, the policy must be evaluated in closed-loop interaction with the environment to determine whether it can actually solve the task -- a step that is especially important for imitation learning because training loss (MSE on action prediction) is often a poor proxy for actual task performance.

Evaluation in ManiSkill follows a standard rollout protocol: multiple parallel environments are created with the same task configuration used during training, the policy generates actions from observations at each timestep, and the environments report success/failure and other metrics at episode termination. The key output metric is the success rate -- the fraction of episodes in which the robot successfully completes the task. Additional metrics include success_once (whether the task was achieved at any point during the episode), success_at_end (whether the task was achieved at the final timestep), and episode_length.
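The rollout protocol above can be sketched as a short loop over a vectorized environment. The snippet below is illustrative, not the exact ManiSkill interface: it uses a gymnasium-style vectorized API, a stub environment in place of a real simulator, and hypothetical names (`evaluate`, `StubVecEnv`), while reading the per-step `success` flag from `info` as ManiSkill environments report it.

```python
import numpy as np

def evaluate(policy, env, num_envs: int, max_steps: int) -> dict:
    """Synchronized rollout over a vectorized env, collecting episode metrics."""
    obs, _ = env.reset()
    success_once = np.zeros(num_envs, dtype=bool)
    success_at_end = np.zeros(num_envs, dtype=bool)
    for t in range(max_steps):
        obs, _reward, _term, _trunc, info = env.step(policy(obs))
        success_once |= info["success"]       # achieved at any point in the episode
        if t == max_steps - 1:
            success_at_end = info["success"]  # achieved at the final timestep
    return {"success_once": success_once.mean(),
            "success_at_end": success_at_end.mean()}

class StubVecEnv:
    """Stand-in for a batch of parallel envs: env i first succeeds at step i+1."""
    def __init__(self, num_envs):
        self.num_envs, self.t = num_envs, 0
    def reset(self):
        self.t = 0
        return np.zeros((self.num_envs, 3)), {}
    def step(self, actions):
        self.t += 1
        success = np.arange(self.num_envs) < self.t
        obs = np.zeros((self.num_envs, 3))
        return (obs, success.astype(float), np.zeros_like(success),
                np.zeros_like(success), {"success": success})

env = StubVecEnv(num_envs=4)
metrics = evaluate(lambda obs: None, env, num_envs=4, max_steps=3)
print(metrics)  # envs 0-2 succeed within 3 steps, env 3 never -> 0.75 for both metrics
```

The success rate is then the mean of the per-episode success flags across all evaluation environments.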

Evaluation is typically performed periodically during training (at fixed iteration intervals) to monitor learning progress and select the best model checkpoint. The best checkpoint is selected based on the highest evaluation success rate, which may differ from the iteration with the lowest training loss due to distribution shift effects inherent in behavioral cloning.
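Checkpoint selection by evaluation success rate, rather than training loss, can be sketched as follows. The evaluation log below is synthetic and for illustration only; real values would come from the periodic rollouts described above.

```python
# Each entry: (iteration, train_loss, eval_success_rate) -- synthetic values.
eval_log = [
    (1000, 0.62, 0.10),
    (2000, 0.41, 0.35),
    (3000, 0.28, 0.58),
    (4000, 0.21, 0.52),  # loss is still falling, but success rate peaked earlier
]

best_iter, best_success = None, -1.0
for iteration, train_loss, success_rate in eval_log:
    if success_rate > best_success:
        best_iter, best_success = iteration, success_rate
        # in a real run, save the model weights here as the "best" checkpoint

print(best_iter, best_success)  # 3000, not the lowest-loss iteration (4000)
```

Note how the selected checkpoint (iteration 3000) differs from the lowest-loss one (iteration 4000), exactly the distribution-shift effect the text describes.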

Usage

Policy evaluation is used during and after training to:

  • Monitor learning progress and detect training issues (overfitting, underfitting, distribution shift).
  • Select the best model checkpoint for deployment based on task-completion metrics.
  • Compare different algorithms, hyperparameters, or demonstration datasets on the same evaluation protocol.
  • Generate evaluation videos for qualitative analysis of policy behavior.
  • Validate that a trained policy meets performance thresholds before real-world transfer.
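For the video-generation use case, frames can be collected from the environment's renderer during a rollout and written to disk afterwards. This sketch uses a stub `render()` that fabricates RGB frames so it is self-contained; the commented-out save line assumes the third-party `imageio` package is available.

```python
import numpy as np

class StubEnv:
    """Stand-in for an env whose render() returns an RGB array (H x W x 3)."""
    def render(self):
        return np.zeros((64, 64, 3), dtype=np.uint8)

env = StubEnv()
frames = [env.render() for _ in range(30)]  # one frame per policy step

# Writing the video is optional tooling, e.g. with imageio (if installed):
# import imageio; imageio.mimsave("eval_episode.mp4", frames, fps=30)
print(len(frames), frames[0].shape)
```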

Theoretical Basis

Closed-Loop Evaluation is essential for imitation learning because the training objective (minimizing single-step action prediction error) does not account for the compounding effect of prediction errors over time. A policy with low training loss can still fail at the task if small errors accumulate and drive the system into states not represented in the training data (distribution shift). Only by rolling out the policy in the actual environment can its true task-completion ability be measured.
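A toy numerical illustration (not from ManiSkill) makes the compounding concrete: under integrator dynamics x_{t+1} = x_t + a_t, a policy whose per-step action error is a small zero-mean perturbation accumulates state error that grows with rollout length, even though the step-level error never changes.

```python
import numpy as np

rng = np.random.default_rng(0)
per_step_err = 0.05                 # constant step-level action error (RMSE)
T, n_rollouts = 100, 1000

# Per-step action errors for many rollouts; state error is their running sum
# under x_{t+1} = x_t + a_t dynamics.
eps = rng.normal(0.0, per_step_err, size=(n_rollouts, T))
state_err = np.cumsum(eps, axis=1)

# RMSE of the state deviation at each timestep, averaged over rollouts.
rollout_rmse = np.sqrt((state_err ** 2).mean(axis=0))
print(rollout_rmse[0], rollout_rmse[-1])  # ~0.05 at t=1, ~0.5 at t=100
```

The step-level error is 0.05 throughout, yet the episode-level state deviation grows roughly as sqrt(t), so it is about ten times larger by t = 100. Only a closed-loop rollout exposes this gap.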

Key considerations for IL policy evaluation:

  • Episode-Level Metrics vs. Step-Level Metrics: While training optimizes a step-level metric (per-step MSE loss), evaluation measures episode-level outcomes (success/failure). The relationship between these two metrics is non-linear and task-dependent.
  • Stochasticity in Evaluation: Even with deterministic policies, evaluation outcomes can vary due to randomized initial conditions (different seeds, object poses, robot configurations). Running sufficient evaluation episodes (typically 100 in ManiSkill) is important for statistically meaningful success rate estimates.
  • Parallel Evaluation: Running multiple environments in parallel accelerates evaluation. ManiSkill supports both CPU-based parallel environments (via vectorized wrappers) and GPU-based parallel simulation. When all environments run synchronized episodes (no partial resets), truncation occurs simultaneously across all environments, ensuring fair evaluation.
  • Temporal Action Execution: For a diffusion policy, evaluation executes a sequence of actions (the action horizon) from each denoising pass before re-planning with new observations. This differs from BC evaluation, where a single action is executed per observation.
  • EMA Model Evaluation: Diffusion policy evaluation uses the Exponential Moving Average (EMA) copy of the model weights rather than the raw training weights, as EMA typically provides more stable and higher-performing behavior.
  • Checkpoint Selection: Best checkpoints are saved independently for different metrics (e.g., best_eval_success_once and best_eval_success_at_end), since these metrics may peak at different training iterations.
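On the stochasticity point above: with n evaluation episodes and observed success rate p, the binomial standard error sqrt(p(1-p)/n) bounds how precisely the true success rate is estimated. The helper name below is illustrative; the numbers show why on the order of 100 episodes is a reasonable choice.

```python
import math

def success_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of a success-rate estimate from n episodes."""
    return math.sqrt(p * (1 - p) / n)

for n in (10, 100, 1000):
    half_width = 1.96 * success_rate_stderr(0.5, n)  # 95% CI half-width, worst case p=0.5
    print(f"n={n:4d}  success_rate = 0.50 +/- {half_width:.3f}")
```

At n = 100 the 95% confidence interval is roughly +/- 10 percentage points in the worst case (p = 0.5), which is often adequate for ranking checkpoints; at n = 10 it is far too wide to distinguish similar policies.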
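The temporal action execution bullet can be sketched as a receding-horizon loop. The denoising pass is stubbed out and all names (`denoise_action_chunk`, horizon values) are illustrative; the point is the structure: predict a chunk of `pred_horizon` actions, execute only the first `act_horizon` of them, then re-plan from a fresh observation.

```python
import numpy as np

pred_horizon, act_horizon, episode_len = 16, 8, 32

def denoise_action_chunk(obs):
    # Stand-in for a diffusion policy's iterative denoising, which would
    # typically run on the EMA copy of the model weights.
    return np.zeros((pred_horizon, 7))  # e.g. a 7-DoF arm action per step

obs, executed, replans = np.zeros(10), 0, 0
while executed < episode_len:
    chunk = denoise_action_chunk(obs)
    replans += 1
    for action in chunk[:act_horizon]:   # execute only the action horizon
        obs = obs                        # placeholder for env.step(action)
        executed += 1
        if executed >= episode_len:
            break

print(replans)  # 32 steps / 8 executed per chunk = 4 planning passes
```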
