
Implementation:Haosulab ManiSkill IL Eval Loop

From Leeroopedia
| Field | Value |
|---|---|
| Source Repository | haosulab/ManiSkill |
| Type | Pattern Doc |
| Domains | Imitation_Learning, Robotics, Evaluation, Machine_Learning |
| Last Updated | 2026-02-15 |

Overview

Description

The IL Evaluation Loop is the concrete pattern used in ManiSkill's imitation learning baselines for evaluating trained policies on simulation environments. Two variants exist -- one for Behavioral Cloning (BC) and one for Diffusion Policy -- each tailored to the respective policy's action generation mechanism. Both follow the same high-level structure: create vectorized evaluation environments, reset all environments, roll out the policy for complete episodes, collect per-episode metrics from the final_info dictionary, and aggregate results.

The BC evaluation function (behavior_cloning/evaluate.py) takes a callable sample_fn that maps observations to single actions, and steps each environment one action at a time until all episodes truncate.

The Diffusion Policy evaluation function (diffusion_policy/evaluate.py) takes an Agent object, calls agent.get_action(obs) to produce an action sequence of length act_horizon, executes all actions in the sequence sequentially, and then re-plans with fresh observations. It uses the EMA-smoothed model weights for more stable evaluation.

Both evaluation loops collect metrics from the info["final_info"]["episode"] dictionary that ManiSkill environments populate upon episode termination. These metrics typically include success_once, success_at_end, episode_length, and return.
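The shared structure described above can be sketched as follows. This is a minimal, self-contained illustration, not ManiSkill code: `MockVecEnv` and `evaluate_sketch` are hypothetical stand-ins, and the mock environment's observation shape, episode length, and metric values are arbitrary. The real environments follow the Gymnasium vector API and populate `info["final_info"]["episode"]` on episode end.

```python
import numpy as np

class MockVecEnv:
    """Hypothetical stand-in for a ManiSkill vectorized environment (illustration only)."""
    def __init__(self, num_envs=4, max_steps=8):
        self.num_envs, self.max_steps = num_envs, max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros((self.num_envs, 3)), {}

    def step(self, actions):
        self.t += 1
        obs = np.zeros((self.num_envs, 3))
        truncated = np.full(self.num_envs, self.t >= self.max_steps)
        info = {}
        if truncated.all():
            # ManiSkill populates per-episode metrics when episodes end.
            info["final_info"] = {"episode": {
                "success_once": np.random.rand(self.num_envs) > 0.5,
                "return": np.random.rand(self.num_envs),
            }}
        return obs, np.zeros(self.num_envs), np.zeros(self.num_envs, bool), truncated, info

def evaluate_sketch(n, sample_fn, eval_envs):
    """Roll out complete episodes until at least n have been collected."""
    metrics = {}
    obs, _ = eval_envs.reset()
    eps_done = 0
    while eps_done < n:
        obs, _, _, truncated, info = eval_envs.step(sample_fn(obs))
        if truncated.all():
            # Accumulate per-episode metrics from final_info, then start fresh episodes.
            for k, v in info["final_info"]["episode"].items():
                metrics.setdefault(k, []).append(v)
            eps_done += eval_envs.num_envs
            obs, _ = eval_envs.reset()
    return {k: np.concatenate(v) for k, v in metrics.items()}
```

Because episodes complete in parallel batches of `num_envs`, the returned arrays may hold slightly more than `n` entries, matching the "at least n episodes" contract of the real functions.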

Usage

The evaluation loop is called periodically during training (controlled by eval_freq) and at the end of training. It is also used standalone for evaluating saved checkpoints. The evaluation results drive checkpoint selection (saving the model with the best success_once or success_at_end rate).

Code Reference

Source Location

| Script | File | Lines |
|---|---|---|
| BC Evaluate | examples/baselines/bc/behavior_cloning/evaluate.py | L6-38 |
| Diffusion Policy Evaluate | examples/baselines/diffusion_policy/diffusion_policy/evaluate.py | L7-40 |
| BC Training (eval call site) | examples/baselines/bc/bc.py | L334-361 |
| Diffusion Training (eval call site) | examples/baselines/diffusion_policy/train.py | L360-382 |

Signature

BC evaluate function:

def evaluate(n: int, sample_fn: Callable, eval_envs):
    """
    Evaluate the agent on the evaluation environments for at least n episodes.

    Args:
        n: The minimum number of episodes to evaluate.
        sample_fn: The function to call to sample actions from the agent
                   by passing in the observations.
        eval_envs: The evaluation environments (vectorized).

    Returns:
        A dictionary containing the evaluation results.
    """

Diffusion Policy evaluate function:

def evaluate(n: int, agent, eval_envs, device, sim_backend: str, progress_bar: bool = True):
    """
    Evaluate the diffusion policy agent for at least n episodes.

    Args:
        n: Minimum number of episodes to evaluate.
        agent: The Agent module (with get_action method).
        eval_envs: Vectorized evaluation environments.
        device: Torch device for tensor operations.
        sim_backend: Simulation backend string ('physx_cpu' or 'physx_gpu').
        progress_bar: Whether to display a progress bar.

    Returns:
        A dictionary containing the evaluation results.
    """

Key parameters (from training Args):

| Parameter | BC Default | Diffusion Default | Description |
|---|---|---|---|
| num_eval_episodes | 100 | 100 | Minimum number of episodes to evaluate per evaluation call |
| num_eval_envs | 10 | 10 | Number of parallel environments for evaluation |
| eval_freq | 1000 | 5000 | Evaluate every N training iterations |
| sim_backend | "cpu" | "physx_cpu" | Simulation backend for evaluation environments |
| capture_video | True | True | Whether to record evaluation videos via RecordEpisode |

Import

BC evaluation:

from behavior_cloning.evaluate import evaluate

Diffusion Policy evaluation:

from diffusion_policy.evaluate import evaluate

I/O Contract

Inputs:

| Input | Type | Description |
|---|---|---|
| n | int | Minimum number of episodes to evaluate. Actual count may be slightly higher due to parallel environment batching. |
| sample_fn (BC) | Callable | Function mapping observations (ndarray or tensor) to actions. |
| agent (Diffusion) | nn.Module | Agent with get_action(obs_seq) method returning action sequences of shape (B, act_horizon, act_dim). |
| eval_envs | VectorEnv | Vectorized ManiSkill environments, optionally wrapped with RecordEpisode for video capture. |

Outputs:

| Output | Type | Description |
|---|---|---|
| eval_metrics | dict[str, ndarray] | Dictionary of evaluation metrics. Keys typically include success_once, success_at_end, episode_length, return. Values are arrays of per-episode results. |

Checkpoint saving logic (from training scripts):

Checkpoints are saved when evaluation metrics improve:

save_on_best_metrics = ["success_once", "success_at_end"]
for k in save_on_best_metrics:
    if k in eval_metrics and eval_metrics[k] > best_eval_metrics[k]:
        best_eval_metrics[k] = eval_metrics[k]
        save_ckpt(run_name, f"best_eval_{k}")

Usage Examples

Example 1: BC evaluation during training

# Inside bc.py training loop
if iteration % args.eval_freq == 0:
    actor.eval()

    def sample_fn(obs):
        if isinstance(obs, np.ndarray):
            obs = torch.from_numpy(obs).float().to(device)
        action = actor(obs)
        if args.sim_backend == "cpu":
            action = action.cpu().numpy()
        return action

    with torch.no_grad():
        eval_metrics = evaluate(args.num_eval_episodes, sample_fn, envs)
    actor.train()

    for k in eval_metrics.keys():
        eval_metrics[k] = np.mean(eval_metrics[k])
        print(f"{k}: {eval_metrics[k]:.4f}")

Example 2: Diffusion Policy evaluation with EMA weights

# Inside train.py evaluation function
ema.copy_to(ema_agent.parameters())
eval_metrics = evaluate(
    args.num_eval_episodes,
    ema_agent,
    envs,
    device,
    args.sim_backend,
)
for k in eval_metrics.keys():
    eval_metrics[k] = np.mean(eval_metrics[k])
    writer.add_scalar(f"eval/{k}", eval_metrics[k], iteration)

Example 3: Diffusion Policy action execution pattern

# Inside diffusion_policy/evaluate.py
obs = common.to_tensor(obs, device)
action_seq = agent.get_action(obs)  # (B, act_horizon, act_dim)
if sim_backend == "physx_cpu":
    action_seq = action_seq.cpu().numpy()
for i in range(action_seq.shape[1]):
    obs, rew, terminated, truncated, info = eval_envs.step(action_seq[:, i])
    if truncated.any():
        break

Example 4: Creating evaluation environments with video recording

from behavior_cloning.make_env import make_eval_envs

env_kwargs = dict(
    control_mode="pd_joint_delta_pos",
    reward_mode="sparse",
    obs_mode="state",
    render_mode="rgb_array",
)
envs = make_eval_envs(
    "PickCube-v1",
    num_eval_envs=10,
    sim_backend="cpu",
    env_kwargs=env_kwargs,
    video_dir="runs/eval_videos",
)
