
Implementation:Haosulab ManiSkill IL Eval Loop

From Leeroopedia
| Field | Value |
|---|---|
| Source Repository | haosulab/ManiSkill |
| Type | Pattern Doc |
| Domains | Imitation_Learning, Robotics, Evaluation, Machine_Learning |
| Last Updated | 2026-02-15 |

Overview

Description

The IL Evaluation Loop is the concrete pattern used in ManiSkill's imitation learning baselines for evaluating trained policies on simulation environments. Two variants exist -- one for Behavioral Cloning (BC) and one for Diffusion Policy -- each tailored to the respective policy's action generation mechanism. Both follow the same high-level structure: create vectorized evaluation environments, reset all environments, roll out the policy for complete episodes, collect per-episode metrics from the final_info dictionary, and aggregate results.

The BC evaluation function (behavior_cloning/evaluate.py) takes a callable sample_fn that maps observations to single actions, and steps each environment one action at a time until all episodes truncate.

The Diffusion Policy evaluation function (diffusion_policy/evaluate.py) takes an Agent object, calls agent.get_action(obs) to produce an action sequence of length act_horizon, executes all actions in the sequence sequentially, and then re-plans with fresh observations. It uses the EMA-smoothed model weights for more stable evaluation.

Both evaluation loops collect metrics from the info["final_info"]["episode"] dictionary that ManiSkill environments populate upon episode termination. These metrics typically include success_once, success_at_end, episode_length, and return.
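The shared structure described above can be sketched as follows. This is a minimal, self-contained illustration, not ManiSkill code: `MockVecEnv` and `evaluate_sketch` are hypothetical stand-ins, and the mock environment's observation shape, episode length, and metric values are arbitrary. The real environments follow the Gymnasium vector API and populate `info["final_info"]["episode"]` on episode end.

```python
import numpy as np

class MockVecEnv:
    """Hypothetical stand-in for a ManiSkill vectorized environment (illustration only)."""
    def __init__(self, num_envs=4, max_steps=8):
        self.num_envs, self.max_steps = num_envs, max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros((self.num_envs, 3)), {}

    def step(self, actions):
        self.t += 1
        obs = np.zeros((self.num_envs, 3))
        truncated = np.full(self.num_envs, self.t >= self.max_steps)
        info = {}
        if truncated.all():
            # ManiSkill populates per-episode metrics when episodes end.
            info["final_info"] = {"episode": {
                "success_once": np.random.rand(self.num_envs) > 0.5,
                "return": np.random.rand(self.num_envs),
            }}
        return obs, np.zeros(self.num_envs), np.zeros(self.num_envs, bool), truncated, info

def evaluate_sketch(n, sample_fn, eval_envs):
    """Roll out complete episodes until at least n have been collected."""
    metrics = {}
    obs, _ = eval_envs.reset()
    eps_done = 0
    while eps_done < n:
        obs, _, _, truncated, info = eval_envs.step(sample_fn(obs))
        if truncated.all():
            # Accumulate per-episode metrics from final_info, then start fresh episodes.
            for k, v in info["final_info"]["episode"].items():
                metrics.setdefault(k, []).append(v)
            eps_done += eval_envs.num_envs
            obs, _ = eval_envs.reset()
    return {k: np.concatenate(v) for k, v in metrics.items()}
```

Because episodes complete in parallel batches of `num_envs`, the returned arrays may hold slightly more than `n` entries, matching the "at least n episodes" contract of the real functions.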

Usage

The evaluation loop is called periodically during training (controlled by eval_freq) and at the end of training. It is also used standalone for evaluating saved checkpoints. The evaluation results drive checkpoint selection (saving the model with the best success_once or success_at_end rate).

Code Reference

Source Location

| Script | File | Lines |
|---|---|---|
| BC Evaluate | examples/baselines/bc/behavior_cloning/evaluate.py | L6-38 |
| Diffusion Policy Evaluate | examples/baselines/diffusion_policy/diffusion_policy/evaluate.py | L7-40 |
| BC Training (eval call site) | examples/baselines/bc/bc.py | L334-361 |
| Diffusion Training (eval call site) | examples/baselines/diffusion_policy/train.py | L360-382 |

Signature

BC evaluate function:

def evaluate(n: int, sample_fn: Callable, eval_envs):
    """
    Evaluate the agent on the evaluation environments for at least n episodes.

    Args:
        n: The minimum number of episodes to evaluate.
        sample_fn: The function to call to sample actions from the agent
                   by passing in the observations.
        eval_envs: The evaluation environments (vectorized).

    Returns:
        A dictionary containing the evaluation results.
    """

Diffusion Policy evaluate function:

def evaluate(n: int, agent, eval_envs, device, sim_backend: str, progress_bar: bool = True):
    """
    Evaluate the diffusion policy agent for at least n episodes.

    Args:
        n: Minimum number of episodes to evaluate.
        agent: The Agent module (with get_action method).
        eval_envs: Vectorized evaluation environments.
        device: Torch device for tensor operations.
        sim_backend: Simulation backend string ('physx_cpu' or 'physx_gpu').
        progress_bar: Whether to display a progress bar.

    Returns:
        A dictionary containing the evaluation results.
    """

Key parameters (from training Args):

| Parameter | BC Default | Diffusion Default | Description |
|---|---|---|---|
| num_eval_episodes | 100 | 100 | Minimum number of episodes to evaluate per evaluation call |
| num_eval_envs | 10 | 10 | Number of parallel environments for evaluation |
| eval_freq | 1000 | 5000 | Evaluate every N training iterations |
| sim_backend | "cpu" | "physx_cpu" | Simulation backend for evaluation environments |
| capture_video | True | True | Whether to record evaluation videos via RecordEpisode |

Import

BC evaluation:

from behavior_cloning.evaluate import evaluate

Diffusion Policy evaluation:

from diffusion_policy.evaluate import evaluate

I/O Contract

Inputs:

| Input | Type | Description |
|---|---|---|
| n | int | Minimum number of episodes to evaluate. Actual count may be slightly higher due to parallel environment batching. |
| sample_fn (BC) | Callable | Function mapping observations (ndarray or tensor) to actions. |
| agent (Diffusion) | nn.Module | Agent with get_action(obs_seq) method returning action sequences of shape (B, act_horizon, act_dim). |
| eval_envs | VectorEnv | Vectorized ManiSkill environments, optionally wrapped with RecordEpisode for video capture. |

Outputs:

| Output | Type | Description |
|---|---|---|
| eval_metrics | dict[str, ndarray] | Dictionary of evaluation metrics. Keys typically include success_once, success_at_end, episode_length, return. Values are arrays of per-episode results. |

Checkpoint saving logic (from training scripts):

Checkpoints are saved when evaluation metrics improve:

save_on_best_metrics = ["success_once", "success_at_end"]
for k in save_on_best_metrics:
    if k in eval_metrics and eval_metrics[k] > best_eval_metrics[k]:
        best_eval_metrics[k] = eval_metrics[k]
        save_ckpt(run_name, f"best_eval_{k}")

Usage Examples

Example 1: BC evaluation during training

# Inside bc.py training loop
if iteration % args.eval_freq == 0:
    actor.eval()

    def sample_fn(obs):
        if isinstance(obs, np.ndarray):
            obs = torch.from_numpy(obs).float().to(device)
        action = actor(obs)
        if args.sim_backend == "cpu":
            action = action.cpu().numpy()
        return action

    with torch.no_grad():
        eval_metrics = evaluate(args.num_eval_episodes, sample_fn, envs)
    actor.train()

    for k in eval_metrics.keys():
        eval_metrics[k] = np.mean(eval_metrics[k])
        print(f"{k}: {eval_metrics[k]:.4f}")

Example 2: Diffusion Policy evaluation with EMA weights

# Inside train.py evaluation function
ema.copy_to(ema_agent.parameters())
eval_metrics = evaluate(
    args.num_eval_episodes,
    ema_agent,
    envs,
    device,
    args.sim_backend,
)
for k in eval_metrics.keys():
    eval_metrics[k] = np.mean(eval_metrics[k])
    writer.add_scalar(f"eval/{k}", eval_metrics[k], iteration)

Example 3: Diffusion Policy action execution pattern

# Inside diffusion_policy/evaluate.py
obs = common.to_tensor(obs, device)
action_seq = agent.get_action(obs)  # (B, act_horizon, act_dim)
if sim_backend == "physx_cpu":
    action_seq = action_seq.cpu().numpy()
for i in range(action_seq.shape[1]):
    obs, rew, terminated, truncated, info = eval_envs.step(action_seq[:, i])
    if truncated.any():
        break

Example 4: Creating evaluation environments with video recording

from behavior_cloning.make_env import make_eval_envs

env_kwargs = dict(
    control_mode="pd_joint_delta_pos",
    reward_mode="sparse",
    obs_mode="state",
    render_mode="rgb_array",
)
envs = make_eval_envs(
    "PickCube-v1",
    num_eval_envs=10,
    sim_backend="cpu",
    env_kwargs=env_kwargs,
    video_dir="runs/eval_videos",
)
