Implementation:Haosulab ManiSkill IL Eval Loop
| Field | Value |
|---|---|
| Source Repository | haosulab/ManiSkill |
| Type | Pattern Doc |
| Domains | Imitation_Learning, Robotics, Evaluation, Machine_Learning |
| Last Updated | 2026-02-15 |
Overview
Description
The IL Evaluation Loop is the concrete pattern used in ManiSkill's imitation learning baselines for evaluating trained policies on simulation environments. Two variants exist -- one for Behavioral Cloning (BC) and one for Diffusion Policy -- each tailored to the respective policy's action generation mechanism. Both follow the same high-level structure: create vectorized evaluation environments, reset all environments, roll out the policy for complete episodes, collect per-episode metrics from the final_info dictionary, and aggregate results.
The BC evaluation function (behavior_cloning/evaluate.py) takes a callable sample_fn that maps observations to single actions, and steps each environment one action at a time until all episodes truncate.
The Diffusion Policy evaluation function (diffusion_policy/evaluate.py) takes an Agent object, calls agent.get_action(obs) to produce an action sequence of length act_horizon, executes that sequence one action per environment step, and then re-plans from the fresh observations. It evaluates with the EMA-smoothed model weights for more stable results.
Both evaluation loops collect metrics from the info["final_info"]["episode"] dictionary that ManiSkill environments populate upon episode termination. These metrics typically include success_once, success_at_end, episode_length, and return.
Usage
The evaluation loop is called periodically during training (controlled by eval_freq) and at the end of training. It is also used standalone for evaluating saved checkpoints. The evaluation results drive checkpoint selection (saving the model with the best success_once or success_at_end rate).
Code Reference
Source Location
| Script | File | Lines |
|---|---|---|
| BC Evaluate | examples/baselines/bc/behavior_cloning/evaluate.py | L6-38 |
| Diffusion Policy Evaluate | examples/baselines/diffusion_policy/diffusion_policy/evaluate.py | L7-40 |
| BC Training (eval call site) | examples/baselines/bc/bc.py | L334-361 |
| Diffusion Training (eval call site) | examples/baselines/diffusion_policy/train.py | L360-382 |
Signature
BC evaluate function:
```python
def evaluate(n: int, sample_fn: Callable, eval_envs):
    """
    Evaluate the agent on the evaluation environments for at least n episodes.

    Args:
        n: The minimum number of episodes to evaluate.
        sample_fn: The function to call to sample actions from the agent
            by passing in the observations.
        eval_envs: The evaluation environments (vectorized).

    Returns:
        A dictionary containing the evaluation results.
    """
```
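A minimal sketch of how such a loop can be implemented (illustrative only, not the repository code; it assumes ManiSkill's convention that all vectorized sub-environments truncate on the same step and that `info["final_info"]["episode"]` holds per-episode metrics):

```python
from collections import defaultdict

import numpy as np

def evaluate(n, sample_fn, eval_envs):
    """Roll out sample_fn until at least n episodes have finished (sketch)."""
    eval_metrics = defaultdict(list)
    obs, _ = eval_envs.reset()
    episodes_done = 0
    while episodes_done < n:
        obs, _, _, truncated, info = eval_envs.step(sample_fn(obs))
        if truncated.any():
            # All sub-environments truncate together in ManiSkill, so one
            # truncation signal means num_envs episodes just finished.
            episodes_done += len(truncated)
            for k, v in info["final_info"]["episode"].items():
                eval_metrics[k].append(np.asarray(v))
    return eval_metrics
```

Because the loop only checks the episode count after a full batch truncates, the actual number of evaluated episodes can exceed n, matching the "at least n episodes" contract in the docstring above.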
Diffusion Policy evaluate function:
```python
def evaluate(n: int, agent, eval_envs, device, sim_backend: str, progress_bar: bool = True):
    """
    Evaluate the diffusion policy agent for at least n episodes.

    Args:
        n: Minimum number of episodes to evaluate.
        agent: The Agent module (with get_action method).
        eval_envs: Vectorized evaluation environments.
        device: Torch device for tensor operations.
        sim_backend: Simulation backend string ('physx_cpu' or 'physx_gpu').
        progress_bar: Whether to display a progress bar.

    Returns:
        A dictionary containing the evaluation results.
    """
```
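The diffusion variant adds an inner loop that executes the planned act_horizon actions before re-planning. A hedged sketch of that structure (not the repository code; it assumes `agent.get_action` returns a `(B, act_horizon, act_dim)` array and that sub-environments truncate together):

```python
from collections import defaultdict

import numpy as np

def evaluate(n, agent, eval_envs):
    """Receding-horizon evaluation: plan, execute the plan, re-plan (sketch)."""
    eval_metrics = defaultdict(list)
    obs, _ = eval_envs.reset()
    episodes_done = 0
    while episodes_done < n:
        action_seq = agent.get_action(obs)    # (B, act_horizon, act_dim)
        for i in range(action_seq.shape[1]):  # execute the whole plan
            obs, _, _, truncated, info = eval_envs.step(action_seq[:, i])
            if truncated.any():
                break  # episode boundary reached mid-plan
        if truncated.any():
            episodes_done += len(truncated)
            for k, v in info["final_info"]["episode"].items():
                eval_metrics[k].append(np.asarray(v))
    return eval_metrics
```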
Key parameters (from training Args):
| Parameter | BC Default | Diffusion Default | Description |
|---|---|---|---|
| num_eval_episodes | 100 | 100 | Minimum number of episodes to evaluate per evaluation call |
| num_eval_envs | 10 | 10 | Number of parallel environments for evaluation |
| eval_freq | 1000 | 5000 | Evaluate every N training iterations |
| sim_backend | "cpu" | "physx_cpu" | Simulation backend for evaluation environments |
| capture_video | True | True | Whether to record evaluation videos via RecordEpisode |
Import
BC evaluation:
```python
from behavior_cloning.evaluate import evaluate
```
Diffusion Policy evaluation:
```python
from diffusion_policy.evaluate import evaluate
```
I/O Contract
Inputs:
| Input | Type | Description |
|---|---|---|
| n | int | Minimum number of episodes to evaluate. Actual count may be slightly higher due to parallel environment batching. |
| sample_fn (BC) | Callable | Function mapping observations (ndarray or tensor) to actions. |
| agent (Diffusion) | nn.Module | Agent with get_action(obs_seq) method returning action sequences of shape (B, act_horizon, act_dim). |
| eval_envs | VectorEnv | Vectorized ManiSkill environments, optionally wrapped with RecordEpisode for video capture. |
Outputs:
| Output | Type | Description |
|---|---|---|
| eval_metrics | dict[str, ndarray] | Dictionary of evaluation metrics. Keys typically include success_once, success_at_end, episode_length, and return; values are arrays of per-episode results. |
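Callers typically reduce these per-episode arrays to scalar means before logging, as both training scripts do. A minimal sketch (the sample values are hypothetical):

```python
import numpy as np

def aggregate(eval_metrics):
    """Reduce each metric's per-episode values to a scalar mean (sketch)."""
    return {k: float(np.mean(v)) for k, v in eval_metrics.items()}

summary = aggregate({
    "success_once": np.array([1.0, 0.0, 1.0]),  # hypothetical per-episode results
    "episode_length": np.array([50, 48, 50]),
})
```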
Checkpoint saving logic (from training scripts):
Checkpoints are saved when evaluation metrics improve:
```python
save_on_best_metrics = ["success_once", "success_at_end"]
for k in save_on_best_metrics:
    if k in eval_metrics and eval_metrics[k] > best_eval_metrics[k]:
        best_eval_metrics[k] = eval_metrics[k]
        save_ckpt(run_name, f"best_eval_{k}")
```
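A self-contained version of this selection logic, with `best_eval_metrics` initialized as a `defaultdict(float)` so any first non-zero result triggers a save (the `save_ckpt` callback here is a hypothetical stand-in for the training script's checkpoint writer):

```python
from collections import defaultdict

def select_best_checkpoints(eval_metrics, best_eval_metrics, save_ckpt):
    """Save a checkpoint for each tracked metric that improved (sketch)."""
    save_on_best_metrics = ["success_once", "success_at_end"]
    saved = []
    for k in save_on_best_metrics:
        if k in eval_metrics and eval_metrics[k] > best_eval_metrics[k]:
            best_eval_metrics[k] = eval_metrics[k]
            save_ckpt(f"best_eval_{k}")  # hypothetical checkpoint writer
            saved.append(k)
    return saved

best = defaultdict(float)
saved = select_best_checkpoints({"success_once": 0.42}, best, lambda name: None)
```

Tracking each metric independently means a run can keep two "best" checkpoints: one that first succeeds at any point in an episode, and one that is still succeeding at the final step.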
Usage Examples
Example 1: BC evaluation during training
```python
# Inside bc.py training loop
if iteration % args.eval_freq == 0:
    actor.eval()

    def sample_fn(obs):
        if isinstance(obs, np.ndarray):
            obs = torch.from_numpy(obs).float().to(device)
        action = actor(obs)
        if args.sim_backend == "cpu":
            action = action.cpu().numpy()
        return action

    with torch.no_grad():
        eval_metrics = evaluate(args.num_eval_episodes, sample_fn, envs)
    actor.train()

    for k in eval_metrics.keys():
        eval_metrics[k] = np.mean(eval_metrics[k])
        print(f"{k}: {eval_metrics[k]:.4f}")
```
Example 2: Diffusion Policy evaluation with EMA weights
```python
# Inside train.py evaluation function
ema.copy_to(ema_agent.parameters())
eval_metrics = evaluate(
    args.num_eval_episodes,
    ema_agent,
    envs,
    device,
    args.sim_backend,
)
for k in eval_metrics.keys():
    eval_metrics[k] = np.mean(eval_metrics[k])
    writer.add_scalar(f"eval/{k}", eval_metrics[k], iteration)
```
Example 3: Diffusion Policy action execution pattern
```python
# Inside diffusion_policy/evaluate.py
obs = common.to_tensor(obs, device)
action_seq = agent.get_action(obs)  # (B, act_horizon, act_dim)
if sim_backend == "physx_cpu":
    action_seq = action_seq.cpu().numpy()
for i in range(action_seq.shape[1]):
    obs, rew, terminated, truncated, info = eval_envs.step(action_seq[:, i])
    if truncated.any():
        break
```
Example 4: Creating evaluation environments with video recording
```python
from behavior_cloning.make_env import make_eval_envs

env_kwargs = dict(
    control_mode="pd_joint_delta_pos",
    reward_mode="sparse",
    obs_mode="state",
    render_mode="rgb_array",
)
envs = make_eval_envs(
    "PickCube-v1",
    num_eval_envs=10,
    sim_backend="cpu",
    env_kwargs=env_kwargs,
    video_dir="runs/eval_videos",
)
```
Related Pages
- Principle:Haosulab_ManiSkill_IL_Policy_Evaluation -- The principle describing evaluation theory and metrics for imitation learning policies.
- Implementation:Haosulab_ManiSkill_BC_Diffusion_Training -- The training scripts that call these evaluation loops.
- Implementation:Haosulab_ManiSkill_ManiSkillTrajectoryDataset -- The dataset used to train the policies being evaluated.
- Environment:Haosulab_ManiSkill_GPU_CUDA_Simulation