Implementation:Haosulab ManiSkill PPO Eval Loop
| Field | Value |
|---|---|
| implementation_name | Haosulab_ManiSkill_PPO_Eval_Loop |
| overview | Concrete evaluation and checkpointing loop for PPO training in ManiSkill, with deterministic rollouts and metric aggregation |
| type | Pattern Doc |
| domains | Reinforcement_Learning, Robotics |
| last_updated | 2026-02-15 |
| related_pages | Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing |
Overview
Description
The PPO evaluation loop runs periodically during training (controlled by eval_freq) to assess agent performance using deterministic actions on separate evaluation environments. It collects episode metrics (success rate, return, episode length) and saves model checkpoints. The evaluation environments are configured independently, typically with reconfiguration_freq=1 to ensure object/scene randomization across evaluation episodes, and with video recording enabled for qualitative assessment.
This is a Pattern Doc -- it documents the evaluation routine from the PPO example baseline, not a library API.
Usage
Evaluation is triggered every eval_freq iterations within the main training loop. It runs num_eval_steps environment steps across num_eval_envs parallel environments, collecting metrics from all completed episodes. Model checkpoints are saved at the same frequency.
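The trigger condition in the loop is `iteration % args.eval_freq == 1`, so with the default `eval_freq=25` the agent is evaluated at iterations 1, 26, 51, and so on (an evaluation at iteration 1 gives an untrained baseline). A minimal sketch of that cadence:

```python
# Which iterations trigger evaluation, using the loop's modulo check and the
# default eval_freq=25 from this doc (num_iterations=100 is illustrative)
eval_freq = 25
num_iterations = 100
eval_iters = [it for it in range(1, num_iterations + 1) if it % eval_freq == 1]
print(eval_iters)  # [1, 26, 51, 76]
```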
Code Reference
| Field | Value |
|---|---|
| Repository | https://github.com/haosulab/ManiSkill |
| File | examples/baselines/ppo/ppo.py |
| Evaluation loop | Lines 276-299 |
| Eval env setup | Lines 199-213 |
| Final checkpoint save | Lines 463-467 |
Evaluation environment setup:
```python
env_kwargs = dict(obs_mode="state", render_mode="rgb_array", sim_backend="physx_cuda")

# Evaluation environments with reconfiguration for generalization testing
eval_envs = gym.make(
    args.env_id,
    num_envs=args.num_eval_envs,
    reconfiguration_freq=args.eval_reconfiguration_freq,  # default: 1
    **env_kwargs,
)
if isinstance(eval_envs.action_space, gym.spaces.Dict):
    eval_envs = FlattenActionSpaceWrapper(eval_envs)
# Video recording for evaluation episodes
if args.capture_video:
    eval_envs = RecordEpisode(
        eval_envs,
        output_dir=f"runs/{run_name}/videos",
        save_trajectory=False,
        max_steps_per_video=args.num_eval_steps,
        video_fps=30,
    )
eval_envs = ManiSkillVectorEnv(
    eval_envs,
    args.num_eval_envs,
    ignore_terminations=not args.eval_partial_reset,  # default: True (no partial reset)
    record_metrics=True,
)
```
Evaluation loop (runs inside the main training loop):
```python
for iteration in range(1, args.num_iterations + 1):
    agent.eval()
    if iteration % args.eval_freq == 1:
        print("Evaluating")
        eval_obs, _ = eval_envs.reset()
        eval_metrics = defaultdict(list)
        num_episodes = 0
        for _ in range(args.num_eval_steps):
            with torch.no_grad():
                eval_obs, eval_rew, eval_terminations, eval_truncations, eval_infos = \
                    eval_envs.step(agent.get_action(eval_obs, deterministic=True))
                if "final_info" in eval_infos:
                    mask = eval_infos["_final_info"]
                    num_episodes += mask.sum()
                    for k, v in eval_infos["final_info"]["episode"].items():
                        eval_metrics[k].append(v)
        print(f"Evaluated {args.num_eval_steps * args.num_eval_envs} steps "
              f"resulting in {num_episodes} episodes")
        for k, v in eval_metrics.items():
            mean = torch.stack(v).float().mean()
            if logger is not None:
                logger.add_scalar(f"eval/{k}", mean, global_step)
            print(f"eval_{k}_mean={mean}")
    # Save model checkpoint at the same frequency
    if args.save_model and iteration % args.eval_freq == 1:
        model_path = f"runs/{run_name}/ckpt_{iteration}.pt"
        torch.save(agent.state_dict(), model_path)
        print(f"model saved to {model_path}")
```
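The masking-and-aggregation pattern in the loop can be sketched framework-free. This is an illustrative mock (plain lists standing in for the tensors ManiSkill returns, and explicit masking where the baseline averages stacked tensors), not the library API: `_final_info` marks which parallel envs just finished an episode, and per-key metrics are accumulated across steps, then averaged.

```python
from collections import defaultdict

# Mock step infos: each entry mimics one vector-env step across 4 envs;
# `_final_info` marks envs whose episode ended at that step.
step_infos = [
    {"_final_info": [False, True, False, True],
     "final_info": {"episode": {"success_once": [0.0, 1.0, 0.0, 1.0],
                                "return": [0.0, 40.0, 0.0, 50.0]}}},
    {"_final_info": [True, False, False, False],
     "final_info": {"episode": {"success_once": [0.0, 0.0, 0.0, 0.0],
                                "return": [30.0, 0.0, 0.0, 0.0]}}},
]

eval_metrics = defaultdict(list)
num_episodes = 0
for infos in step_infos:
    mask = infos["_final_info"]
    num_episodes += sum(mask)
    # Keep only values from envs whose episode actually ended this step
    for k, v in infos["final_info"]["episode"].items():
        eval_metrics[k].extend(x for x, m in zip(v, mask) if m)

means = {k: sum(v) / len(v) for k, v in eval_metrics.items()}
print(num_episodes)           # 3
print(means["return"])        # 40.0
```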
Final checkpoint save (after training completes):
```python
if not args.evaluate:
    if args.save_model:
        model_path = f"runs/{run_name}/final_ckpt.pt"
        torch.save(agent.state_dict(), model_path)
        print(f"model saved to {model_path}")
    logger.close()
envs.close()
eval_envs.close()
```
I/O Contract
Evaluation configuration parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| eval_freq | int | 25 | Evaluation frequency in training iterations |
| num_eval_envs | int | 8 | Number of parallel evaluation environments |
| num_eval_steps | int | 50 | Number of steps per evaluation run |
| eval_partial_reset | bool | False | Whether eval envs use partial reset (default: disabled) |
| eval_reconfiguration_freq | Optional[int] | 1 | Reconfigure eval env on each reset for object randomization |
| save_model | bool | True | Whether to save model checkpoints |
| capture_video | bool | True | Whether to record evaluation videos |
Evaluation metrics collected:
| Metric Key | Type | Description |
|---|---|---|
| success_once | float (0 or 1) | Whether the task was completed at any point during the episode |
| success_at_end | float (0 or 1) | Whether the task is successful at the final step (only when ignore_terminations=True) |
| return | float | Cumulative undiscounted reward over the episode |
| episode_len | float | Number of steps in the episode |
| reward | float | Average reward per step (return / episode_len) |
| fail_once | float (0 or 1) | Whether a failure condition was triggered (if the task defines failure) |
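One sanity check implied by the table: `reward` is per-step reward, i.e. `return / episode_len`. A quick check using the illustrative numbers from Example 4:

```python
# Per the table, reward = return / episode_len (average reward per step).
# Numbers below are the illustrative means from Example 4, not real results.
episode_return = 45.3
episode_len = 38.2
reward = episode_return / episode_len
print(round(reward, 2))  # 1.19, matching eval/reward_mean in Example 4
```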
Checkpoint output:
| File | Location | Contents |
|---|---|---|
| Periodic checkpoint | runs/{run_name}/ckpt_{iteration}.pt | agent.state_dict() at evaluation time |
| Final checkpoint | runs/{run_name}/final_ckpt.pt | agent.state_dict() after training completes |
| Evaluation videos | runs/{run_name}/videos/ | MP4 video files of evaluation episodes |
Usage Examples
Example 1: Run training with evaluation every 25 iterations
```bash
python examples/baselines/ppo/ppo.py \
    --env_id="PickCube-v1" \
    --num_envs=512 \
    --eval_freq=25 \
    --num_eval_envs=8 \
    --num_eval_steps=50 \
    --save_model=True \
    --capture_video=True
```
Example 2: Load a checkpoint and run evaluation only
```bash
python examples/baselines/ppo/ppo.py \
    --env_id="PickCube-v1" \
    --evaluate=True \
    --checkpoint="runs/PickCube-v1__ppo__1__1234567890/ckpt_101.pt" \
    --num_eval_envs=8 \
    --num_eval_steps=50
```
Example 3: Load checkpoint programmatically
```python
import torch

# Create an agent with the same architecture used during training
agent = Agent(envs).to(device)
# Load saved weights (map_location handles CPU-only machines)
agent.load_state_dict(torch.load(
    "runs/PickCube-v1__ppo__1__1234567890/final_ckpt.pt", map_location=device))
agent.eval()

# Run deterministic evaluation
eval_obs, _ = eval_envs.reset()
for step in range(num_eval_steps):
    with torch.no_grad():
        action = agent.get_action(eval_obs, deterministic=True)
    eval_obs, eval_rew, _, _, eval_info = eval_envs.step(action)
```
Example 4: Interpreting evaluation metrics
```python
# After evaluation completes, metrics are logged:
# eval/success_once_mean=0.85 -> 85% of episodes achieved the goal at least once
# eval/return_mean=45.3       -> average cumulative reward across episodes
# eval/episode_len_mean=38.2  -> average episode length (shorter = faster completion)
# eval/reward_mean=1.19       -> average reward per step

# A good training run shows:
# - success_once increasing over iterations (approaching 1.0)
# - return increasing over iterations
# - episode_len potentially decreasing (faster task completion)
```
Related Pages
- Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing -- The principle this implementation realizes
- Implementation:Haosulab_ManiSkill_PPO_Agent_Network -- The agent whose get_action method is called deterministically
- Implementation:Haosulab_ManiSkill_PPO_Training_Loop -- The training loop within which evaluation is embedded
- Implementation:Haosulab_ManiSkill_ManiSkillVectorEnv -- The vectorized wrapper used for evaluation environments
- Environment:Haosulab_ManiSkill_GPU_CUDA_Simulation