
Implementation:Haosulab ManiSkill PPO Eval Loop

From Leeroopedia
Field Value
implementation_name Haosulab_ManiSkill_PPO_Eval_Loop
overview Concrete evaluation and checkpointing loop for PPO training in ManiSkill, with deterministic rollouts and metric aggregation
type Pattern Doc
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing

Overview

Description

The PPO evaluation loop runs periodically during training (controlled by eval_freq) to assess agent performance using deterministic actions on separate evaluation environments. It collects episode metrics (success rate, return, episode length) and saves model checkpoints. The evaluation environments are configured independently, typically with reconfiguration_freq=1 to ensure object/scene randomization across evaluation episodes, and with video recording enabled for qualitative assessment.

This is a Pattern Doc -- it documents the evaluation routine from the PPO example baseline, not a library API.

Usage

Evaluation is triggered every eval_freq iterations within the main training loop. It runs num_eval_steps environment steps across num_eval_envs parallel environments, collecting metrics from all completed episodes. Model checkpoints are saved at the same frequency.
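The per-evaluation budget follows directly from these settings. A back-of-the-envelope sketch with the default values (the episode cap of 50 steps is an assumption that matches e.g. PickCube-v1, not something fixed by the evaluation loop itself):

```python
# Rough evaluation-budget arithmetic for the default settings.
num_eval_envs = 8
num_eval_steps = 50
max_episode_steps = 50  # assumption: task-specific episode cap (e.g. PickCube-v1)

total_env_steps = num_eval_envs * num_eval_steps        # env steps per evaluation
# With ignore_terminations=True, episodes end only on truncation, so each
# env contributes num_eval_steps // max_episode_steps complete episodes.
episodes_per_env = num_eval_steps // max_episode_steps
total_episodes = num_eval_envs * episodes_per_env
```

With the defaults this works out to 400 environment steps and 8 complete episodes per evaluation, which is why the loop's summary print typically reports "400 steps resulting in 8 episodes".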

Code Reference

Field Value
Repository https://github.com/haosulab/ManiSkill
File examples/baselines/ppo/ppo.py
Evaluation loop Lines 276-299
Eval env setup Lines 199-213
Final checkpoint save Lines 463-467

Evaluation environment setup:

import gymnasium as gym

from mani_skill.utils.wrappers.flatten import FlattenActionSpaceWrapper
from mani_skill.utils.wrappers.record import RecordEpisode
from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv

env_kwargs = dict(obs_mode="state", render_mode="rgb_array", sim_backend="physx_cuda")

# Evaluation environments with reconfiguration for generalization testing
eval_envs = gym.make(
    args.env_id,
    num_envs=args.num_eval_envs,
    reconfiguration_freq=args.eval_reconfiguration_freq,  # default: 1
    **env_kwargs,
)

if isinstance(eval_envs.action_space, gym.spaces.Dict):
    eval_envs = FlattenActionSpaceWrapper(eval_envs)

# Video recording for evaluation episodes
if args.capture_video:
    eval_envs = RecordEpisode(
        eval_envs,
        output_dir=f"runs/{run_name}/videos",
        save_trajectory=False,
        max_steps_per_video=args.num_eval_steps,
        video_fps=30,
    )

eval_envs = ManiSkillVectorEnv(
    eval_envs,
    args.num_eval_envs,
    ignore_terminations=not args.eval_partial_reset,  # default: True (no partial reset)
    record_metrics=True,
)

Evaluation loop (runs inside the main training loop):

from collections import defaultdict

import torch

for iteration in range(1, args.num_iterations + 1):
    agent.eval()
    if iteration % args.eval_freq == 1:
        print("Evaluating")
        eval_obs, _ = eval_envs.reset()
        eval_metrics = defaultdict(list)
        num_episodes = 0
        for _ in range(args.num_eval_steps):
            with torch.no_grad():
                eval_obs, eval_rew, eval_terminations, eval_truncations, eval_infos = \
                    eval_envs.step(agent.get_action(eval_obs, deterministic=True))
                if "final_info" in eval_infos:
                    # _final_info marks which envs finished an episode this
                    # step; with ignore_terminations=True all envs truncate
                    # together, so the whole per-env metric tensor is kept.
                    mask = eval_infos["_final_info"]
                    num_episodes += mask.sum()
                    for k, v in eval_infos["final_info"]["episode"].items():
                        eval_metrics[k].append(v)
        print(f"Evaluated {args.num_eval_steps * args.num_eval_envs} steps "
              f"resulting in {num_episodes} episodes")
        for k, v in eval_metrics.items():
            mean = torch.stack(v).float().mean()
            if logger is not None:
                logger.add_scalar(f"eval/{k}", mean, global_step)
            print(f"eval_{k}_mean={mean}")

    # Save model checkpoint at the same frequency
    if args.save_model and iteration % args.eval_freq == 1:
        model_path = f"runs/{run_name}/ckpt_{iteration}.pt"
        torch.save(agent.state_dict(), model_path)
        print(f"model saved to {model_path}")

Final checkpoint save (after training completes):

if not args.evaluate:
    if args.save_model:
        model_path = f"runs/{run_name}/final_ckpt.pt"
        torch.save(agent.state_dict(), model_path)
        print(f"model saved to {model_path}")
    logger.close()
envs.close()
eval_envs.close()

I/O Contract

Evaluation configuration parameters:

Parameter Type Default Description
eval_freq int 25 Evaluation frequency in training iterations
num_eval_envs int 8 Number of parallel evaluation environments
num_eval_steps int 50 Number of steps per evaluation run
eval_partial_reset bool False Whether eval envs use partial reset (default: disabled)
eval_reconfiguration_freq Optional[int] 1 Reconfigure eval env each reset for object randomization
save_model bool True Whether to save model checkpoints
capture_video bool True Whether to record evaluation videos
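The table above can be mirrored as a small dataclass. This is a sketch for reference only, not the baseline's actual `Args` definition (the real script defines many more fields):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalArgs:
    """Evaluation-related configuration, mirroring the table above."""
    eval_freq: int = 25                # evaluate every N training iterations
    num_eval_envs: int = 8             # parallel evaluation environments
    num_eval_steps: int = 50           # env steps per evaluation run
    eval_partial_reset: bool = False   # partial resets disabled by default
    eval_reconfiguration_freq: Optional[int] = 1  # reconfigure on every reset
    save_model: bool = True
    capture_video: bool = True

args = EvalArgs()
```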

Evaluation metrics collected:

Metric Key Type Description
success_once float (0 or 1) Whether the task was completed at any point during the episode
success_at_end float (0 or 1) Whether the task is successful at the final step (only when ignore_terminations=True)
return float Cumulative undiscounted reward over the episode
episode_len float Number of steps in the episode
reward float Average reward per step (return / episode_len)
fail_once float (0 or 1) Whether a failure condition was triggered (if the task defines failure)
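The aggregation pattern in the loop above reduces per-episode values to a mean per metric key. A minimal stand-in with plain Python lists in place of the torch tensors ManiSkill returns (the episode values below are made up):

```python
from collections import defaultdict

# Two fake completed episodes, shaped like final_info["episode"] entries.
fake_episodes = [
    {"success_once": 1.0, "return": 50.2, "episode_len": 35.0},
    {"success_once": 0.0, "return": 12.1, "episode_len": 50.0},
]

eval_metrics = defaultdict(list)
for ep in fake_episodes:
    for k, v in ep.items():
        eval_metrics[k].append(v)

# One mean per metric key, as logged under eval/{k}.
means = {k: sum(v) / len(v) for k, v in eval_metrics.items()}
```

Here `means["success_once"]` is 0.5, i.e. a 50% success rate across the two episodes.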

Checkpoint output:

File Location Contents
Periodic checkpoint runs/{run_name}/ckpt_{iteration}.pt agent.state_dict() at evaluation time
Final checkpoint runs/{run_name}/final_ckpt.pt agent.state_dict() after training completes
Evaluation videos runs/{run_name}/videos/ MP4 video files of evaluation episodes
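Given this layout, picking the most recent checkpoint from a run directory is a matter of parsing the iteration number out of the filename. A hypothetical helper (not part of the baseline script) that prefers `final_ckpt.pt` when it exists:

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return final_ckpt.pt if present, else the periodic checkpoint
    with the highest iteration number, else None."""
    run = Path(run_dir)
    final = run / "final_ckpt.pt"
    if final.exists():
        return final
    ckpts = []
    for p in run.glob("ckpt_*.pt"):
        m = re.match(r"ckpt_(\d+)\.pt$", p.name)
        if m:
            ckpts.append((int(m.group(1)), p))
    return max(ckpts)[0:2][1] if ckpts else None
```

Sorting numerically on the parsed iteration avoids the lexicographic trap where `ckpt_99.pt` would sort after `ckpt_101.pt`.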

Usage Examples

Example 1: Run training with evaluation every 25 iterations

python examples/baselines/ppo/ppo.py \
    --env_id="PickCube-v1" \
    --num_envs=512 \
    --eval_freq=25 \
    --num_eval_envs=8 \
    --num_eval_steps=50 \
    --save_model=True \
    --capture_video=True

Example 2: Load a checkpoint and run evaluation only

python examples/baselines/ppo/ppo.py \
    --env_id="PickCube-v1" \
    --evaluate=True \
    --checkpoint="runs/PickCube-v1__ppo__1__1234567890/ckpt_101.pt" \
    --num_eval_envs=8 \
    --num_eval_steps=50

Example 3: Load checkpoint programmatically

import torch

# Create agent with same architecture
agent = Agent(envs).to(device)

# Load saved weights (map_location lets a GPU-trained checkpoint load on any device)
state_dict = torch.load(
    "runs/PickCube-v1__ppo__1__1234567890/final_ckpt.pt",
    map_location=device,
)
agent.load_state_dict(state_dict)
agent.eval()

# Run deterministic evaluation
eval_obs, _ = eval_envs.reset()
for step in range(num_eval_steps):
    with torch.no_grad():
        action = agent.get_action(eval_obs, deterministic=True)
    eval_obs, eval_rew, _, _, eval_info = eval_envs.step(action)

Example 4: Interpreting evaluation metrics

# After evaluation completes, metrics are logged:
# eval/success_once_mean=0.85    -> 85% of episodes achieved the goal at least once
# eval/return_mean=45.3          -> average cumulative reward across episodes
# eval/episode_len_mean=38.2     -> average episode length (shorter = faster completion)
# eval/reward_mean=1.19          -> average reward per step

# A good training run shows:
# - success_once increasing over iterations (approaching 1.0)
# - return increasing over iterations
# - episode_len potentially decreasing (faster task completion)
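The "increasing over iterations" checks above can be automated crudely by comparing the early and late halves of a logged series. A hypothetical helper, not part of the baseline (the series below is made up):

```python
def trend_improving(values, min_gain=0.0):
    """True if the mean of the last half of the series exceeds the mean
    of the first half by more than min_gain. Crude trend check for
    eyeballing logged metrics like eval/success_once."""
    half = max(1, len(values) // 2)
    early = sum(values[:half]) / half
    late = sum(values[-half:]) / half
    return late - early > min_gain

success_once = [0.0, 0.1, 0.3, 0.6, 0.8, 0.9]  # fabricated example series
improving = trend_improving(success_once)       # True for this series
```

For episode length, where shorter is better, the same helper applies to the negated series.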

Related Pages

Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing