
Principle:Haosulab ManiSkill RL Evaluation Checkpointing

From Leeroopedia
Field Value
principle_name Haosulab_ManiSkill_RL_Evaluation_Checkpointing
overview Periodic evaluation of RL agent performance and model checkpoint saving during training
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Implementation:Haosulab_ManiSkill_PPO_Eval_Loop

Overview

Description

Evaluation and checkpointing are essential components of the RL training pipeline that serve two purposes: (1) measuring the agent's true performance on the task without exploration noise, and (2) saving model weights at regular intervals so that the best-performing or final policy can be recovered.

Evaluation: During training, the agent's policy is periodically evaluated on a separate set of evaluation environments. The key differences from training rollouts are:

  • Deterministic actions: The policy outputs the mean of its action distribution (no sampling from the Gaussian), providing a cleaner measure of learned behavior without exploration noise
  • Separate environments: Evaluation uses independently configured environments to avoid interference with training state. These may use different reconfiguration frequencies (e.g., reconfiguration_freq=1) to test generalization across different object configurations
  • No gradient computation: Evaluation runs entirely under torch.no_grad() for efficiency
  • Metric aggregation: Success rates, returns, and episode lengths are aggregated across all evaluation episodes
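The points above can be sketched with a minimal Gaussian policy head. Names such as `Actor`, `mean_net`, and `get_action` are illustrative, not ManiSkill's exact API; the pattern (return the distribution mean when deterministic, sample otherwise, and wrap evaluation in `torch.no_grad()`) is the general one:

```python
import torch
import torch.nn as nn

# Hypothetical minimal actor with a Gaussian policy head: a mean network
# plus a learned, state-independent log standard deviation.
class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def get_action(self, obs, deterministic=False):
        mean = self.mean_net(obs)
        if deterministic:
            return mean  # evaluation: no exploration noise
        std = self.log_std.exp()
        return mean + std * torch.randn_like(mean)  # training: sample

actor = Actor(obs_dim=4, act_dim=2)
obs = torch.randn(8, 4)  # one observation per evaluation environment
with torch.no_grad():    # evaluation needs no gradient computation
    eval_actions = actor.get_action(obs, deterministic=True)
```

Calling `get_action(..., deterministic=True)` twice on the same observations returns identical actions, which is what makes evaluation metrics reproducible for a fixed policy.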

Checkpointing: Model weights (the state_dict of the agent's neural network) are saved to disk at regular intervals. This enables:

  • Recovering from training interruptions
  • Selecting the best model based on evaluation metrics
  • Deploying trained policies for inference or further fine-tuning
  • Analyzing how the policy evolves over the course of training
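A minimal checkpointing sketch covering the first two uses above (recovery and model selection). The file-naming scheme and `save_freq` value are assumptions, not ManiSkill's exact convention; only `state_dict` saving and loading follow standard PyTorch practice:

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in for the agent's network; in practice this is the PPO actor-critic.
agent = nn.Linear(4, 2)
ckpt_dir = tempfile.mkdtemp()
save_freq = 25  # illustrative: checkpoint every 25 iterations

for iteration in range(1, 101):
    # ... one training iteration would run here ...
    if iteration % save_freq == 0:
        path = os.path.join(ckpt_dir, f"ckpt_{iteration}.pt")
        torch.save(agent.state_dict(), path)  # weights only, not the module

# Recover a policy later (e.g., after an interruption or for deployment)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(os.path.join(ckpt_dir, "ckpt_100.pt")))
```

Saving the `state_dict` rather than the module object keeps checkpoints portable across code refactors, since loading only requires a model with matching parameter names and shapes.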

Key evaluation metrics in robotics manipulation tasks:

  • success_once: Whether the task was completed at any point during the episode (the agent achieved the goal at least once)
  • success_at_end: Whether the task is in a successful state at the final timestep (relevant for tasks where the agent must maintain the goal)
  • return: Cumulative reward over the episode
  • episode_length: Number of steps in the episode (shorter can indicate faster task completion if the task terminates on success)
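The distinction between `success_once` and `success_at_end` can be made concrete with a small aggregation example. Here `success` is a hypothetical per-step success flag of shape `(num_envs, episode_len)`, as an environment might report it:

```python
import torch

# Per-step success flags for three evaluation episodes (rows = envs).
success = torch.tensor([
    [0, 0, 1, 1],   # env 0: succeeds and holds the goal to the end
    [0, 1, 0, 0],   # env 1: succeeds once, then drops the goal
    [0, 0, 0, 0],   # env 2: never succeeds
], dtype=torch.bool)

success_once = success.any(dim=1)   # succeeded at any timestep
success_at_end = success[:, -1]     # in a success state at the final step

once_rate = success_once.float().mean().item()    # 2 of 3 episodes
at_end_rate = success_at_end.float().mean().item()  # 1 of 3 episodes
```

For tasks that terminate on success the two metrics coincide; they diverge on tasks where the goal must be maintained, which is why both are reported.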

Usage

Use evaluation and checkpointing during RL training to:

  • Monitor training progress and detect divergence or plateaus
  • Compare performance across different hyperparameter configurations
  • Save models at regular intervals for later analysis or deployment
  • Generate evaluation videos for qualitative assessment of learned behaviors

Evaluation is triggered at a configurable frequency (e.g., every 25 training iterations). The evaluation frequency should be chosen to balance informativeness against computational overhead -- evaluation consumes GPU time that could otherwise be used for training.
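The frequency gating described above reduces to a modulo check inside the training loop. The loop structure and `eval_freq` value are illustrative:

```python
# Run evaluation every `eval_freq` training iterations; here we only
# record when evaluation would fire rather than running a real rollout.
eval_freq = 25
evaluated_at = []

for iteration in range(1, 101):
    # ... one PPO training iteration would run here ...
    if iteration % eval_freq == 0:
        evaluated_at.append(iteration)  # deterministic evaluation runs here

print(evaluated_at)  # [25, 50, 75, 100]
```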

Theoretical Basis

Deterministic vs. Stochastic Evaluation: During training, the agent uses a stochastic policy (sampling from a Gaussian distribution) to explore the environment. During evaluation, the agent should behave deterministically by using the mean of the policy distribution. This provides a more reliable measure of what the agent has learned versus what it stumbles upon through random exploration. The deterministic policy is obtained by simply returning the actor network's mean output without adding Gaussian noise.

On-Policy Evaluation Bias: In on-policy algorithms like PPO, training metrics (rewards logged during rollout collection) can be misleading because they include exploration noise and may average over episodes collected under slightly different policy versions within an iteration. Separate evaluation with deterministic actions provides an unbiased assessment of the current policy's performance.

Evaluation Environment Configuration: Evaluation environments are typically configured differently from training environments:

Training vs. Evaluation Environment Configuration

Aspect          | Training Environments            | Evaluation Environments
----------------|----------------------------------|------------------------------------------------
Number of envs  | Large (e.g., 512) for throughput | Small (e.g., 8) for efficiency
Reconfiguration | Often disabled (None)            | Enabled (freq=1) for generalization testing
Partial reset   | Enabled for continuous training  | Typically disabled for clean episode boundaries
Recording       | Optional (low frequency)         | Enabled for video generation
Actions         | Stochastic (sampled from policy) | Deterministic (mean of policy)
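The table's contrast can be captured as two sets of environment keyword arguments. The argument names follow ManiSkill-style conventions (`num_envs`, `reconfiguration_freq`) but are shown here as a hedged sketch; the exact constructor signature should be checked against the library version in use:

```python
# Illustrative environment kwargs contrasting training and evaluation.
train_env_kwargs = dict(
    num_envs=512,               # large batch for rollout throughput
    reconfiguration_freq=None,  # keep object layouts fixed during training
)
eval_env_kwargs = dict(
    num_envs=8,                 # evaluation needs far fewer environments
    reconfiguration_freq=1,     # new object configuration on every reset
)
```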

Checkpoint Strategy: Saving model checkpoints at evaluation frequency ensures that every checkpoint has corresponding evaluation metrics. This enables post-training analysis to identify the best checkpoint (highest success rate) rather than always using the final checkpoint, which may not be optimal due to overfitting or instability in late training.
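Because every checkpoint has matching evaluation metrics, best-checkpoint selection reduces to an argmax over the evaluation log. The success rates and file names below are hypothetical:

```python
# Hypothetical evaluation log: checkpoint file -> evaluation success rate.
eval_log = {
    "ckpt_25.pt": 0.41,
    "ckpt_50.pt": 0.78,
    "ckpt_75.pt": 0.85,
    "ckpt_100.pt": 0.80,  # final checkpoint is not the best in this run
}

# Pick the checkpoint with the highest evaluation success rate.
best_ckpt = max(eval_log, key=eval_log.get)
print(best_ckpt)  # ckpt_75.pt
```

This is exactly the situation the paragraph above warns about: late-training instability means the final checkpoint can underperform an earlier one.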

Video Recording for Qualitative Assessment: Quantitative metrics (success rate, return) do not fully capture the quality of learned behaviors. Video recordings of evaluation episodes allow researchers to visually inspect:

  • Whether the agent exhibits smooth, purposeful motions
  • Common failure modes (e.g., grasp failures, collisions with the scene)
  • Whether success metrics accurately reflect task completion

Related Pages

  • Implementation:Haosulab_ManiSkill_PPO_Eval_Loop