
Principle:Haosulab ManiSkill RL Evaluation Checkpointing

From Leeroopedia
Field Value
principle_name Haosulab_ManiSkill_RL_Evaluation_Checkpointing
overview Periodic evaluation of RL agent performance and model checkpoint saving during training
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Implementation:Haosulab_ManiSkill_PPO_Eval_Loop

Overview

Description

Evaluation and checkpointing are essential components of the RL training pipeline that serve two purposes: (1) measuring the agent's true performance on the task without exploration noise, and (2) saving model weights at regular intervals so that the best-performing or final policy can be recovered.

Evaluation: During training, the agent's policy is periodically evaluated on a separate set of evaluation environments. The key differences from training rollouts are:

  • Deterministic actions: The policy outputs the mean of its action distribution (no sampling from the Gaussian), providing a cleaner measure of learned behavior without exploration noise
  • Separate environments: Evaluation uses independently configured environments to avoid interference with training state. These may use different reconfiguration frequencies (e.g., reconfiguration_freq=1) to test generalization across different object configurations
  • No gradient computation: Evaluation runs entirely under torch.no_grad() for efficiency
  • Metric aggregation: Success rates, returns, and episode lengths are aggregated across all evaluation episodes
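The points above can be sketched with a minimal Gaussian policy head. Names such as `Actor`, `mean_net`, and `get_action` are illustrative, not ManiSkill's exact API; the pattern (return the distribution mean when deterministic, sample otherwise, and wrap evaluation in `torch.no_grad()`) is the general one:

```python
import torch
import torch.nn as nn

# Hypothetical minimal actor with a Gaussian policy head: a mean network
# plus a learned, state-independent log standard deviation.
class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def get_action(self, obs, deterministic=False):
        mean = self.mean_net(obs)
        if deterministic:
            return mean  # evaluation: no exploration noise
        std = self.log_std.exp()
        return mean + std * torch.randn_like(mean)  # training: sample

actor = Actor(obs_dim=4, act_dim=2)
obs = torch.randn(8, 4)  # one observation per evaluation environment
with torch.no_grad():    # evaluation needs no gradient computation
    eval_actions = actor.get_action(obs, deterministic=True)
```

Calling `get_action(..., deterministic=True)` twice on the same observations returns identical actions, which is what makes evaluation metrics reproducible for a fixed policy.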

Checkpointing: Model weights (the state_dict of the agent's neural network) are saved to disk at regular intervals. This enables:

  • Recovering from training interruptions
  • Selecting the best model based on evaluation metrics
  • Deploying trained policies for inference or further fine-tuning
  • Analyzing how the policy evolves over the course of training
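A minimal checkpointing sketch covering the first two uses above (recovery and model selection). The file-naming scheme and `save_freq` value are assumptions, not ManiSkill's exact convention; only `state_dict` saving and loading follow standard PyTorch practice:

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in for the agent's network; in practice this is the PPO actor-critic.
agent = nn.Linear(4, 2)
ckpt_dir = tempfile.mkdtemp()
save_freq = 25  # illustrative: checkpoint every 25 iterations

for iteration in range(1, 101):
    # ... one training iteration would run here ...
    if iteration % save_freq == 0:
        path = os.path.join(ckpt_dir, f"ckpt_{iteration}.pt")
        torch.save(agent.state_dict(), path)  # weights only, not the module

# Recover a policy later (e.g., after an interruption or for deployment)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(os.path.join(ckpt_dir, "ckpt_100.pt")))
```

Saving the `state_dict` rather than the module object keeps checkpoints portable across code refactors, since loading only requires a model with matching parameter names and shapes.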

Key evaluation metrics in robotics manipulation tasks:

  • success_once: Whether the task was completed at any point during the episode (the agent achieved the goal at least once)
  • success_at_end: Whether the task is in a successful state at the final timestep (relevant for tasks where the agent must maintain the goal)
  • return: Cumulative reward over the episode
  • episode_length: Number of steps in the episode (shorter can indicate faster task completion if the task terminates on success)
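The distinction between `success_once` and `success_at_end` can be made concrete with a small aggregation example. Here `success` is a hypothetical per-step success flag of shape `(num_envs, episode_len)`, as an environment might report it:

```python
import torch

# Per-step success flags for three evaluation episodes (rows = envs).
success = torch.tensor([
    [0, 0, 1, 1],   # env 0: succeeds and holds the goal to the end
    [0, 1, 0, 0],   # env 1: succeeds once, then drops the goal
    [0, 0, 0, 0],   # env 2: never succeeds
], dtype=torch.bool)

success_once = success.any(dim=1)   # succeeded at any timestep
success_at_end = success[:, -1]     # in a success state at the final step

once_rate = success_once.float().mean().item()    # 2 of 3 episodes
at_end_rate = success_at_end.float().mean().item()  # 1 of 3 episodes
```

For tasks that terminate on success the two metrics coincide; they diverge on tasks where the goal must be maintained, which is why both are reported.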

Usage

Use evaluation and checkpointing during RL training to:

  • Monitor training progress and detect divergence or plateaus
  • Compare performance across different hyperparameter configurations
  • Save models at regular intervals for later analysis or deployment
  • Generate evaluation videos for qualitative assessment of learned behaviors

Evaluation is triggered at a configurable frequency (e.g., every 25 training iterations). The evaluation frequency should be chosen to balance informativeness against computational overhead -- evaluation consumes GPU time that could otherwise be used for training.
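The frequency gating described above reduces to a modulo check inside the training loop. The loop structure and `eval_freq` value are illustrative:

```python
# Run evaluation every `eval_freq` training iterations; here we only
# record when evaluation would fire rather than running a real rollout.
eval_freq = 25
evaluated_at = []

for iteration in range(1, 101):
    # ... one PPO training iteration would run here ...
    if iteration % eval_freq == 0:
        evaluated_at.append(iteration)  # deterministic evaluation runs here

print(evaluated_at)  # [25, 50, 75, 100]
```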

Theoretical Basis

Deterministic vs. Stochastic Evaluation: During training, the agent uses a stochastic policy (sampling from a Gaussian distribution) to explore the environment. During evaluation, the agent should behave deterministically by using the mean of the policy distribution. This provides a more reliable measure of what the agent has learned versus what it stumbles upon through random exploration. The deterministic policy is obtained by simply returning the actor network's mean output without adding Gaussian noise.

On-Policy Evaluation Bias: In on-policy algorithms like PPO, training metrics (rewards logged during rollout collection) can be misleading because they include exploration noise and may average over episodes collected under slightly different policy versions within an iteration. Separate evaluation with deterministic actions provides an unbiased assessment of the current policy's performance.

Evaluation Environment Configuration: Evaluation environments are typically configured differently from training environments:

Training vs. Evaluation Environment Configuration

Aspect          | Training Environments            | Evaluation Environments
----------------|----------------------------------|------------------------------------------------
Number of envs  | Large (e.g., 512) for throughput | Small (e.g., 8) for efficiency
Reconfiguration | Often disabled (None)            | Enabled (freq=1) for generalization testing
Partial reset   | Enabled for continuous training  | Typically disabled for clean episode boundaries
Recording       | Optional (low frequency)         | Enabled for video generation
Actions         | Stochastic (sampled from policy) | Deterministic (mean of policy)
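The table's contrast can be captured as two sets of environment keyword arguments. The argument names follow ManiSkill-style conventions (`num_envs`, `reconfiguration_freq`) but are shown here as a hedged sketch; the exact constructor signature should be checked against the library version in use:

```python
# Illustrative environment kwargs contrasting training and evaluation.
train_env_kwargs = dict(
    num_envs=512,               # large batch for rollout throughput
    reconfiguration_freq=None,  # keep object layouts fixed during training
)
eval_env_kwargs = dict(
    num_envs=8,                 # evaluation needs far fewer environments
    reconfiguration_freq=1,     # new object configuration on every reset
)
```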

Checkpoint Strategy: Saving model checkpoints at evaluation frequency ensures that every checkpoint has corresponding evaluation metrics. This enables post-training analysis to identify the best checkpoint (highest success rate) rather than always using the final checkpoint, which may not be optimal due to overfitting or instability in late training.
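Because every checkpoint has matching evaluation metrics, best-checkpoint selection reduces to an argmax over the evaluation log. The success rates and file names below are hypothetical:

```python
# Hypothetical evaluation log: checkpoint file -> evaluation success rate.
eval_log = {
    "ckpt_25.pt": 0.41,
    "ckpt_50.pt": 0.78,
    "ckpt_75.pt": 0.85,
    "ckpt_100.pt": 0.80,  # final checkpoint is not the best in this run
}

# Pick the checkpoint with the highest evaluation success rate.
best_ckpt = max(eval_log, key=eval_log.get)
print(best_ckpt)  # ckpt_75.pt
```

This is exactly the situation the paragraph above warns about: late-training instability means the final checkpoint can underperform an earlier one.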

Video Recording for Qualitative Assessment: Quantitative metrics (success rate, return) do not fully capture the quality of learned behaviors. Video recordings of evaluation episodes allow researchers to visually inspect:

  • Whether the agent exhibits smooth, purposeful motions
  • Common failure modes (e.g., grasp failures, collisions with the scene)
  • Whether success metrics accurately reflect task completion

Related Pages

  • Implementation:Haosulab_ManiSkill_PPO_Eval_Loop