Workflow: ARISE Initiative Robomimic Trained Policy Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Robot_Learning, Policy_Evaluation, Simulation |
| Last Updated | 2026-02-15 07:30 GMT |
Overview
End-to-end process for loading a trained robomimic policy checkpoint and evaluating it through environment rollouts with optional video recording and dataset generation.
Description
This workflow covers the complete evaluation pipeline for trained robot manipulation policies. Starting from a saved model checkpoint (.pth file), it restores the full policy, including network weights and observation/action normalization statistics. An evaluation environment is reconstructed from the metadata stored in the checkpoint, ensuring consistency with training conditions. The policy is then executed in the environment for multiple rollout episodes, collecting statistics on task success rate, cumulative reward, and episode horizon. Results can be rendered to video or saved as a new HDF5 dataset for downstream analysis.
Usage
Execute this workflow after training is complete and you want to quantitatively evaluate a policy's performance, generate demonstration videos for visualization, or create new datasets from policy rollouts. Common use cases include: benchmarking trained models across different checkpoints, generating videos for papers and presentations, collecting policy rollout data for dataset aggregation (DAgger-style), and comparing algorithm performance across tasks.
Execution Steps
Step 1: Checkpoint Loading
Load the trained model checkpoint from a .pth file. The checkpoint contains the serialized model weights, training configuration, algorithm name, environment metadata, shape metadata, and optionally observation/action normalization statistics. The loading utility deserializes the config, instantiates the correct algorithm class, loads the network weights, and wraps the model in a RolloutPolicy for inference.
Key considerations:
- The checkpoint is self-contained with all metadata needed to reconstruct the environment
- The policy_from_checkpoint utility handles the complete restoration pipeline
- The device (CPU/GPU) can be specified at load time for flexible deployment
- Normalization statistics are automatically applied during inference
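The "self-contained checkpoint" contract above can be sketched as a simple validation step. The key names below mirror what a robomimic checkpoint typically stores, but they are assumptions for illustration; verify against your robomimic version, and note that a real checkpoint comes from `torch.load` on the .pth file.

```python
def validate_checkpoint(ckpt):
    """Check that a loaded checkpoint dict is self-contained for evaluation."""
    required = ["model", "config", "algo_name", "env_metadata", "shape_metadata"]
    missing = [k for k in required if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint missing keys: {missing}")
    # Normalization stats are optional; inference applies them when present.
    has_norm_stats = "obs_normalization_stats" in ckpt
    return ckpt["algo_name"], ckpt["env_metadata"], has_norm_stats

# Stand-in dict with illustrative contents (not a real serialized checkpoint).
ckpt = {
    "model": b"...weights...",
    "config": '{"algo_name": "bc"}',
    "algo_name": "bc",
    "env_metadata": {"env_name": "Lift", "env_kwargs": {}},
    "shape_metadata": {"all_obs_keys": ["robot0_eef_pos"]},
}
algo_name, env_meta, has_norm_stats = validate_checkpoint(ckpt)
print(algo_name, env_meta["env_name"], has_norm_stats)
```

In the real pipeline, this validation is handled internally by the loading utility, which then instantiates the algorithm class named by `algo_name` and wraps it for inference.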
Step 2: Environment Reconstruction
Create the simulation environment from the metadata stored in the checkpoint. The environment metadata includes the environment name, robosuite configuration (robot type, controller settings), and any additional parameters. The environment is configured for evaluation mode with appropriate rendering settings (on-screen, off-screen for video, or headless).
Key considerations:
- The environment name can be overridden at evaluation time to test cross-task transfer
- Image observation support is determined from the shape metadata
- Rendering mode is configured based on whether video output is requested
- The environment wrapper chain is applied consistently with training
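The three rendering modes described above are mutually constrained: on-screen viewing and off-screen video capture are distinct settings, and neither is enabled for headless evaluation. A minimal sketch of that mapping (function and flag names are illustrative, not robomimic's actual API):

```python
def render_settings(onscreen: bool, write_video: bool):
    """Resolve evaluation render mode: on-screen, off-screen (video), or headless."""
    if onscreen and write_video:
        # On-screen viewing and video capture are typically exclusive choices.
        raise ValueError("choose either on-screen rendering or video writing")
    return {
        "render": onscreen,               # interactive viewer
        "render_offscreen": write_video,  # capture frames for video export
    }

# Headless evaluation with video output:
settings = render_settings(onscreen=False, write_video=True)
print(settings)
```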
Step 3: Rollout Execution
Execute the policy in the environment for a specified number of episodes. Each episode begins with an environment reset, then iteratively queries the policy for actions given the current observation, steps the environment, and records the transition. The RolloutPolicy manages recurrent hidden state for RNN-based policies and handles observation normalization transparently. Episodes terminate on task success (if configured), environment done signal, or reaching the maximum horizon.
Key considerations:
- The rollout horizon can be overridden from the command line or defaults to the training config value
- Random seeds can be set for reproducible evaluation
- Camera names control which viewpoints are rendered to video
- Video frames are captured at a configurable skip rate to manage file size
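The rollout loop above can be made concrete with stand-in policy and environment classes so the control flow is runnable; a real evaluation would use robomimic's RolloutPolicy and environment wrapper in their place, and the stub behaviors here (constant action, success after a fixed step count) are purely illustrative.

```python
class StubPolicy:
    def start_episode(self):
        pass  # a RolloutPolicy would reset recurrent hidden state here

    def __call__(self, ob):
        return [0.0]  # constant action stand-in for network inference


class StubEnv:
    def __init__(self, succeed_at=5):
        self.t, self.succeed_at = 0, succeed_at

    def reset(self):
        self.t = 0
        return {"obs": 0.0}

    def step(self, action):
        self.t += 1
        reward = 1.0 if self.t >= self.succeed_at else 0.0
        return {"obs": float(self.t)}, reward, False, {}

    def is_success(self):
        return {"task": self.t >= self.succeed_at}


def run_rollout(policy, env, horizon, terminate_on_success=True):
    """One episode: reset, query policy, step env, stop on success/done/horizon."""
    policy.start_episode()
    ob = env.reset()
    total_reward = 0.0
    for t in range(1, horizon + 1):
        ac = policy(ob=ob)
        ob, r, done, _ = env.step(ac)
        total_reward += r
        success = env.is_success()["task"]
        if done or (terminate_on_success and success):
            break
    return {"Return": total_reward, "Horizon": t, "Success_Rate": float(success)}


stats = run_rollout(StubPolicy(), StubEnv(), horizon=20)
print(stats)
```

With the stub succeeding at step 5, the episode terminates early and the returned stats record the shortened horizon, matching the early-termination behavior described above.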
Step 4: Results Collection and Export
Aggregate statistics across all rollout episodes and optionally export results to video files or HDF5 datasets. Statistics include average return, success rate, number of successful episodes, and average horizon. Videos concatenate multiple camera viewpoints horizontally. HDF5 dataset output stores actions, states, rewards, dones, and optionally high-dimensional observations for each episode, enabling re-use of the rollout data.
Key considerations:
- Video output uses imageio at 20 FPS
- Dataset export excludes high-dimensional observations by default to save space
- Observations can be re-extracted later from saved states using the dataset_states_to_obs script
- Environment serialization metadata is stored in the output dataset for reproducibility
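Aggregating per-episode results into the summary metrics listed above is a straightforward reduction. A sketch, with field names chosen to match the statistics named in this section (the exact keys in robomimic's output may differ):

```python
def aggregate(episode_stats):
    """Reduce per-episode stats to average return, success rate, and horizon."""
    n = len(episode_stats)
    successes = sum(ep["Success_Rate"] for ep in episode_stats)
    return {
        "Average Return": sum(ep["Return"] for ep in episode_stats) / n,
        "Average Horizon": sum(ep["Horizon"] for ep in episode_stats) / n,
        "Success Rate": successes / n,
        "Num Success": int(successes),
    }

# Two hypothetical episodes: one early success, one timeout.
episodes = [
    {"Return": 1.0, "Horizon": 50, "Success_Rate": 1.0},
    {"Return": 0.0, "Horizon": 200, "Success_Rate": 0.0},
]
summary = aggregate(episodes)
print(summary)
```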