Workflow: ARISE Initiative Robomimic Trained Policy Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Robot_Learning, Policy_Evaluation, Simulation |
| Last Updated | 2026-02-15 07:30 GMT |
Overview
End-to-end process for loading a trained robomimic policy checkpoint and evaluating it through environment rollouts with optional video recording and dataset generation.
Description
This workflow covers the complete evaluation pipeline for trained robot manipulation policies. Starting from a saved model checkpoint (.pth file), it restores the full policy, including network weights and observation/action normalization statistics. An evaluation environment is reconstructed from the metadata stored in the checkpoint, ensuring consistency with training conditions. The policy is then executed in the environment for multiple rollout episodes, collecting statistics on task success rate, cumulative reward, and episode horizon. Results can be rendered to video or saved as a new HDF5 dataset for downstream analysis.
Usage
Execute this workflow after training is complete and you want to quantitatively evaluate a policy's performance, generate demonstration videos for visualization, or create new datasets from policy rollouts. Common use cases include: benchmarking trained models across different checkpoints, generating videos for papers and presentations, collecting policy rollout data for dataset aggregation (DAgger-style), and comparing algorithm performance across tasks.
Execution Steps
Step 1: Checkpoint Loading
Load the trained model checkpoint from a .pth file. The checkpoint contains the serialized model weights, training configuration, algorithm name, environment metadata, shape metadata, and optionally observation/action normalization statistics. The loading utility deserializes the config, instantiates the correct algorithm class, loads the network weights, and wraps the model in a RolloutPolicy for inference.
Key considerations:
- The checkpoint is self-contained with all metadata needed to reconstruct the environment
- The policy_from_checkpoint utility handles the complete restoration pipeline
- The device (CPU/GPU) can be specified at load time for flexible deployment
- Normalization statistics are automatically applied during inference
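The "self-contained checkpoint" contract above can be sketched as a simple validation step. The key names below mirror what a robomimic checkpoint typically stores, but they are assumptions for illustration; verify against your robomimic version, and note that a real checkpoint comes from `torch.load` on the .pth file.

```python
def validate_checkpoint(ckpt):
    """Check that a loaded checkpoint dict is self-contained for evaluation."""
    required = ["model", "config", "algo_name", "env_metadata", "shape_metadata"]
    missing = [k for k in required if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint missing keys: {missing}")
    # Normalization stats are optional; inference applies them when present.
    has_norm_stats = "obs_normalization_stats" in ckpt
    return ckpt["algo_name"], ckpt["env_metadata"], has_norm_stats

# Stand-in dict with illustrative contents (not a real serialized checkpoint).
ckpt = {
    "model": b"...weights...",
    "config": '{"algo_name": "bc"}',
    "algo_name": "bc",
    "env_metadata": {"env_name": "Lift", "env_kwargs": {}},
    "shape_metadata": {"all_obs_keys": ["robot0_eef_pos"]},
}
algo_name, env_meta, has_norm_stats = validate_checkpoint(ckpt)
print(algo_name, env_meta["env_name"], has_norm_stats)
```

In the real pipeline, this validation is handled internally by the loading utility, which then instantiates the algorithm class named by `algo_name` and wraps it for inference.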
Step 2: Environment Reconstruction
Create the simulation environment from the metadata stored in the checkpoint. The environment metadata includes the environment name, robosuite configuration (robot type, controller settings), and any additional parameters. The environment is configured for evaluation mode with appropriate rendering settings (on-screen, off-screen for video, or headless).
Key considerations:
- The environment name can be overridden at evaluation time to test cross-task transfer
- Image observation support is determined from the shape metadata
- Rendering mode is configured based on whether video output is requested
- The environment wrapper chain is applied consistently with training
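The three rendering modes described above are mutually constrained: on-screen viewing and off-screen video capture are distinct settings, and neither is enabled for headless evaluation. A minimal sketch of that mapping (function and flag names are illustrative, not robomimic's actual API):

```python
def render_settings(onscreen: bool, write_video: bool):
    """Resolve evaluation render mode: on-screen, off-screen (video), or headless."""
    if onscreen and write_video:
        # On-screen viewing and video capture are typically exclusive choices.
        raise ValueError("choose either on-screen rendering or video writing")
    return {
        "render": onscreen,               # interactive viewer
        "render_offscreen": write_video,  # capture frames for video export
    }

# Headless evaluation with video output:
settings = render_settings(onscreen=False, write_video=True)
print(settings)
```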
Step 3: Rollout Execution
Execute the policy in the environment for a specified number of episodes. Each episode begins with an environment reset, then iteratively queries the policy for actions given the current observation, steps the environment, and records the transition. The RolloutPolicy manages recurrent hidden state for RNN-based policies and handles observation normalization transparently. Episodes terminate on task success (if configured), environment done signal, or reaching the maximum horizon.
Key considerations:
- The rollout horizon can be overridden from the command line or defaults to the training config value
- Random seeds can be set for reproducible evaluation
- Camera names control which viewpoints are rendered to video
- Video frames are captured at a configurable skip rate to manage file size
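The rollout loop above can be made concrete with stand-in policy and environment classes so the control flow is runnable; a real evaluation would use robomimic's RolloutPolicy and environment wrapper in their place, and the stub behaviors here (constant action, success after a fixed step count) are purely illustrative.

```python
class StubPolicy:
    def start_episode(self):
        pass  # a RolloutPolicy would reset recurrent hidden state here

    def __call__(self, ob):
        return [0.0]  # constant action stand-in for network inference


class StubEnv:
    def __init__(self, succeed_at=5):
        self.t, self.succeed_at = 0, succeed_at

    def reset(self):
        self.t = 0
        return {"obs": 0.0}

    def step(self, action):
        self.t += 1
        reward = 1.0 if self.t >= self.succeed_at else 0.0
        return {"obs": float(self.t)}, reward, False, {}

    def is_success(self):
        return {"task": self.t >= self.succeed_at}


def run_rollout(policy, env, horizon, terminate_on_success=True):
    """One episode: reset, query policy, step env, stop on success/done/horizon."""
    policy.start_episode()
    ob = env.reset()
    total_reward = 0.0
    for t in range(1, horizon + 1):
        ac = policy(ob=ob)
        ob, r, done, _ = env.step(ac)
        total_reward += r
        success = env.is_success()["task"]
        if done or (terminate_on_success and success):
            break
    return {"Return": total_reward, "Horizon": t, "Success_Rate": float(success)}


stats = run_rollout(StubPolicy(), StubEnv(), horizon=20)
print(stats)
```

With the stub succeeding at step 5, the episode terminates early and the returned stats record the shortened horizon, matching the early-termination behavior described above.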
Step 4: Results Collection and Export
Aggregate statistics across all rollout episodes and optionally export results to video files or HDF5 datasets. Statistics include average return, success rate, number of successful episodes, and average horizon. Videos concatenate multiple camera viewpoints horizontally. HDF5 dataset output stores actions, states, rewards, dones, and optionally high-dimensional observations for each episode, enabling re-use of the rollout data.
Key considerations:
- Video output uses imageio at 20 FPS
- Dataset export excludes high-dimensional observations by default to save space
- Observations can be re-extracted later from saved states using the dataset_states_to_obs script
- Environment serialization metadata is stored in the output dataset for reproducibility
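Aggregating per-episode results into the summary metrics listed above is a straightforward reduction. A sketch, with field names chosen to match the statistics named in this section (the exact keys in robomimic's output may differ):

```python
def aggregate(episode_stats):
    """Reduce per-episode stats to average return, success rate, and horizon."""
    n = len(episode_stats)
    successes = sum(ep["Success_Rate"] for ep in episode_stats)
    return {
        "Average Return": sum(ep["Return"] for ep in episode_stats) / n,
        "Average Horizon": sum(ep["Horizon"] for ep in episode_stats) / n,
        "Success Rate": successes / n,
        "Num Success": int(successes),
    }

# Two hypothetical episodes: one early success, one timeout.
episodes = [
    {"Return": 1.0, "Horizon": 50, "Success_Rate": 1.0},
    {"Return": 0.0, "Horizon": 200, "Success_Rate": 0.0},
]
summary = aggregate(episodes)
print(summary)
```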