Principle: ARISE Initiative Robomimic Rollout Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Robotics, Evaluation, Simulation |
| Last Updated | 2026-02-15 08:00 GMT |
Overview
An environment rollout evaluation pattern that deploys trained policies in simulation environments to measure task performance metrics such as success rate, total return, and episode length (horizon).
Description
Rollout Evaluation is the primary method for measuring the quality of a trained robot manipulation policy. Unlike supervised learning metrics (e.g., MSE on held-out data), rollout evaluation tests the policy in closed-loop interaction with a simulation environment, which is the ground truth for task success.
During each rollout episode, the policy receives an observation from the environment, computes an action, and the environment advances by one step. This repeats until the maximum horizon is reached, the episode terminates, or a success condition is met. The evaluation collects per-episode statistics (Return, Horizon, Success_Rate) and averages them across multiple episodes per environment.
This principle supports:
- Multi-environment evaluation: Test across different task variants simultaneously
- Video recording: Record rollout videos for qualitative inspection
- Goal-conditioned evaluation: Support for goal-conditioned policies
- Early termination: Optionally stop episodes upon task success
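The multi-environment case can be sketched as a thin loop over task variants. This is an illustrative sketch, not robomimic's actual API: `env_factories` and `evaluate_fn` are hypothetical stand-ins for environment construction and single-environment rollout evaluation.

```python
def evaluate_multi_env(policy, env_factories, evaluate_fn,
                       horizon=400, num_episodes=50):
    """Evaluate one policy across several task variants.

    env_factories: dict mapping env name -> zero-arg callable that builds the env.
    evaluate_fn:   callable (policy, env, horizon, num_episodes) -> stats dict.
    """
    return {
        name: evaluate_fn(policy, factory(), horizon, num_episodes)
        for name, factory in env_factories.items()
    }
```

Keeping environment construction behind factories lets each variant be built fresh per evaluation run, so state from one task cannot leak into another.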
Usage
Use this principle during training (periodic evaluation checkpoints) or after training (final model evaluation). It requires a trained policy wrapped as a RolloutPolicy and one or more simulation environments. In the training workflow, it is called at regular epoch intervals to track learning progress.
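The periodic-evaluation pattern described above can be sketched as follows. `train_fn` and `eval_fn` are hypothetical placeholders for one epoch of policy training and one rollout evaluation pass; they are not robomimic function names.

```python
def train_with_periodic_eval(num_epochs, eval_interval, train_fn, eval_fn):
    """Train for num_epochs, running rollout evaluation every eval_interval epochs."""
    eval_history = {}
    for epoch in range(1, num_epochs + 1):
        train_fn(epoch)                      # one epoch of policy training
        if epoch % eval_interval == 0:
            eval_history[epoch] = eval_fn()  # e.g. averaged rollout stats
    return eval_history
```

The returned history maps evaluation epochs to their rollout statistics, which is enough to track learning progress or select the best checkpoint by Success_Rate.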
Theoretical Basis
Rollout evaluation implements closed-loop policy evaluation in a Markov Decision Process:
# Abstract rollout evaluation (not the real implementation)
def evaluate_policy(policy, env, horizon, num_episodes):
    all_stats = []
    for episode in range(num_episodes):
        obs = env.reset()
        policy.start_episode()
        total_reward = 0.0
        steps = 0
        for t in range(horizon):
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            total_reward += reward
            steps = t + 1
            if done or env.is_success():
                break
        all_stats.append({
            "Return": total_reward,
            "Horizon": steps,
            "Success_Rate": float(env.is_success()),
        })
    # Average each statistic across episodes
    keys = all_stats[0].keys()
    return {k: sum(s[k] for s in all_stats) / num_episodes for k in keys}
The key metric is Success_Rate, which measures the fraction of episodes where the robot successfully completes the manipulation task.
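Concretely, Success_Rate is the mean of the per-episode success flags, for example:

```python
# Per-episode success flags from four hypothetical rollout episodes.
successes = [1.0, 0.0, 1.0, 1.0]
success_rate = sum(successes) / len(successes)  # 0.75
```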