
Principle:ARISE Initiative Robomimic Rollout Evaluation

From Leeroopedia
Knowledge Sources
Domains Robotics, Evaluation, Simulation
Last Updated 2026-02-15 08:00 GMT

Overview

An environment rollout evaluation pattern that deploys trained policies in simulation environments to measure task performance metrics such as success rate, total return, and episode horizon.

Description

Rollout Evaluation is the primary method for measuring the quality of a trained robot manipulation policy. Unlike supervised learning metrics (e.g., MSE on held-out data), rollout evaluation tests the policy in closed-loop interaction with a simulation environment, which is the ground truth for task success.
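One reason offline metrics diverge from closed-loop performance is that small per-step action errors compound over an episode. A toy 1D sketch (hypothetical, not from robomimic) makes this concrete: a constant action bias that looks negligible under offline MSE produces a large cumulative drift in closed loop:

```python
def rollout_drift(action_error, horizon):
    """Closed-loop position drift when each action carries a small constant bias."""
    pos = 0.0
    for _ in range(horizon):
        pos += action_error  # per-step error compounds across the episode
    return pos

# A per-step error that looks tiny offline (squared error of 1e-4 per step)
# still drifts the robot 1.0 units over a 100-step episode.
drift = rollout_drift(action_error=0.01, horizon=100)
```

This is why rollout success, not held-out prediction error, is treated as the ground-truth measure.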

During each rollout episode, the policy receives observations from the environment, computes actions, and the environment advances by one step. This continues until the maximum horizon is reached, the episode terminates, or a success condition is met. The evaluation collects per-episode statistics (Return, Horizon, Success_Rate) and averages them across multiple episodes per environment.

This principle supports:

  • Multi-environment evaluation: Test across different task variants simultaneously
  • Video recording: Record rollout videos for qualitative inspection
  • Goal-conditioned evaluation: Support for goal-conditioned policies
  • Early termination: Optionally stop episodes upon task success
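The features above can be sketched as a minimal multi-environment evaluator with optional early termination. All names here (`DummyEnv`, `evaluate_multi_env`) are hypothetical stand-ins for illustration, not robomimic APIs:

```python
import random

class DummyEnv:
    """Toy stand-in for a simulation task variant (hypothetical)."""
    def __init__(self, name, success_prob):
        self.name = name
        self._success_prob = success_prob
        self._succeeded = False

    def reset(self):
        self._succeeded = False
        return {"obs": 0.0}

    def step(self, action):
        # Succeed stochastically to mimic task completion.
        if random.random() < self._success_prob:
            self._succeeded = True
        return {"obs": 0.0}, float(self._succeeded), self._succeeded, {}

    def is_success(self):
        return self._succeeded

def evaluate_multi_env(policy, envs, horizon, num_episodes, terminate_on_success=True):
    """Evaluate one policy across several task variants; returns stats keyed by env name."""
    stats = {}
    for env in envs:
        successes = 0
        for _ in range(num_episodes):
            obs = env.reset()
            for t in range(horizon):
                obs, reward, done, info = env.step(policy(obs))
                if done or (terminate_on_success and env.is_success()):
                    break  # optional early termination on task success
            successes += int(env.is_success())
        stats[env.name] = {"Success_Rate": successes / num_episodes}
    return stats

random.seed(0)
envs = [DummyEnv("LiftA", 0.9), DummyEnv("LiftB", 0.1)]
results = evaluate_multi_env(lambda obs: 0.0, envs, horizon=20, num_episodes=10)
```

Video recording would hook into the inner loop (rendering a frame per `step`); it is omitted here to keep the sketch self-contained.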

Usage

Use this principle during training (periodic evaluation checkpoints) or after training (final model evaluation). It requires a trained policy wrapped as a RolloutPolicy and one or more simulation environments. In the training workflow, it is called at regular epoch intervals to track learning progress.
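The periodic-evaluation workflow can be sketched as a training loop that invokes an evaluation callback at fixed epoch intervals; the helper names below are illustrative, not robomimic's:

```python
def train_with_periodic_eval(policy, train_step, evaluate, num_epochs, eval_every):
    """Run training epochs, evaluating the policy at a fixed epoch interval."""
    history = {}
    for epoch in range(1, num_epochs + 1):
        train_step(epoch)
        if epoch % eval_every == 0:
            history[epoch] = evaluate(policy)  # e.g., mean Success_Rate per env
    return history

history = train_with_periodic_eval(
    policy=None,
    train_step=lambda epoch: None,
    evaluate=lambda policy: {"Success_Rate": 0.5},
    num_epochs=10,
    eval_every=5,
)
# Evaluations are recorded at epochs 5 and 10.
```

Tracking the returned history across epochs gives the learning-progress curve described above.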

Theoretical Basis

Rollout evaluation implements closed-loop policy evaluation in a Markov Decision Process:

# Abstract rollout evaluation (not a real implementation)
def evaluate_policy(policy, env, horizon, num_episodes):
    all_stats = []
    for episode in range(num_episodes):
        obs = env.reset()
        policy.start_episode()
        total_reward = 0.0
        for t in range(horizon):
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            total_reward += reward
            if done or env.is_success():
                break  # stop on episode end or task success
        all_stats.append({
            "Return": total_reward,
            "Horizon": t + 1,
            "Success_Rate": float(env.is_success())
        })
    # Average each statistic across episodes
    keys = all_stats[0].keys()
    return {k: sum(s[k] for s in all_stats) / num_episodes for k in keys}

The key metric is Success_Rate, which measures the fraction of episodes where the robot successfully completes the manipulation task.

Related Pages

Implemented By

Uses Heuristic
