Principle:Danijar Dreamerv3 Benchmark Visualization
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Visualization, Reinforcement_Learning |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
Methodology for aggregating, normalizing, and visualizing reinforcement learning training curves across multiple tasks, seeds, and benchmark suites to enable systematic performance comparison.
Description
Benchmark Visualization is the practice of collecting per-step performance data from multiple training runs, aggregating across random seeds, optionally normalizing against known baselines (human performance, random scores), and rendering the results as multi-panel plots. In reinforcement learning, agents are evaluated across many tasks simultaneously (e.g., 57 Atari games, 30 DMLab levels), making standardized comparison critical. This principle addresses the challenge of fairly comparing agents that may have different training budgets, score ranges, and variability across seeds.
The key steps are:
- Data collection: Gather per-step metrics (e.g., episode return) from JSONL log files across tasks, methods, and seeds.
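- Time binning: Group training steps into uniform intervals and average the scores that fall within each bin, so that runs logged at different steps share a common x-axis.
- Seed aggregation: Compute the mean and standard deviation across random seeds for each bin, typically drawn as a line with a shaded band.
- Baseline normalization: Optionally rescale scores using known lower (random) and upper (human or expert) bounds, so that the lower bound maps to 0 and the upper bound maps to 1. Scores above the upper bound map to values greater than 1.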
- Aggregate statistics: Compute summary metrics across tasks (mean, median, capped mean) for overall comparison.
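The data collection step can be sketched as follows. This is a minimal illustration, not the DreamerV3 tooling: the function name `load_curve` and the record keys `step` and `episode/score` are assumptions chosen for the example, and the log is simulated with an in-memory buffer instead of a real file.

```python
import io
import json

def load_curve(lines, xkey="step", ykey="episode/score"):
    """Parse JSONL lines into (step, score) pairs sorted by step.

    Records that lack either key are skipped, since a log usually mixes
    several kinds of metrics. The key names are assumptions for this sketch.
    """
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if xkey in record and ykey in record:
            points.append((record[xkey], record[ykey]))
    points.sort(key=lambda p: p[0])
    return points

# Stand-in for an open log file; a real run would pass a file handle instead.
example_log = io.StringIO(
    '{"step": 1000, "episode/score": 12.0}\n'
    '{"step": 2000, "train/loss": 0.5}\n'
    '{"step": 500, "episode/score": 3.0}\n'
)
curve = load_curve(example_log)
```

Running the loader once per (task, method, seed) combination yields the raw curves that the later steps bin and aggregate.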
Usage
Use this principle when evaluating reinforcement learning agents across benchmark suites. It is essential for reproducing the comparison methodology used in DreamerV3 and related papers, where agents must be compared fairly across diverse tasks with different score ranges. Apply this principle whenever you need to generate publication-quality training curves or aggregate performance metrics.
Theoretical Basis
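The core normalization operation for a score s on task t, given baseline bounds, is:
$$ s_{\text{norm}} = \frac{s - s_{\text{random}}}{s_{\text{human}} - s_{\text{random}}} $$
where s_random and s_human are the reference scores (random agent and human performance) for task t. A normalized score of 0 matches the random agent, 1 matches the human reference, and values above 1 are super-human.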
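As a small worked example, the normalization subtracts the random score and divides by the gap between the human and random scores. The helper name `normalize` and the reference scores below are made up for illustration and do not come from any benchmark.

```python
def normalize(score, low, high):
    """Map `low` to 0 and `high` to 1. Values outside the range are not clipped."""
    if high == low:
        raise ValueError("baseline bounds must differ")
    return (score - low) / (high - low)

# Hypothetical reference scores for a single task.
random_score, human_score = 100.0, 1100.0

half = normalize(600.0, random_score, human_score)         # halfway to human level
superhuman = normalize(2100.0, random_score, human_score)  # twice the human gap
```

Here `half` is 0.5 and `superhuman` is 2.0, which shows why an unclipped aggregate can be dominated by a few tasks with very high normalized scores.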
Aggregate statistics:
- Human-normalized mean: Average of normalized scores across all tasks.
- Human-normalized median: Median of normalized scores, more robust to outliers.
- Capped mean: Normalized scores clipped to [0, 1] before averaging, preventing super-human scores on easy tasks from inflating the aggregate.
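The three aggregates can be compared on a toy set of normalized scores. The values below are invented for the example, with one deliberately super-human task to show how each statistic reacts to it.

```python
import numpy as np

# Hypothetical normalized scores for four tasks; the last one is super-human.
normalized = np.array([0.2, 0.5, 0.9, 4.0])

mean_score = normalized.mean()                       # pulled up by the outlier
median_score = np.median(normalized)                 # unaffected by its size
capped_mean = np.clip(normalized, 0.0, 1.0).mean()   # outlier counts as 1.0
```

With these numbers the mean is 1.4, the median is 0.7, and the capped mean is 0.65, so the single outlier task accounts for most of the plain mean.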
Pseudo-code for binned aggregation:
# Abstract algorithm (NOT the real implementation)
bins = uniform_bins(0, max_steps, num_bins)
for each seed:
    for each bin:
        bin_values[seed][bin] = mean(scores of this seed within bin)
for each bin:
    seed_mean[bin] = mean(bin_values[seed][bin] across seeds)
    seed_std[bin] = std(bin_values[seed][bin] across seeds)
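One possible runnable version of the binned aggregation, written with NumPy, is sketched below. It is an illustration under stated assumptions and not the actual plotting code: the helper `bin_curve`, the half-open bin convention, the use of NaN for empty bins, and the sample data are all choices made for this example.

```python
import numpy as np

def bin_curve(steps, scores, edges):
    """Average scores inside each [edges[i], edges[i+1]) interval.

    Bins that contain no data are returned as NaN so they can be ignored
    when aggregating across seeds.
    """
    steps = np.asarray(steps, dtype=float)
    scores = np.asarray(scores, dtype=float)
    idx = np.digitize(steps, edges) - 1  # bin index of every logged point
    out = np.full(len(edges) - 1, np.nan)
    for b in range(len(edges) - 1):
        mask = idx == b
        if mask.any():
            out[b] = scores[mask].mean()
    return out

# Three uniform bins over 300 steps: [0, 100), [100, 200), [200, 300).
edges = np.linspace(0, 300, 4)

# Two hypothetical seeds that were logged at different steps.
seed_a = bin_curve([10, 50, 150, 250], [1.0, 3.0, 5.0, 7.0], edges)
seed_b = bin_curve([20, 120, 180, 290], [2.0, 4.0, 6.0, 9.0], edges)

stacked = np.stack([seed_a, seed_b])        # shape: (seeds, bins)
seed_mean = np.nanmean(stacked, axis=0)     # line to plot
seed_std = np.nanstd(stacked, axis=0)       # half-width of the shaded band
```

Binning first and aggregating second means seeds do not need to share logging steps. The standard deviation here is the population form (`ddof=0`); whether to use the sample form instead is a presentation choice.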