Implementation:Danijar Dreamerv3 Plot Benchmark Curves
| Knowledge Sources | |
|---|---|
| Domains | Visualization, Evaluation |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
A concrete tool from the DreamerV3 repository for loading RL training run data, computing aggregate benchmark statistics, and generating multi-panel comparison plots.
Description
The plot.py module provides a complete pipeline for benchmark visualization. It discovers training run JSONL files via glob patterns, loads them in parallel using a thread pool executor, bins time-series data into uniform intervals using histogram-based averaging, computes aggregate statistics (mean, median, normalized scores against known baselines), and renders multi-panel matplotlib figures. The module supports multiple benchmark suites including Atari57, DMC, DMLab30, and ProcGen, with automatic baseline normalization from baselines.yaml.
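The binning step can be illustrated with a minimal sketch. It assumes equal-width bins over a fixed range and marks empty bins as NaN; the function name `bin_scores` and these details are illustrative assumptions, not the actual `bin_runs` implementation.

```python
import math

def bin_scores(xs, ys, start, stop, bins):
    """Average ys over `bins` equal-width intervals spanning [start, stop].

    Returns (centers, means); a bin with no samples gets NaN.
    """
    width = (stop - start) / bins
    sums = [0.0] * bins
    counts = [0] * bins
    for x, y in zip(xs, ys):
        if not (start <= x <= stop):
            continue  # ignore samples outside the binned range
        # The right edge belongs to the last bin, as in a closed histogram.
        index = min(int((x - start) / width), bins - 1)
        sums[index] += y
        counts[index] += 1
    centers = [start + (i + 0.5) * width for i in range(bins)]
    means = [s / c if c else math.nan for s, c in zip(sums, counts)]
    return centers, means

centers, means = bin_scores(
    xs=[0, 10, 20, 60, 100], ys=[1.0, 3.0, 5.0, 7.0, 9.0],
    start=0, stop=100, bins=4)
```

Binning every run onto the same grid is what makes curves from different seeds and methods directly comparable, since raw logs rarely share identical step values.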
Usage
Use this module when you need to compare training performance of DreamerV3 agents across benchmark environments. It is the primary tool for generating publication-quality training curve plots. Run it as a standalone script with CLI flags to specify input directories, output paths, and aggregation options.
Code Reference
Source Location
- Repository: Danijar_Dreamerv3
- File: plot.py
- Lines: 1-421
Signature
def main(args):
    """
    Main entry point for benchmark plotting.

    Args:
        args: elements.Flags namespace with fields:
            pattern (str): Glob for score files (default '**/scores.jsonl')
            indirs (list): Input directories containing runs
            outdir (str): Output directory for generated plots
            methods (str): Regex filter for method names
            tasks (str): Regex filter for task names
            newstyle (bool): Use new directory naming convention
            indir_prefix (bool): Prefix method names with input dir
            workers (int): Thread pool size for parallel loading
            xkeys (list): Candidate x-axis column names
            ykeys (list): Candidate y-axis column names
            ythres (float): Threshold for binary success metric
            xlim (float): X-axis limit (0 = auto)
            ylim (float): Y-axis limit (0 = auto)
            binsize (float): Fixed bin size (0 = auto from bins)
            bins (int): Number of bins for time-series aggregation
            cols (int): Number of subplot columns (0 = auto)
            legendcols (int): Legend columns (0 = auto)
            size (list): Subplot size [width, height]
            xticks (int): Number of x-axis ticks
            yticks (int): Number of y-axis ticks
            stats (list): Statistic types to compute
            agg (bool): Aggregate seeds with mean/std shading
            todf (str): Export DataFrame path (empty = skip)
    """

def load_runs(args) -> pd.DataFrame:
    """Load all matching runs into a DataFrame with task, method, seed, xs, ys columns."""

def bin_runs(df: pd.DataFrame, args) -> pd.DataFrame:
    """Bin time-series data into uniform intervals via histogram averaging."""

def comp_stats(df: pd.DataFrame, args) -> Optional[pd.DataFrame]:
    """Compute aggregate statistics optionally normalized against baselines."""

def plot_runs(df: pd.DataFrame, stats: Optional[pd.DataFrame], args) -> None:
    """Generate and save multi-panel comparison figure."""
Import
# Standalone script — run directly:
# python plot.py --indirs /path/to/runs --outdir /path/to/output
# Or import individual functions:
from plot import load_runs, bin_runs, comp_stats, plot_runs
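To make the loading stage more concrete, the sketch below shows one way to discover JSONL files with a glob pattern and parse them concurrently with a thread pool. The helper names `load_jsonl` and `load_all`, the directory layout, and the sample rows are hypothetical; the real `load_runs` may be organized differently and returns a DataFrame instead of a dict.

```python
import concurrent.futures
import json
import pathlib
import tempfile

def load_jsonl(path):
    """Parse one JSONL file into a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def load_all(indir, pattern='**/scores.jsonl', workers=16):
    """Glob for score files under `indir` and load them concurrently."""
    paths = sorted(pathlib.Path(indir).glob(pattern))
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        rows = list(pool.map(load_jsonl, paths))  # map preserves path order
    return dict(zip(paths, rows))

# Demo on a throwaway directory with a made-up run layout.
with tempfile.TemporaryDirectory() as tmp:
    run = pathlib.Path(tmp) / 'atari_pong' / 'dreamer' / 'seed0'
    run.mkdir(parents=True)
    (run / 'scores.jsonl').write_text(
        '{"step": 100, "episode/score": -21.0}\n'
        '{"step": 200, "episode/score": -19.0}\n')
    runs = load_all(tmp, workers=2)
    loaded = list(runs.values())
```

Threads are a reasonable fit here because reading many small log files is dominated by I/O wait, not by computation.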
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args.indirs | list[str] | Yes | Directories containing training run subdirectories |
| args.pattern | str | No | Glob pattern to find score files (default: `**/scores.jsonl`) |
| args.outdir | str | Yes | Output directory for saved plot images |
| args.methods | str | No | Regex filter for method names (default: `.*`) |
| args.tasks | str | No | Regex filter for task names (default: `.*`) |
| args.xkeys | list[str] | No | Candidate column names for x-axis (default: `['xs', 'step']`) |
| args.ykeys | list[str] | No | Candidate column names for y-axis (default: `['ys', 'episode/score']`) |
| args.bins | int | No | Number of time bins (default: 30) |
| args.stats | list[str] | No | Statistics to compute (default: `['runs', 'auto']`) |
| args.workers | int | No | Thread pool size for parallel loading (default: 16) |
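Because `args.xkeys` and `args.ykeys` are lists of candidate names, the loader has to pick one column per axis from each log. A minimal sketch of that selection follows, assuming the first candidate present in a row wins; the helper `pick_key` and the sample row are hypothetical, and the actual resolution rule in plot.py is not shown here.

```python
def pick_key(row, candidates):
    """Return the first candidate name that is present in `row`."""
    for key in candidates:
        if key in row:
            return key
    raise KeyError(f'none of {candidates} found in {sorted(row)}')

# A made-up log row using the second candidate of each default list.
row = {'step': 4000, 'episode/score': 12.5, 'episode/length': 310}
xkey = pick_key(row, ['xs', 'step'])
ykey = pick_key(row, ['ys', 'episode/score'])
```

Listing several candidates lets the same invocation read logs written with different column names.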
Outputs
| Name | Type | Description |
|---|---|---|
| curves.png | File | Multi-panel PNG figure with per-task curves and aggregate statistics |
| DataFrame (optional) | JSON file | Binned run data if `args.todf` is set |
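The aggregate statistics in the figure rely on scores being normalized against baselines before they are averaged across tasks. A common convention, assumed here, rescales each task so that a low reference score maps to 0 and a high reference score maps to 1. The baseline values and field names below are invented for illustration and are not taken from baselines.yaml.

```python
import statistics

def normalize(score, low, high):
    """Rescale so that `low` maps to 0 and `high` maps to 1."""
    return (score - low) / (high - low)

# Hypothetical baselines, for illustration only (not from baselines.yaml).
baselines = {
    'task_a': {'low': 0.0, 'high': 100.0},
    'task_b': {'low': -20.0, 'high': 20.0},
    'task_c': {'low': 10.0, 'high': 50.0},
}
scores = {'task_a': 50.0, 'task_b': 20.0, 'task_c': 20.0}

normed = {task: normalize(s, **baselines[task]) for task, s in scores.items()}
agg_mean = statistics.mean(list(normed.values()))
agg_median = statistics.median(list(normed.values()))
```

Reporting both mean and median is useful because the mean can be dominated by a few tasks with very high normalized scores, while the median reflects the typical task.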
Usage Examples
Basic Benchmark Plotting
# From command line:
# python plot.py \
# --indirs /path/to/logdir \
# --outdir ./plots \
# --tasks "atari_.*" \
# --stats runs atari_mean atari_median \
# --bins 50
import elements
from plot import main
args = elements.Flags(
    pattern='**/scores.jsonl',
    indirs=['/path/to/experiment/logdir'],
    outdir='./plots',
    methods='.*',
    tasks='atari_.*',
    newstyle=True,
    indir_prefix=False,
    workers=16,
    xkeys=['xs', 'step'],
    ykeys=['ys', 'episode/score'],
    ythres=0.0,
    xlim=0,
    ylim=0,
    binsize=0,
    bins=50,
    cols=6,
    legendcols=0,
    size=[3, 3],
    xticks=4,
    yticks=10,
    stats=['runs', 'atari_mean', 'atari_median'],
    agg=True,
    todf='',
).parse()
main(args)
# Saves curves.png to ./plots/<indir_name>/curves.png
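The `agg=True` flag aggregates seeds with mean/std shading. The sketch below shows the underlying arithmetic for curves that already share the same bins. The helper `aggregate_seeds` is hypothetical, and the use of the population standard deviation is an assumption; plot.py may use a different estimator.

```python
import statistics

def aggregate_seeds(curves):
    """Per-bin mean and standard deviation across equally binned seed curves."""
    means, stds = [], []
    for values in zip(*curves):  # one tuple of per-seed values per bin
        means.append(statistics.fmean(values))
        stds.append(statistics.pstdev(values))
    return means, stds

seed_curves = [
    [1.0, 2.0, 4.0],  # seed 0
    [3.0, 2.0, 8.0],  # seed 1
]
means, stds = aggregate_seeds(seed_curves)
# Bounds of the shaded band around the mean curve.
lower = [m - s for m, s in zip(means, stds)]
upper = [m + s for m, s in zip(means, stds)]
```

In matplotlib, a band like this is typically drawn by passing `lower` and `upper` to `Axes.fill_between` with a reduced alpha, on top of a line plot of `means`.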
Loading and Inspecting Run Data
import elements
from plot import load_runs, bin_runs, print_summary
args = elements.Flags(
    pattern='**/scores.jsonl',
    indirs=['/path/to/logdir'],
    methods='.*',
    tasks='.*',
    newstyle=True,
    indir_prefix=False,
    workers=16,
    xkeys=['xs', 'step'],
    ykeys=['ys', 'episode/score'],
    ythres=0.0,
    xlim=0,
    binsize=0,
    bins=30,
).parse()
df = load_runs(args)
df = bin_runs(df, args)
print_summary(df)
# Prints method names, task names, and seed counts
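For a sense of what such a summary involves, the sketch below counts distinct seeds per method and task from records carrying the `method`, `task`, and `seed` fields that `load_runs` is documented to produce. The helper `count_seeds` and the sample records are hypothetical and do not reproduce the output format of `print_summary`.

```python
import collections

def count_seeds(records):
    """Count distinct seeds for each (method, task) pair."""
    seeds = collections.defaultdict(set)
    for rec in records:
        seeds[(rec['method'], rec['task'])].add(rec['seed'])
    return {pair: len(ids) for pair, ids in sorted(seeds.items())}

# Made-up records standing in for rows of the loaded DataFrame.
records = [
    {'method': 'dreamer', 'task': 'atari_pong', 'seed': 0},
    {'method': 'dreamer', 'task': 'atari_pong', 'seed': 1},
    {'method': 'dreamer', 'task': 'atari_breakout', 'seed': 0},
    {'method': 'baseline', 'task': 'atari_pong', 'seed': 0},
]
counts = count_seeds(records)
```

Checking seed counts before plotting helps catch incomplete experiments, where a missing seed would otherwise silently narrow the shaded band.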