Implementation:Danijar Dreamerv3 Plot Benchmark Curves

From Leeroopedia
Knowledge Sources
Domains: Visualization, Evaluation
Last Updated: 2026-02-15 09:00 GMT

Overview

A concrete tool, provided by the DreamerV3 repository, for loading RL training run data, computing aggregate benchmark statistics, and generating multi-panel comparison plots.

Description

The plot.py module provides a complete pipeline for benchmark visualization. It discovers training run JSONL files via glob patterns, loads them in parallel using a thread pool executor, bins time-series data into uniform intervals using histogram-based averaging, computes aggregate statistics (mean, median, normalized scores against known baselines), and renders multi-panel matplotlib figures. The module supports multiple benchmark suites including Atari57, DMC, DMLab30, and ProcGen, with automatic baseline normalization from baselines.yaml.
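
The binning step can be illustrated with a minimal sketch, assuming equal-width bins over a non-negative x-axis. The helper name `bin_average` and its arguments are hypothetical and are not taken from plot.py; the sketch only shows the general idea of accumulating per-bin sums and counts and dividing them.

```python
def bin_average(xs, ys, num_bins, x_max=None):
    """Average ys within equal-width bins over [0, x_max].

    Hypothetical helper for illustration; assumes all xs are >= 0.
    Returns (centers, means). Bins without samples get NaN.
    """
    if x_max is None:
        x_max = max(xs)
    width = x_max / num_bins
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(xs, ys):
        # Clamp the index so that x == x_max falls into the last bin.
        idx = min(int(x / width), num_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    centers = [(i + 0.5) * width for i in range(num_bins)]
    means = [s / c if c else float('nan') for s, c in zip(sums, counts)]
    return centers, means


steps = [0, 10, 20, 30, 40, 50, 60, 70, 80]
scores = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0]
centers, means = bin_average(steps, scores, num_bins=4)
print(centers)  # [10.0, 30.0, 50.0, 70.0]
print(means)    # [2.0, 3.0, 5.5, 8.0]
```

Putting every run on the same bin centers is what makes curves from different seeds and methods directly comparable, even when the runs logged at different steps.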

Usage

Use this module when you need to compare training performance of DreamerV3 agents across benchmark environments. It is the primary tool for generating publication-quality training curve plots. Run it as a standalone script with CLI flags to specify input directories, output paths, and aggregation options.

Code Reference

Source Location

Signature

def main(args):
    """
    Main entry point for benchmark plotting.

    Args:
        args: elements.Flags namespace with fields:
            pattern (str): Glob for score files (default '**/scores.jsonl')
            indirs (list): Input directories containing runs
            outdir (str): Output directory for generated plots
            methods (str): Regex filter for method names
            tasks (str): Regex filter for task names
            newstyle (bool): Use new directory naming convention
            indir_prefix (bool): Prefix method names with input dir
            workers (int): Thread pool size for parallel loading
            xkeys (list): Candidate x-axis column names
            ykeys (list): Candidate y-axis column names
            ythres (float): Threshold for binary success metric
            xlim (float): X-axis limit (0 = auto)
            ylim (float): Y-axis limit (0 = auto)
            binsize (float): Fixed bin size (0 = auto from bins)
            bins (int): Number of bins for time-series aggregation
            cols (int): Number of subplot columns (0 = auto)
            legendcols (int): Legend columns (0 = auto)
            size (list): Subplot size [width, height]
            xticks (int): Number of x-axis ticks
            yticks (int): Number of y-axis ticks
            stats (list): Statistic types to compute
            agg (bool): Aggregate seeds with mean/std shading
            todf (str): Export DataFrame path (empty = skip)
    """

def load_runs(args) -> pd.DataFrame:
    """Load all matching runs into a DataFrame with task, method, seed, xs, ys columns."""

def bin_runs(df: pd.DataFrame, args) -> pd.DataFrame:
    """Bin time-series data into uniform intervals via histogram averaging."""

def comp_stats(df: pd.DataFrame, args) -> Optional[pd.DataFrame]:
    """Compute aggregate statistics optionally normalized against baselines."""

def plot_runs(df: pd.DataFrame, stats: Optional[pd.DataFrame], args) -> None:
    """Generate and save multi-panel comparison figure."""
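
As a rough illustration of what "normalized against baselines" can mean in `comp_stats`, the sketch below rescales each task score between a low and a high reference value and then takes the mean and median across tasks. The baseline dictionary layout, the task names, and all numbers are made up for illustration; the actual structure of baselines.yaml and the exact formula used by plot.py are not shown in this page.

```python
import statistics

# Hypothetical baseline table with invented values. The real
# baselines.yaml layout and contents may differ.
BASELINES = {
    'task_a': {'random': 0.0, 'human': 100.0},
    'task_b': {'random': 10.0, 'human': 20.0},
    'task_c': {'random': -20.0, 'human': 20.0},
}


def normalize(score, low, high):
    """Map score so that low -> 0.0 and high -> 1.0."""
    return (score - low) / (high - low)


def aggregate(scores, baselines, low='random', high='human'):
    """Return (mean, median) of normalized scores across tasks."""
    normed = [
        normalize(score, baselines[task][low], baselines[task][high])
        for task, score in scores.items()
    ]
    return statistics.mean(normed), statistics.median(normed)


final_scores = {'task_a': 50.0, 'task_b': 25.0, 'task_c': 20.0}
mean_score, median_score = aggregate(final_scores, BASELINES)
print(mean_score, median_score)  # 1.0 1.0
```

Normalizing first keeps tasks with large raw score ranges from dominating the aggregate, and the median is less sensitive than the mean to a few tasks with very high normalized scores.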

Import

# Standalone script — run directly:
# python plot.py --indirs /path/to/runs --outdir /path/to/output

# Or import individual functions:
from plot import load_runs, bin_runs, comp_stats, plot_runs

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| args.indirs | list[str] | Yes | Directories containing training run subdirectories |
| args.pattern | str | Yes | Glob pattern to find score files (default: `**/scores.jsonl`) |
| args.outdir | str | Yes | Output directory for saved plot images |
| args.methods | str | No | Regex filter for method names (default: `.*`) |
| args.tasks | str | No | Regex filter for task names (default: `.*`) |
| args.xkeys | list[str] | No | Candidate column names for x-axis (default: `['xs', 'step']`) |
| args.ykeys | list[str] | No | Candidate column names for y-axis (default: `['ys', 'episode/score']`) |
| args.bins | int | No | Number of time bins (default: 30) |
| args.stats | list[str] | No | Statistics to compute (default: `['runs', 'auto']`) |
| args.workers | int | No | Thread pool size for parallel loading (default: 16) |
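
To make the roles of `indirs`, `pattern`, `xkeys`, `ykeys`, and `workers` more concrete, here is a minimal standard-library sketch of glob discovery plus thread-pool loading of JSONL files. The functions `first_key`, `load_run`, and `load_all` are hypothetical names, and the rule of taking the first candidate key present in each record is an assumption rather than the documented behavior of `load_runs`.

```python
import json
import pathlib
from concurrent.futures import ThreadPoolExecutor


def first_key(record, candidates):
    """Return the first candidate key present in record, else None."""
    for key in candidates:
        if key in record:
            return key
    return None


def load_run(path, xkeys, ykeys):
    """Read one JSONL file into parallel lists of x and y values."""
    xs, ys = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            xkey = first_key(record, xkeys)
            ykey = first_key(record, ykeys)
            if xkey is None or ykey is None:
                continue  # Skip records that lack either axis.
            xs.append(record[xkey])
            ys.append(record[ykey])
    return {'path': str(path), 'xs': xs, 'ys': ys}


def load_all(indirs, pattern, xkeys, ykeys, workers=16):
    """Discover files under each input dir and load them in parallel."""
    paths = sorted(p for d in indirs for p in pathlib.Path(d).glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: load_run(p, xkeys, ykeys), paths))
```

A call such as `load_all(['/path/to/logdir'], '**/scores.jsonl', ['xs', 'step'], ['ys', 'episode/score'])` would return one dictionary per discovered file. Threads suit this step because the work is dominated by file reads rather than computation.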

Outputs

| Name | Type | Description |
|---|---|---|
| curves.png | File | Multi-panel PNG figure with per-task curves and aggregate statistics |
| DataFrame (optional) | JSON file | Binned run data if `args.todf` is set |

Usage Examples

Basic Benchmark Plotting

# From command line:
# python plot.py \
#   --indirs /path/to/logdir \
#   --outdir ./plots \
#   --tasks "atari_.*" \
#   --stats runs atari_mean atari_median \
#   --bins 50

import elements
from plot import main

args = elements.Flags(
    pattern='**/scores.jsonl',
    indirs=['/path/to/experiment/logdir'],
    outdir='./plots',
    methods='.*',
    tasks='atari_.*',
    newstyle=True,
    indir_prefix=False,
    workers=16,
    xkeys=['xs', 'step'],
    ykeys=['ys', 'episode/score'],
    ythres=0.0,
    xlim=0,
    ylim=0,
    binsize=0,
    bins=50,
    cols=6,
    legendcols=0,
    size=[3, 3],
    xticks=4,
    yticks=10,
    stats=['runs', 'atari_mean', 'atari_median'],
    agg=True,
    todf='',
).parse()

main(args)
# Saves curves.png to ./plots/<indir_name>/curves.png
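
The `agg` flag is described as aggregating seeds with mean/std shading. A minimal sketch of that kind of aggregation, assuming all seeds have already been binned onto the same x positions, is shown below. The function `aggregate_seeds` is a hypothetical name, and the use of the population standard deviation is an assumption, not a statement about what plot.py computes.

```python
import statistics


def aggregate_seeds(curves):
    """Combine per-seed curves that share the same bins.

    curves: list of equal-length lists, one list per seed.
    Returns (means, lows, highs), where the band is mean +/- one
    population standard deviation at each bin.
    """
    means, lows, highs = [], [], []
    for values in zip(*curves):
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        means.append(mean)
        lows.append(mean - std)
        highs.append(mean + std)
    return means, lows, highs


seed_curves = [
    [1, 2, 3],  # seed 0
    [3, 4, 5],  # seed 1
]
means, lows, highs = aggregate_seeds(seed_curves)
print(means)  # [2.0, 3.0, 4.0]
print(lows)   # [1.0, 2.0, 3.0]
print(highs)  # [3.0, 4.0, 5.0]
```

In matplotlib, a band like this is typically drawn with `ax.fill_between(xs, lows, highs, alpha=...)` underneath a line for the mean.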

Loading and Inspecting Run Data

import elements
from plot import load_runs, bin_runs, print_summary

args = elements.Flags(
    pattern='**/scores.jsonl',
    indirs=['/path/to/logdir'],
    methods='.*',
    tasks='.*',
    newstyle=True,
    indir_prefix=False,
    workers=16,
    xkeys=['xs', 'step'],
    ykeys=['ys', 'episode/score'],
    ythres=0.0,
    xlim=0,
    binsize=0,
    bins=30,
).parse()

df = load_runs(args)
df = bin_runs(df, args)
print_summary(df)
# Prints method names, task names, and seed counts
