Principle:Iterative Dvc Experiment Comparison

Knowledge Sources	DVC Documentation
Domains	Experiment_Management, Data_Analysis
Last Updated	2026-02-10 00:00 GMT

Overview

Experiment comparison is the systematic collection, aggregation, and tabular presentation of metrics, parameters, and dependency states across multiple experiment revisions for analysis and ranking.

Description

Running experiments produces a wealth of data -- metrics like accuracy and loss, parameters like learning rate and batch size, dependency hashes that track data provenance -- scattered across multiple Git revisions and ref namespaces. Without a structured comparison mechanism, practitioners must manually check out individual experiments and inspect their outputs, a process that does not scale beyond a handful of runs. Experiment comparison solves this by collecting data from all specified experiments and presenting it in a unified tabular format where each row represents an experiment and each column represents a metric, parameter, or dependency.

The comparison process involves two main phases. Collection traverses the experiment ref namespace and the workspace to gather ExpState objects, each containing the metrics, parameters, timestamps, and dependency information for a single experiment revision. Collection can be restricted by branch, tag, or commit range, and can include or exclude queued experiments, failed experiments, and the workspace. Tabulation transforms the collected states into a structured table: it resolves column names across experiments (since different experiments may track different metrics), applies fill values for missing data, and optionally sorts the rows by a specified metric or parameter.
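As a rough illustration of the collection phase, the sketch below filters a list of already-loaded states by status. `SimpleExpState`, `collect`, and the `hide_queued` / `hide_failed` arguments are hypothetical stand-ins invented for this example; they are not DVC's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleExpState:
    """Hypothetical stand-in for a collected experiment state."""
    name: str
    status: str  # assumed values: "success", "queued", "failed"
    metrics: dict = field(default_factory=dict)  # {file: {name: value}}
    params: dict = field(default_factory=dict)   # {file: {name: value}}
    deps: dict = field(default_factory=dict)     # {dependency name: hash}
    timestamp: Optional[str] = None


def collect(states, hide_queued=False, hide_failed=False):
    """Return the states that pass the status filters, in input order."""
    hidden = set()
    if hide_queued:
        hidden.add("queued")
    if hide_failed:
        hidden.add("failed")
    return [s for s in states if s.status not in hidden]


states = [
    SimpleExpState("exp-a", "success",
                   metrics={"metrics.json": {"accuracy": 0.91}}),
    SimpleExpState("exp-b", "queued"),
    SimpleExpState("exp-c", "failed"),
]
visible = collect(states, hide_queued=True, hide_failed=True)
print([s.name for s in visible])
```

With both filters enabled, only the completed experiment is kept; with the defaults, all three states pass through to tabulation.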

The tabular output is designed for both human consumption (rendered as a terminal table or markdown) and programmatic use (exportable as CSV or dictionary). The column structure is dynamic: it adapts to the set of metrics and parameters present across the compared experiments, automatically adding columns for new metrics and using fill values (typically "-") for experiments that lack a particular metric.
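The dynamic column structure can be sketched with plain dictionaries. This is a minimal illustration under simplifying assumptions (flat name-to-value mappings per experiment, a hypothetical `build_table` helper), not the data structure DVC uses internally.

```python
import csv
import io


def build_table(experiments, fill_value="-"):
    """Return (columns, rows) covering the union of all keys seen."""
    columns = []
    for values in experiments.values():
        for key in values:
            if key not in columns:
                columns.append(key)  # first-seen order keeps columns stable
    rows = [
        [name] + [values.get(col, fill_value) for col in columns]
        for name, values in experiments.items()
    ]
    return ["Experiment"] + columns, rows


def to_csv(columns, rows):
    """Serialize the table for programmatic use."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


experiments = {
    "exp-a": {"accuracy": 0.91, "lr": 0.01},
    "exp-b": {"accuracy": 0.88, "lr": 0.1, "f1": 0.80},
}
columns, rows = build_table(experiments)
print(to_csv(columns, rows))
```

Here `exp-a` never recorded `f1`, so its cell in that column holds the fill value while the column itself still appears because `exp-b` tracks it.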

Usage

Use experiment comparison when:

You have completed multiple experiment runs and need to identify the best-performing configuration
You want to understand how parameter changes correlate with metric changes
You need to generate a report or table of experiment results for documentation or review
You are performing hyperparameter search and need to rank results by a target metric
You need to audit the dependency state across experiments to verify data provenance

This technique becomes necessary once the number of experiments exceeds what can be tracked mentally, or when formal comparison criteria need to be applied.

Theoretical Basis

Experiment comparison follows a collect-normalize-tabulate pipeline:

function compare_experiments(repo, revisions, filters, sort_by=None, sort_order="asc"):
    # Phase 1: Collection
    exp_states = []
    for rev in resolve_revisions(revisions, filters):
        state = load_exp_state(rev)
        # state contains: metrics{path: {name: value}},
        #                  params{path: {name: value}},
        #                  deps{name: hash}, timestamp, name
        exp_states.append(state)

    # Phase 2: Column Name Resolution
    all_metric_names = collect_unique_names(exp_states, "metrics")
    all_param_names = collect_unique_names(exp_states, "params")
    all_dep_names = collect_unique_names(exp_states, "deps")

    # Handle name collisions across files
    headers = resolve_ambiguous_names(
        all_metric_names, all_param_names, all_dep_names
    )
    # If "accuracy" appears in both metrics.json and eval.json,
    # columns become "metrics.json:accuracy" and "eval.json:accuracy"

    # Phase 3: Tabulation
    table = TabularData(columns=headers, fill_value="-")
    for state in exp_states:
        row = build_row(state, headers, fill_value="-")
        table.append(row)

    # Optional: Sort by metric
    if sort_by:
        table.sort(key=sort_by, order=sort_order)

    return table
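The optional sort step needs a rule for rows that hold the fill value, since "-" cannot be ordered against numbers. The sketch below places such rows last; that placement and the `sort_rows` helper are assumptions made for illustration, not documented DVC behavior.

```python
def sort_rows(rows, sort_by, sort_order="asc", fill_value="-"):
    """Sort dict rows by one column, keeping rows without a value last."""
    present = [r for r in rows if r.get(sort_by, fill_value) != fill_value]
    missing = [r for r in rows if r.get(sort_by, fill_value) == fill_value]
    present.sort(key=lambda r: r[sort_by], reverse=(sort_order == "desc"))
    return present + missing


rows = [
    {"Experiment": "exp-a", "accuracy": 0.88},
    {"Experiment": "exp-b", "accuracy": "-"},
    {"Experiment": "exp-c", "accuracy": 0.93},
]
ranked = sort_rows(rows, "accuracy", sort_order="desc")
print([r["Experiment"] for r in ranked])
```

Separating incomplete rows before sorting avoids a mixed-type comparison and keeps the ranking of complete rows unaffected by missing data.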

Column name disambiguation is a key aspect of the normalization phase. When a name such as "loss" appears in more than one metric or parameter file, the system prefixes it with the file path to avoid ambiguity. The algorithm counts the occurrences of each name across all files: a name that appears exactly once is used as-is, while a name that appears in several files is qualified with its file path:

function normalize_headers(names_by_file, global_name_count):
    headers = []
    for file_path in names_by_file:
        for name in names_by_file[file_path]:
            if global_name_count[name] == 1:
                headers.append(name)
            else:
                headers.append(file_path + ":" + name)
    return headers
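A runnable Python rendering of the same idea is shown here, with the occurrence count computed inside the function. It assumes each name occurs at most once per file; the file names and metric names are made up for the example.

```python
from collections import Counter


def normalize_headers(names_by_file):
    """Qualify a name with its file path only when several files share it."""
    counts = Counter(
        name for names in names_by_file.values() for name in names
    )
    headers = []
    for file_path, names in names_by_file.items():
        for name in names:
            if counts[name] == 1:
                headers.append(name)  # unique across files: keep short form
            else:
                headers.append(f"{file_path}:{name}")
    return headers


names_by_file = {
    "metrics.json": ["accuracy", "loss"],
    "eval.json": ["accuracy", "f1"],
}
print(normalize_headers(names_by_file))
# ['metrics.json:accuracy', 'loss', 'eval.json:accuracy', 'f1']
```

Only `accuracy` is shared between the two files, so it is the only name that receives a path prefix; `loss` and `f1` keep their short form.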

The hierarchical structure of the comparison output reflects the experiment lineage: baseline commits form top-level rows, with their derived experiments nested beneath. This tree structure makes it easy to see which experiments belong to which baseline and to compare siblings derived from the same starting point.
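The nesting of experiments under their baselines can be sketched as a flattening of a two-level mapping into indented display rows. The `render_tree` helper, the branch names, and the connector characters are illustrative choices, not a reproduction of DVC's renderer.

```python
def render_tree(baselines):
    """Flatten {baseline: [experiment, ...]} into indented display rows."""
    lines = []
    for baseline, experiments in baselines.items():
        lines.append(baseline)  # baseline commit forms a top-level row
        for i, exp in enumerate(experiments):
            is_last = i == len(experiments) - 1
            connector = "└── " if is_last else "├── "
            lines.append(connector + exp)
    return lines


baselines = {
    "main": ["exp-a", "exp-b"],
    "feature-x": ["exp-c"],
}
print("\n".join(render_tree(baselines)))
```

Siblings derived from the same baseline end up on adjacent rows, which is what makes side-by-side comparison within one lineage straightforward.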

Key theoretical properties:

Schema flexibility: The table schema adapts dynamically to the union of all metrics and parameters across experiments
Missing data handling: Experiments that lack a particular metric receive a configurable fill value rather than causing errors
Stable ordering: Column order follows the insertion order of metric and parameter files, providing consistency across repeated comparisons
Composability: The tabular output can be filtered, sorted, projected, and exported in multiple formats

Related Pages

Implemented By

Implementation:Iterative_Dvc_Experiments_Show
