Principle:Iterative Dvc Experiment Comparison
| Knowledge Sources | |
|---|---|
| Domains | Experiment_Management, Data_Analysis |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Experiment comparison is the systematic collection, aggregation, and tabular presentation of metrics, parameters, and dependency states across multiple experiment revisions for analysis and ranking.
Description
Running experiments produces a wealth of data -- metrics like accuracy and loss, parameters like learning rate and batch size, dependency hashes that track data provenance -- scattered across multiple Git revisions and ref namespaces. Without a structured comparison mechanism, practitioners must manually check out individual experiments and inspect their outputs, a process that does not scale beyond a handful of runs. Experiment comparison solves this by collecting data from all specified experiments and presenting it in a unified tabular format where each row represents an experiment and each column represents a metric, parameter, or dependency.
The comparison process involves two distinct phases. Collection traverses the experiment ref namespace and workspace to gather ExpState objects, each containing the metrics, parameters, timestamps, and dependency information for a single experiment revision. The collection phase supports filtering by branches, tags, commit ranges, and experiment status (queued, failed, workspace). Tabulation transforms the collected states into a structured table: it resolves column names across experiments (since different experiments may track different metrics), applies fill values for missing data, and optionally sorts by a specified metric or parameter.
The tabular output is designed for both human consumption (rendered as a terminal table or markdown) and programmatic use (exportable as CSV or dictionary). The column structure is dynamic: it adapts to the set of metrics and parameters present across the compared experiments, automatically adding columns for new metrics and using fill values (typically "-") for experiments that lack a particular metric.
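The dynamic column behavior described above can be illustrated with a minimal sketch that uses only the Python standard library. The function name `rows_to_csv` and the sample rows are hypothetical and are not part of any DVC API; the sketch only shows how a union of columns plus a fill value yields a rectangular CSV export.

```python
import csv
import io


def rows_to_csv(rows, fill_value="-"):
    """Render rows with differing keys as CSV, filling gaps with fill_value."""
    # Build the column list as the union of keys, in order of first appearance.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval supplies the value written for keys a row does not have.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval=fill_value, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: the second experiment tracks an extra metric.
rows = [
    {"Experiment": "exp-a", "accuracy": 0.91, "lr": 0.01},
    {"Experiment": "exp-b", "accuracy": 0.93, "lr": 0.001, "f1": 0.88},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Here `exp-a` has no `f1` value, so its row receives the fill value in that column instead of raising an error.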
Usage
Use experiment comparison when:
- You have completed multiple experiment runs and need to identify the best-performing configuration
- You want to understand how parameter changes correlate with metric changes
- You need to generate a report or table of experiment results for documentation or review
- You are performing hyperparameter search and need to rank results by a target metric
- You need to audit the dependency state across experiments to verify data provenance
Reach for this technique whenever the number of experiments exceeds what can be tracked mentally, or when formal comparison criteria must be applied.
Theoretical Basis
Experiment comparison follows a collect-normalize-tabulate pipeline. Normalization is the column name resolution step that the Description groups under tabulation:
function compare_experiments(repo, revisions, filters, sort_by=None, sort_order="asc"):
    # Phase 1: Collection
    exp_states = []
    for rev in resolve_revisions(revisions, filters):
        state = load_exp_state(rev)
        # state contains: metrics{path: {name: value}},
        #                 params{path: {name: value}},
        #                 deps{name: hash}, timestamp, name
        exp_states.append(state)

    # Phase 2: Column Name Resolution
    all_metric_names = collect_unique_names(exp_states, "metrics")
    all_param_names = collect_unique_names(exp_states, "params")
    all_dep_names = collect_unique_names(exp_states, "deps")

    # Handle name collisions across files
    headers = resolve_ambiguous_names(
        all_metric_names, all_param_names
    )
    # If "accuracy" appears in both metrics.json and eval.json,
    # columns become "metrics.json:accuracy" and "eval.json:accuracy"

    # Phase 3: Tabulation
    table = TabularData(columns=headers, fill_value="-")
    for state in exp_states:
        row = build_row(state, headers, fill_value="-")
        table.append(row)

    # Optional: Sort by metric or parameter
    if sort_by:
        table.sort(key=sort_by, order=sort_order)

    return table
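As a rough runnable counterpart to the pseudocode, the following sketch flattens nested state into rows, builds the column union, and sorts by one column. `ExpStateSketch`, `flatten`, and `compare` are hypothetical names introduced for illustration; they are not DVC's actual classes or functions. For simplicity the sketch always qualifies column names with their file path and places experiments that lack the sort column last, both of which are assumptions rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ExpStateSketch:
    """Hypothetical stand-in for an experiment state; not DVC's actual class."""
    name: str
    metrics: dict = field(default_factory=dict)  # {path: {name: value}}
    params: dict = field(default_factory=dict)   # {path: {name: value}}


def flatten(state):
    """Flatten nested {path: {name: value}} mappings into 'path:name' keys."""
    row = {"Experiment": state.name}
    for section in (state.metrics, state.params):
        for path, values in section.items():
            for name, value in values.items():
                row[f"{path}:{name}"] = value
    return row


def compare(states, sort_by=None, descending=False, fill_value="-"):
    rows = [flatten(state) for state in states]
    # Column union in order of first appearance, fixed before any sorting.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    if sort_by is not None:
        # Sort only rows that have the column; rows without it go last.
        present = [row for row in rows if sort_by in row]
        missing = [row for row in rows if sort_by not in row]
        present.sort(key=lambda row: row[sort_by], reverse=descending)
        rows = present + missing
    table = [[row.get(col, fill_value) for col in columns] for row in rows]
    return columns, table


# Hypothetical sample data: exp-c has no metrics at all.
states = [
    ExpStateSketch("exp-a", {"metrics.json": {"accuracy": 0.91}}, {"params.yaml": {"lr": 0.01}}),
    ExpStateSketch("exp-b", {"metrics.json": {"accuracy": 0.95}}, {"params.yaml": {"lr": 0.001}}),
    ExpStateSketch("exp-c", {}, {"params.yaml": {"lr": 0.1}}),
]
columns, table = compare(states, sort_by="metrics.json:accuracy", descending=True)
for row in table:
    print(row)
```

Sorting before fill values are applied keeps the placeholder string out of the numeric comparison, which avoids mixing strings and floats in the sort key.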
Column name disambiguation is a key aspect of the normalization phase. When a metric name like "loss" appears in multiple parameter or metric files, the system prefixes it with the file path to avoid ambiguity. The algorithm counts occurrences of each name across all files; names that appear exactly once are used as-is, while names that appear in multiple files are qualified with their file path:
function normalize_headers(names_by_file, global_name_count):
    headers = []
    for file_path in names_by_file:
        for name in names_by_file[file_path]:
            if global_name_count[name] == 1:
                headers.append(name)
            else:
                headers.append(file_path + ":" + name)
    return headers
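The disambiguation rule can be sketched in Python as follows. This version computes the global name count itself with `collections.Counter` instead of taking it as an argument, and it assumes each name appears at most once within a single file. The file names and metric names are illustrative only.

```python
from collections import Counter


def normalize_headers(names_by_file):
    """Qualify a name with its file path only when it occurs in several files."""
    # Count how many files each name appears in (one occurrence per file assumed).
    counts = Counter(name for names in names_by_file.values() for name in names)
    headers = []
    for file_path, names in names_by_file.items():
        for name in names:
            if counts[name] == 1:
                headers.append(name)
            else:
                headers.append(f"{file_path}:{name}")
    return headers


# Hypothetical input: "accuracy" is tracked in two different files.
names_by_file = {
    "metrics.json": ["accuracy", "loss"],
    "eval.json": ["accuracy", "auc"],
}
headers = normalize_headers(names_by_file)
print(headers)
```

Only `accuracy` is qualified here; `loss` and `auc` stay unprefixed because each occurs in a single file.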
The hierarchical structure of the comparison output reflects the experiment lineage: baseline commits form top-level rows, with their derived experiments nested beneath. This tree structure makes it easy to see which experiments belong to which baseline and to compare siblings derived from the same starting point.
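One way to picture this nesting is a small rendering sketch in which each baseline is followed by its derived experiments as indented rows. The function `render_tree`, the input shape, and the ASCII connectors are assumptions made for illustration; they do not reproduce the actual terminal output of any tool.

```python
def render_tree(baselines, indent="  "):
    """Render {baseline: [experiment, ...]} as baseline rows with nested children."""
    lines = []
    for baseline, experiments in baselines.items():
        lines.append(baseline)
        for position, experiment in enumerate(experiments):
            # Mark the last child differently so sibling groups are easy to scan.
            is_last = position == len(experiments) - 1
            connector = "`-- " if is_last else "|-- "
            lines.append(indent + connector + experiment)
    return lines


# Hypothetical lineage: two experiments derived from "main", none from "feature".
lineage = {"main": ["exp-a", "exp-b"], "feature": []}
tree_lines = render_tree(lineage)
print("\n".join(tree_lines))
```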
Key theoretical properties:
- Schema flexibility: The table schema adapts dynamically to the union of all metrics and parameters across experiments
- Missing data handling: Experiments that lack a particular metric receive a configurable fill value rather than causing errors
- Stable ordering: Column order follows the insertion order of metric and parameter files, providing consistency across repeated comparisons
- Composability: The tabular output can be filtered, sorted, projected, and exported in multiple formats
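The composability property can be illustrated with two small helpers over rows represented as dictionaries. `where` and `project` are hypothetical names, not methods of an existing table class; the sketch only shows that filtering and projection chain naturally when each step returns a new list of rows.

```python
def where(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def project(rows, columns, fill_value="-"):
    """Keep only the named columns, filling gaps with fill_value."""
    return [{col: row.get(col, fill_value) for col in columns} for row in rows]


# Hypothetical sample data: exp-c never reported an accuracy.
rows = [
    {"Experiment": "exp-a", "accuracy": 0.91, "lr": 0.01},
    {"Experiment": "exp-b", "accuracy": 0.93, "lr": 0.001},
    {"Experiment": "exp-c", "lr": 0.1},
]

# Filter first, then project: rows without an accuracy are treated as failing.
best = project(
    where(rows, lambda row: row.get("accuracy", 0) >= 0.92),
    ["Experiment", "accuracy"],
)
print(best)
```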