Principle:Iterative Dvc Experiment Workspace Preparation

Knowledge Sources	DVC Documentation
Domains	Experiment_Management, Reproducibility
Last Updated	2026-02-10 00:00 GMT

Overview

Experiment workspace preparation is the process of creating an isolated, reproducible environment for an experiment by capturing the current workspace state along with parameter modifications in a version-controlled snapshot.

Description

When running machine learning experiments, the workspace -- consisting of source code, configuration files, data references, and pipeline definitions -- must be in a known, reproducible state before execution begins. Experiment workspace preparation solves the problem of ensuring that each experiment starts from a well-defined baseline, with any intended parameter modifications cleanly applied, regardless of the execution environment.

The core mechanism is stash-based versioning: the current workspace state, including any parameter overrides specified by the user, is captured as a Git stash entry. This stash entry serves as a self-contained snapshot that encodes the exact starting conditions for the experiment. The stash entry is assigned a unique identifier and optionally a human-readable name, and it is placed into a queue for execution.
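The parameter overrides mentioned above can be pictured as edits to a nested parameter mapping that are made before the snapshot is taken. The following is a minimal sketch under stated assumptions: the `key.subkey=value` override syntax and the helper names `apply_overrides` and `_parse_scalar` are hypothetical and chosen for illustration; they are not taken from the DVC code base.

```python
from copy import deepcopy
from typing import Any


def apply_overrides(params: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Return a copy of params with 'dotted.key=value' overrides applied."""
    result = deepcopy(params)  # never mutate the baseline parameters
    for override in overrides:
        key, sep, raw_value = override.partition("=")
        if not sep:
            raise ValueError(f"override must look like key=value: {override!r}")
        node = result
        *parents, leaf = key.split(".")
        for part in parents:
            # Create intermediate mappings on demand.
            node = node.setdefault(part, {})
        node[leaf] = _parse_scalar(raw_value)
    return result


def _parse_scalar(text: str) -> Any:
    """Interpret a value as bool, int, or float when possible, else keep the string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


baseline = {"train": {"lr": 0.1, "epochs": 10}}
modified = apply_overrides(baseline, ["train.lr=0.01", "train.shuffle=true"])
```

Working on a copy keeps the baseline parameters untouched, which mirrors the idea that the overrides belong to the experiment snapshot rather than to the user's working state.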

The preparation process supports multiple isolation levels to accommodate different operational requirements. In-place execution runs the experiment directly in the current workspace, which is fast but prevents concurrent experiments. Temporary directory execution clones the workspace into a temp directory, allowing the user to continue working while the experiment runs. Distributed queue execution (via Celery) pushes the stash entry to a persistent task queue where it can be picked up by any worker node, enabling horizontal scaling of experiment execution. In all cases, the preparation step is identical: the workspace state plus parameter modifications are captured as a stash entry; only the execution context differs.

This principle is closely related to Git branching strategies but operates at a finer granularity. Rather than creating full branches for each experiment (which would pollute the branch namespace), experiments use a dedicated ref namespace (refs/exps/) that keeps experiment history separate from the main development history.
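As a rough illustration of how a dedicated namespace keeps experiment refs apart from branches, the sketch below builds ref names under `refs/exps/`. Only the `refs/exps/` prefix comes from the text above; the layout beneath it (baseline SHA split after two characters, followed by the experiment name) and the helper names are assumptions made for this example.

```python
EXPS_NAMESPACE = "refs/exps"  # prefix named in the text above


def experiment_ref(baseline_sha: str, name: str) -> str:
    """Build a ref name for an experiment derived from baseline_sha.

    The sharded layout is an assumption made for this sketch.
    """
    return f"{EXPS_NAMESPACE}/{baseline_sha[:2]}/{baseline_sha[2:]}/{name}"


def is_experiment_ref(ref: str) -> bool:
    """True for refs in the experiment namespace, False for branches and tags."""
    return ref.startswith(EXPS_NAMESPACE + "/")


ref = experiment_ref("abc1234def", "low-lr")
```

Because branch listings only look under `refs/heads/`, refs created this way do not appear as branches.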

Usage

Use experiment workspace preparation when:

You need to guarantee that an experiment starts from a known, reproducible state
You want to run experiments with parameter modifications without altering the current working directory
You need to queue multiple experiments for sequential or parallel execution
You are building a distributed experiment pipeline where workers need self-contained experiment descriptions
You need an undo mechanism -- since the workspace state is stashed, it can be restored if the experiment fails

This pattern applies to any system that must support concurrent or queued experiments while maintaining reproducibility guarantees.

Theoretical Basis

Experiment workspace preparation follows a capture-isolate-execute pattern:

function prepare_experiment(workspace, overrides, name, isolation_level):
    # Step 1: Capture baseline state
    baseline_rev = git_get_head_rev()

    # Step 2: Apply parameter modifications
    for each (path, override_list) in overrides:
        apply_overrides(path, override_list)

    # Step 3: Create stash entry with modifications
    # (name is the optional human-readable experiment name)
    stash_entry = git_stash_push(
        include_untracked=True,
        message=encode_metadata(baseline_rev, name)
    )

    # Step 4: Select execution context based on isolation level
    if isolation_level == IN_PLACE:
        queue = workspace_queue
    elif isolation_level == TEMP_DIR:
        queue = tempdir_queue
    elif isolation_level == DISTRIBUTED:
        queue = celery_queue
    else:
        raise error("unknown isolation level")

    # Step 5: Enqueue for execution
    entry = queue.put(stash_entry)
    return entry
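
The pattern above can be exercised without Git by using in-memory stand-ins. The following is a minimal sketch, not DVC's implementation: the `StashEntry` dataclass, the `IsolationLevel` enum, the dictionary-based workspace, and the SHA-1 identifier derived from the entry contents are all assumptions introduced for illustration.

```python
import hashlib
import json
from collections import deque
from dataclasses import dataclass
from enum import Enum


class IsolationLevel(Enum):
    IN_PLACE = "workspace"
    TEMP_DIR = "tempdir"
    DISTRIBUTED = "celery"


@dataclass(frozen=True)
class StashEntry:
    entry_id: str
    baseline_rev: str
    name: str
    files: dict  # path -> parameter mapping, a stand-in for real file contents


def prepare_experiment(workspace, overrides, name, isolation_level, queues, baseline_rev):
    # Steps 1-2: copy the workspace and apply overrides to the copy only.
    files = {path: dict(params) for path, params in workspace.items()}
    for path, changes in overrides.items():
        files.setdefault(path, {}).update(changes)

    # Step 3: derive a content-based identifier, loosely mimicking a Git object hash.
    payload = json.dumps([baseline_rev, name, files], sort_keys=True)
    entry_id = hashlib.sha1(payload.encode()).hexdigest()
    entry = StashEntry(entry_id, baseline_rev, name, files)

    # Steps 4-5: only the target queue depends on the isolation level.
    queues[isolation_level].append(entry)
    return entry


queues = {level: deque() for level in IsolationLevel}
workspace = {"params.yaml": {"lr": 0.1, "epochs": 10}}
entry = prepare_experiment(
    workspace, {"params.yaml": {"lr": 0.01}}, "low-lr",
    IsolationLevel.TEMP_DIR, queues, baseline_rev="abc1234",
)
```

In this sketch the capture logic is the same for every isolation level; the level only selects which queue receives the entry.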

The stash entry is the fundamental unit of experiment identity. It contains:

Baseline revision: The Git commit SHA from which the experiment is derived
Modified files: Any parameter files or code changes applied as overrides
Metadata: Experiment name, timestamp, and queue assignment
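
One way to carry this information in a stash entry is to encode it in the stash message and parse it back when the entry is dequeued. The sketch below assumes a hypothetical `exp:<baseline_rev>:<name>` message format and covers only the baseline revision and the name; the format, the regular expression, and the function names are illustrative and should not be read as DVC's actual encoding.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical message layout used only in this sketch.
MESSAGE_FORMAT = "exp:{baseline_rev}:{name}"
MESSAGE_RE = re.compile(r"exp:(?P<baseline_rev>[0-9a-f]+):(?P<name>.*)")


class ExperimentMetadata(NamedTuple):
    baseline_rev: str
    name: str


def encode_metadata(baseline_rev: str, name: str = "") -> str:
    """Pack the baseline revision and optional name into a stash message."""
    return MESSAGE_FORMAT.format(baseline_rev=baseline_rev, name=name)


def decode_metadata(message: str) -> Optional[ExperimentMetadata]:
    """Recover metadata from a stash message, or None for unrelated stashes."""
    match = MESSAGE_RE.fullmatch(message)
    if match is None:
        return None
    return ExperimentMetadata(match["baseline_rev"], match["name"])


message = encode_metadata("abc1234", "low-lr")
decoded = decode_metadata(message)
```

Returning `None` for messages that do not match lets a queue skip ordinary user stashes instead of misreading them as experiments.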

The name uniqueness constraint ensures that no two experiments under the same baseline share the same name, preventing ambiguity when referencing experiment results. If a name collision is detected and the force flag is not set, the preparation step raises an error.
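
The uniqueness rule can be sketched as a lookup keyed by baseline revision. This is an illustrative stand-in: the `ExperimentExistsError` class is defined locally, and the `reserve_name` helper and the in-memory registry are hypothetical rather than part of any real API.

```python
class ExperimentExistsError(Exception):
    """Raised when a name is already taken under the same baseline."""


def reserve_name(registry: dict[str, set[str]], baseline_rev: str,
                 name: str, force: bool = False) -> None:
    """Record name under baseline_rev, rejecting duplicates unless force is set."""
    names = registry.setdefault(baseline_rev, set())
    if name in names and not force:
        raise ExperimentExistsError(
            f"experiment {name!r} already exists for baseline {baseline_rev}"
        )
    names.add(name)


registry: dict[str, set[str]] = {}
reserve_name(registry, "abc1234", "low-lr")
reserve_name(registry, "def5678", "low-lr")              # different baseline: allowed
reserve_name(registry, "abc1234", "low-lr", force=True)  # collision overridden
```

Scoping the check to the baseline revision means the same name can be reused for experiments derived from different commits.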

The stash-based approach provides several theoretical guarantees:

Atomicity: The entire workspace state is captured in a single Git operation
Isolation: Each experiment operates on its own copy of the workspace state
Reversibility: The stash entry can be popped or applied to restore the original state
Portability: The stash entry is a standard Git object that can be transferred between repositories

Related Pages

Implemented By

Implementation:Iterative_Dvc_Experiments_New
