Principle:Iterative Dvc Experiment Workspace Preparation
| Knowledge Sources | |
|---|---|
| Domains | Experiment_Management, Reproducibility |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Experiment workspace preparation is the process of creating an isolated, reproducible environment for an experiment by capturing the current workspace state along with parameter modifications in a version-controlled snapshot.
Description
When running machine learning experiments, the workspace -- consisting of source code, configuration files, data references, and pipeline definitions -- must be in a known, reproducible state before execution begins. Experiment workspace preparation solves the problem of ensuring that each experiment starts from a well-defined baseline, with any intended parameter modifications cleanly applied, regardless of the execution environment.
The core mechanism is stash-based versioning: the current workspace state, including any parameter overrides specified by the user, is captured as a Git stash entry. This stash entry serves as a self-contained snapshot that encodes the exact starting conditions for the experiment. The stash entry is assigned a unique identifier and optionally a human-readable name, and it is placed into a queue for execution.
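The capture step can be illustrated with plain Git commands. The following is a minimal sketch, assuming `git` is on the PATH; the helper names (`git`, `snapshot_workspace`, `demo`), the `exp:<baseline>:<name>` message format, and the `params.yaml` contents are hypothetical and are not taken from DVC's implementation.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path

GIT_AVAILABLE = shutil.which("git") is not None


def git(repo, *args):
    """Run a git command inside `repo` and return its stripped stdout."""
    # A fixed identity lets commits and stashes be created in a clean sandbox.
    env = dict(
        os.environ,
        GIT_AUTHOR_NAME="exp", GIT_AUTHOR_EMAIL="exp@example.com",
        GIT_COMMITTER_NAME="exp", GIT_COMMITTER_EMAIL="exp@example.com",
    )
    result = subprocess.run(
        ["git", "-c", "commit.gpgsign=false", *args],
        cwd=repo, env=env, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def snapshot_workspace(repo, message):
    """Capture tracked and untracked changes as a stash entry; return its SHA."""
    git(repo, "stash", "push", "--include-untracked", "-m", message)
    return git(repo, "rev-parse", "stash@{0}")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "-q")
        (repo / "params.yaml").write_text("lr: 0.01\n")
        git(repo, "add", "params.yaml")
        git(repo, "commit", "-q", "-m", "baseline")
        baseline = git(repo, "rev-parse", "HEAD")

        # Apply a parameter override, then capture it as a stash entry.
        (repo / "params.yaml").write_text("lr: 0.1\n")
        stash_sha = snapshot_workspace(repo, f"exp:{baseline}:lr-test")

        # The stash commit's first parent is the baseline commit.
        parent = git(repo, "rev-parse", f"{stash_sha}^1")
        # Stashing resets the working tree to the baseline content.
        restored = (repo / "params.yaml").read_text()
        # The overridden file lives inside the stash commit.
        stashed = git(repo, "show", f"{stash_sha}:params.yaml")
        return baseline, parent, restored, stashed
```

In this sketch the stash commit records both the baseline (as its first parent) and the overridden parameter file, which is what lets it serve as a self-contained description of the experiment's starting conditions.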
The preparation process supports multiple isolation levels to accommodate different operational requirements. In-place execution runs the experiment directly in the current workspace, which is fast but prevents concurrent experiments. Temporary directory execution clones the workspace into a temp directory, allowing the user to continue working while the experiment runs. Distributed queue execution (via Celery) pushes the stash entry to a persistent task queue where it can be picked up by any worker node, enabling horizontal scaling of experiment execution. In all cases, the preparation step is identical: the workspace state plus parameter modifications are captured as a stash entry; only the execution context differs.
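The point that only the execution context differs can be sketched as a lookup from isolation level to queue. This is an illustrative model, not DVC's code: `Isolation`, `QueueRegistry`, and the queue labels are hypothetical names, and in-memory `queue.Queue` objects stand in for the real workspace, temporary-directory, and Celery-backed queues.

```python
from enum import Enum
from queue import Queue


class Isolation(Enum):
    """The three isolation levels described above (labels are illustrative)."""
    IN_PLACE = "workspace"
    TEMP_DIR = "tempdir"
    DISTRIBUTED = "celery"


class QueueRegistry:
    """Holds one queue per isolation level; preparation itself is shared."""

    def __init__(self):
        self._queues = {level: Queue() for level in Isolation}

    def enqueue(self, stash_entry, level):
        """Place an already-prepared stash entry on the queue for `level`.

        Returns the queue label and the number of entries now waiting on it.
        """
        target = self._queues[level]
        target.put(stash_entry)
        return level.value, target.qsize()
```

Because the stash entry is produced before the queue is chosen, the same entry could be routed to any of the three contexts without changing how it was prepared.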
This principle is closely related to Git branching strategies but operates at a finer granularity. Rather than creating full branches for each experiment (which would pollute the branch namespace), experiments use a dedicated ref namespace (refs/exps/) that keeps experiment history separate from the main development history.
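A dedicated namespace can be modeled as a simple ref-naming function. In the sketch below only the `refs/exps/` prefix comes from the text above; the layout beneath it (sharding by the first two characters of the baseline SHA, then the experiment name) and the function names are assumptions made for illustration.

```python
EXPS_NAMESPACE = "refs/exps"


def exp_ref(baseline_sha, name):
    """Build a ref name for an experiment derived from `baseline_sha`.

    Assumed layout: refs/exps/<sha[:2]>/<sha[2:]>/<name>.
    """
    return "/".join([EXPS_NAMESPACE, baseline_sha[:2], baseline_sha[2:], name])


def is_exp_ref(ref):
    """Return True if `ref` lives in the experiment namespace."""
    return ref.startswith(EXPS_NAMESPACE + "/")
```

Refs under this prefix do not appear among `refs/heads/`, so ordinary branch listings stay free of experiment entries.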
Usage
Use experiment workspace preparation when:
- You need to guarantee that an experiment starts from a known, reproducible state
- You want to run experiments with parameter modifications without altering the current working directory
- You need to queue multiple experiments for sequential or parallel execution
- You are building a distributed experiment pipeline where workers need self-contained experiment descriptions
- You need an undo mechanism -- since the workspace state is stashed, it can be restored if the experiment fails
Apply this principle when designing any system that must support concurrent or queued experiments while maintaining reproducibility guarantees.
Theoretical Basis
Experiment workspace preparation follows a capture-isolate-execute pattern:
    def prepare_experiment(workspace, overrides, isolation_level, name=None):
        # Helper functions and queue objects below are placeholders, not a real API.
        # Step 1: Capture baseline state
        baseline_rev = git_get_head_rev()
        # Step 2: Apply parameter modifications
        for path, override_list in overrides.items():
            apply_overrides(path, override_list)
        # Step 3: Create stash entry with modifications
        stash_entry = git_stash_push(
            include_untracked=True,
            message=encode_metadata(baseline_rev, name),
        )
        # Step 4: Select execution context based on isolation level
        if isolation_level == IN_PLACE:
            queue = workspace_queue
        elif isolation_level == TEMP_DIR:
            queue = tempdir_queue
        elif isolation_level == DISTRIBUTED:
            queue = celery_queue
        else:
            raise ValueError(f"unknown isolation level: {isolation_level}")
        # Step 5: Enqueue for execution
        entry = queue.put(stash_entry)
        return entry
The stash entry is the fundamental unit of experiment identity. It contains:
- Baseline revision: The Git commit SHA from which the experiment is derived
- Modified files: Any parameter files or code changes applied as overrides
- Metadata: Experiment name, timestamp, and queue assignment
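One way to carry the metadata fields listed above is to serialize them into the stash message, as `encode_metadata` does in the pseudocode. The sketch below is a hypothetical encoding: the `exp-meta:` marker, the JSON payload, and the `ExperimentMetadata` class are illustrative choices, not the format DVC uses. The modified files themselves are stored in the stash commit's tree, not in the message.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

PREFIX = "exp-meta:"  # hypothetical marker identifying experiment stash messages


@dataclass(frozen=True)
class ExperimentMetadata:
    baseline_rev: str      # Git commit SHA the experiment is derived from
    name: Optional[str]    # optional human-readable experiment name
    timestamp: str         # ISO 8601 creation time
    queue: str             # queue the entry was assigned to


def encode_metadata(meta):
    """Serialize metadata into a single-line stash message."""
    return PREFIX + json.dumps(asdict(meta), sort_keys=True)


def decode_metadata(message):
    """Recover metadata from a stash message.

    Git prepends text such as "On main: " to stash messages, so the marker
    is searched for instead of being required at position 0.
    """
    start = message.find(PREFIX)
    if start == -1:
        raise ValueError("not an experiment stash message")
    return ExperimentMetadata(**json.loads(message[start + len(PREFIX):]))
```

Keeping the metadata in the message means a worker that receives only the stash commit can still recover the baseline revision and experiment name.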
The name uniqueness constraint ensures that no two experiments under the same baseline share the same name, preventing ambiguity when referencing experiment results. If a name collision is detected and the force flag is not set, the preparation step raises an error.
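The uniqueness rule can be sketched as a registry keyed by baseline revision. The class and exception names below are hypothetical, and an in-memory dictionary stands in for whatever lookup over existing experiment refs a real system would perform.

```python
class ExperimentExistsError(Exception):
    """Raised when a name is already taken under the same baseline."""


class NameRegistry:
    def __init__(self):
        # Maps baseline revision -> set of experiment names in use.
        self._names = {}

    def reserve(self, baseline_rev, name, force=False):
        """Reserve `name` under `baseline_rev`.

        Raises ExperimentExistsError on a collision unless `force` is set,
        in which case the existing name is reused.
        """
        names = self._names.setdefault(baseline_rev, set())
        if name in names and not force:
            raise ExperimentExistsError(
                f"experiment '{name}' already exists for baseline {baseline_rev}"
            )
        names.add(name)
        return name
```

Note that the constraint is scoped to a baseline: the same name may be reserved under a different baseline revision without conflict.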
The stash-based approach provides several theoretical guarantees:
- Atomicity: The entire workspace state is captured in a single Git operation
- Isolation: Each experiment is defined by its own snapshot of the workspace state, so later changes to the workspace do not affect it
- Reversibility: The stash entry can be popped or applied to restore the captured workspace state
- Portability: The stash entry is a standard Git object that can be transferred between repositories