Workflow: Iterative DVC Pipeline Reproduction
| Knowledge Sources | |
|---|---|
| Domains | ML_Pipelines, MLOps, Reproducibility |
| Last Updated | 2026-02-10 10:30 GMT |
Overview
End-to-end process for reproducing DVC pipelines by executing stages in dependency order, skipping stages whose inputs have not changed, and caching results for future reuse.
Description
This workflow covers the pipeline reproduction system that is central to DVC's value proposition. A DVC pipeline is defined in `dvc.yaml` as a directed acyclic graph (DAG) of stages, where each stage specifies commands to run, input dependencies, output artifacts, and optional metrics/parameters. When `dvc repro` is invoked, DVC determines which stages are outdated (their dependencies have changed since the last run), computes the correct execution order via topological sort, and runs only the necessary stages. Results are cached so identical computations are never repeated.
Goal: An up-to-date set of pipeline outputs with a locked `dvc.lock` file recording the exact state of all dependencies and outputs.
Scope: From pipeline definition (`dvc.yaml`) through stage execution to lockfile update.
Strategy: Graph-based dependency resolution with content-hash change detection and run-cache optimization.
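As a hedged illustration of the pipeline definition described above, a minimal `dvc.yaml` with two stages might look like the following (stage names, scripts, and paths are invented for this sketch):

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    params:
      - train.epochs
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Here `train` is downstream of `prepare` because `data/clean.csv` appears in one stage's `outs` and the other's `deps`; that shared path is what defines the DAG edge.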
Usage
Execute this workflow when:
- You have modified source code, parameters, or input data and need to regenerate downstream outputs
- You want to verify that a pipeline is fully reproducible from its current inputs
- You are setting up a CI/CD pipeline that must rebuild ML artifacts on changes
- You want to selectively re-run parts of a pipeline using target stages
Execution Steps
Step 1: Load Pipeline Definition
DVC reads the `dvc.yaml` file (and any imported pipeline files) to construct the full pipeline graph. Each stage definition is parsed, including its command, dependencies, outputs, parameters, metrics, and plots declarations. Variable interpolation from `params.yaml` is resolved at this point using the parsing/templating engine.
Key considerations:
- `foreach` constructs are expanded into individual stages
- Variable interpolation supports `${item}` syntax with nested resolution
- The `dvc.lock` file is consulted to determine the last known state of each stage
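The `foreach` expansion mentioned above can be sketched with a hypothetical stage (the stage name, script, and paths are invented; the `foreach`/`do` syntax and `${item}` interpolation are DVC's):

```yaml
stages:
  featurize:
    foreach:
      - train
      - test
    do:
      cmd: python featurize.py data/${item}.csv features/${item}.pkl
      deps:
        - featurize.py
        - data/${item}.csv
      outs:
        - features/${item}.pkl
```

DVC expands this into two concrete stages, addressable as `featurize@train` and `featurize@test`, each with `${item}` substituted throughout.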
Step 2: Build Dependency Graph
DVC constructs a directed acyclic graph (DAG) where nodes are stages and edges represent dependency relationships. If specific targets are provided, a subgraph is extracted. Frozen stages have their dependency edges removed so they act as fixed inputs.
Key considerations:
- The `--pipeline` flag includes the entire connected pipeline, not just upstream stages
- The `--downstream` flag reverses the graph to include stages that depend on the target
- The `--single-item` flag skips graph construction entirely and runs only the named stage
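The edge-derivation idea behind Step 2 can be sketched in a few lines of Python. This is a toy model, not DVC's actual implementation: the dict shapes and the `build_graph` helper are assumptions made for illustration.

```python
def build_graph(stages):
    """Link stages whose dependencies are produced by other stages' outputs.

    `stages` maps stage name -> {"deps": [...], "outs": [...]}.
    Returns a mapping: stage name -> set of upstream stage names.
    """
    # Index every declared output by the stage that produces it.
    producers = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {}
    for name, s in stages.items():
        # A dependency that no stage produces (e.g. raw data) creates no edge.
        graph[name] = {producers[d] for d in s["deps"] if d in producers}
    return graph

stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl"], "outs": ["metrics.json"]},
}
```

Extracting a subgraph for a target then amounts to walking these edges upstream from the target stage.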
Step 3: Plan Execution Order
DVC derives the execution order with a depth-first post-order traversal of the dependency subgraph, yielding a topological sort in which every stage appears after all of its dependencies.
Key considerations:
- Post-order traversal guarantees dependencies execute before dependents
- The execution plan can be previewed with `--dry` mode
- Circular dependencies are rejected during graph construction
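The traversal described in this step can be sketched as follows. This is a simplified stand-in for DVC's planner, assuming the toy graph representation (stage name -> set of upstream stage names) rather than DVC's internal structures:

```python
def plan(graph, targets):
    """Return stages in execution order via depth-first post-order traversal.

    `graph` maps stage -> set of upstream stages. Raises on a cycle,
    mirroring the rejection of circular dependencies during graph
    construction.
    """
    order, done, visiting = [], set(), set()

    def visit(stage):
        if stage in done:
            return
        if stage in visiting:
            raise ValueError(f"circular dependency at {stage!r}")
        visiting.add(stage)
        for upstream in graph.get(stage, ()):
            visit(upstream)  # dependencies first (post-order)
        visiting.discard(stage)
        done.add(stage)
        order.append(stage)  # appended only after all upstream stages

    for target in targets:
        visit(target)
    return order
```

For example, `plan({"prepare": set(), "train": {"prepare"}, "evaluate": {"train"}}, ["evaluate"])` yields `["prepare", "train", "evaluate"]`: the post-order append guarantees each stage lands after everything it depends on.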
Step 4: Check Stage Freshness
For each stage in execution order, DVC compares the current content hashes of all dependencies and outputs against the values recorded in `dvc.lock`. A stage is considered stale if any dependency hash has changed or any output is missing. Fresh stages are skipped entirely.
Key considerations:
- Parameter dependencies track specific keys within YAML/JSON/TOML files
- The run-cache is consulted to check if an identical computation has been performed before
- The `--force` flag bypasses freshness checks and re-runs all stages
- The `--force-downstream` flag forces re-execution of all stages downstream of any changed stage
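The core freshness rule, stripped of directory handling, parameter keys, and the run-cache, can be sketched like this. The `is_stale` helper and the lock-entry dict shape are assumptions for illustration; DVC does use MD5 for file-level content hashing:

```python
import hashlib
import os

def file_md5(path):
    """Content hash of a single file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def is_stale(stage, lock_entry):
    """A stage is stale if any dependency hash changed or any output is missing."""
    if lock_entry is None:  # never recorded in the lockfile: must run
        return True
    for dep in stage["deps"]:
        if file_md5(dep) != lock_entry["deps"].get(dep):
            return True
    return any(not os.path.exists(out) for out in stage["outs"])
```

A stage whose hashes all match the lockfile and whose outputs are present is skipped, which is what makes repeated `dvc repro` runs on an unchanged workspace near-instant.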
Step 5: Execute Stages
Each stale stage is executed by spawning a subprocess with the stage's command. DVC captures the process exit code and aborts on failure (unless `--ignore-errors` or `--keep-going` is specified). Before execution, dependencies are verified; after execution, outputs are saved to the cache.
Key considerations:
- Stage execution happens in the stage's working directory
- The `--keep-going` error mode continues execution of independent branches after a failure
- The `--ignore-errors` mode continues all execution regardless of failures
- Failed stages cause downstream dependents to be skipped by default
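The failure-handling behavior above can be sketched with a small driver. This is a simplified model, not DVC's executor: the `execute` helper and its arguments are assumptions, and it only distinguishes the default abort-on-failure mode from `--keep-going`-style continuation of independent branches:

```python
import subprocess

def execute(order, graph, commands, keep_going=False):
    """Run stages in topological order; skip dependents of failed stages.

    `order` is a topologically sorted stage list, `graph` maps stage ->
    upstream stages, `commands` maps stage -> shell command. Returns the
    set of stages that failed or were skipped because an upstream failed.
    """
    blocked = set()
    for stage in order:
        if graph.get(stage, set()) & blocked:  # an upstream stage failed
            blocked.add(stage)  # skip this stage; it would use bad inputs
            continue
        result = subprocess.run(commands[stage], shell=True)
        if result.returncode != 0:
            blocked.add(stage)
            if not keep_going:
                break  # default mode: abort the whole reproduction
    return blocked
```

With `keep_going=True`, a failure in one branch still lets unrelated branches finish; in the default mode the first nonzero exit code aborts the run.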
Step 6: Update Lockfile and Cache
After successful execution, DVC updates `dvc.lock` with the new content hashes of all dependencies and outputs. Output files are transferred to the local cache using the configured link type. The updated lockfile and any modified `.gitignore` files are staged for Git commit.
Key considerations:
- The lockfile records exact hash values, ensuring bit-for-bit reproducibility verification
- Run results are stored in the run-cache for future deduplication
- The `pipeline` field in the lockfile is not updated unless the stage definition itself changed
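The cache transfer in this step can be sketched as content-addressed storage. This is a simplified model: the `cache_output` helper is an assumption, the exact cache directory layout varies between DVC versions, and real DVC can reflink, hardlink, or symlink instead of copying, per the configured link type:

```python
import hashlib
import os
import shutil

def cache_output(path, cache_dir):
    """Store a file in a content-addressed cache.

    The first two hex digits of the content hash name a subdirectory and
    the remainder names the file, so identical content is stored exactly
    once. Returns the content hash for recording in the lockfile.
    """
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # deduplicate by content
        shutil.copy2(path, dest)
    return md5
```

Because the cache key is the content hash, re-running a stage that produces byte-identical output adds nothing new to the cache, and the hash written to `dvc.lock` is sufficient to retrieve the artifact later.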