Workflow: Iterative DVC Pipeline Reproduction
| Knowledge Sources | |
|---|---|
| Domains | ML_Pipelines, MLOps, Reproducibility |
| Last Updated | 2026-02-10 10:30 GMT |
Overview
End-to-end process for reproducing DVC pipelines by executing stages in dependency order, skipping stages whose inputs have not changed, and caching results for future reuse.
Description
This workflow covers the pipeline reproduction system that is central to DVC's value proposition. A DVC pipeline is defined in `dvc.yaml` as a directed acyclic graph (DAG) of stages, where each stage specifies commands to run, input dependencies, output artifacts, and optional metrics/parameters. When `dvc repro` is invoked, DVC determines which stages are outdated (their dependencies have changed since the last run), computes the correct execution order via topological sort, and runs only the necessary stages. Results are cached so identical computations are never repeated.
Goal: An up-to-date set of pipeline outputs with a locked `dvc.lock` file recording the exact state of all dependencies and outputs.
Scope: From pipeline definition (`dvc.yaml`) through stage execution to lockfile update.
Strategy: Graph-based dependency resolution with content-hash change detection and run-cache optimization.
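As a hedged illustration of the pipeline definition described above, a minimal `dvc.yaml` with two stages might look like the following (stage names, scripts, and paths are invented for this sketch):

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    params:
      - train.epochs
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Here `train` is downstream of `prepare` because `data/clean.csv` appears in one stage's `outs` and the other's `deps`; that shared path is what defines the DAG edge.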
Usage
Execute this workflow when:
- You have modified source code, parameters, or input data and need to regenerate downstream outputs
- You want to verify that a pipeline is fully reproducible from its current inputs
- You are setting up a CI/CD pipeline that must rebuild ML artifacts on changes
- You want to selectively re-run parts of a pipeline using target stages
Execution Steps
Step 1: Load Pipeline Definition
DVC reads the `dvc.yaml` file (and any imported pipeline files) to construct the full pipeline graph. Each stage definition is parsed, including its command, dependencies, outputs, parameters, metrics, and plots declarations. Variable interpolation from `params.yaml` is resolved at this point using the parsing/templating engine.
Key considerations:
- `foreach` constructs are expanded into individual stages
- Variable interpolation supports `${item}` syntax with nested resolution
- The `dvc.lock` file is consulted to determine the last known state of each stage
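The `foreach` expansion mentioned above can be sketched with a hypothetical stage (the stage name, script, and paths are invented; the `foreach`/`do` syntax and `${item}` interpolation are DVC's):

```yaml
stages:
  featurize:
    foreach:
      - train
      - test
    do:
      cmd: python featurize.py data/${item}.csv features/${item}.pkl
      deps:
        - featurize.py
        - data/${item}.csv
      outs:
        - features/${item}.pkl
```

DVC expands this into two concrete stages, addressable as `featurize@train` and `featurize@test`, each with `${item}` substituted throughout.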
Step 2: Build Dependency Graph
DVC constructs a directed acyclic graph (DAG) where nodes are stages and edges represent dependency relationships. If specific targets are provided, a subgraph is extracted. Frozen stages have their dependency edges removed so they act as fixed inputs.
Key considerations:
- The `--pipeline` flag includes the entire connected pipeline, not just upstream stages
- The `--downstream` flag reverses the graph to include stages that depend on the target
- The `--single-item` flag skips graph construction entirely and runs only the named stage
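The edge-derivation idea behind Step 2 can be sketched in a few lines of Python. This is a toy model, not DVC's actual implementation: the dict shapes and the `build_graph` helper are assumptions made for illustration.

```python
def build_graph(stages):
    """Link stages whose dependencies are produced by other stages' outputs.

    `stages` maps stage name -> {"deps": [...], "outs": [...]}.
    Returns a mapping: stage name -> set of upstream stage names.
    """
    # Index every declared output by the stage that produces it.
    producers = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {}
    for name, s in stages.items():
        # A dependency that no stage produces (e.g. raw data) creates no edge.
        graph[name] = {producers[d] for d in s["deps"] if d in producers}
    return graph

stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl"], "outs": ["metrics.json"]},
}
```

Extracting a subgraph for a target then amounts to walking these edges upstream from the target stage.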
Step 3: Plan Execution Order
DVC derives the execution order with a depth-first post-order traversal of the dependency subgraph, yielding a topological sort in which every stage appears after all of its dependencies.
Key considerations:
- Post-order traversal guarantees dependencies execute before dependents
- The execution plan can be previewed with `--dry` mode
- Circular dependencies are rejected during graph construction
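The traversal described in this step can be sketched as follows. This is a simplified stand-in for DVC's planner, assuming the toy graph representation (stage name -> set of upstream stage names) rather than DVC's internal structures:

```python
def plan(graph, targets):
    """Return stages in execution order via depth-first post-order traversal.

    `graph` maps stage -> set of upstream stages. Raises on a cycle,
    mirroring the rejection of circular dependencies during graph
    construction.
    """
    order, done, visiting = [], set(), set()

    def visit(stage):
        if stage in done:
            return
        if stage in visiting:
            raise ValueError(f"circular dependency at {stage!r}")
        visiting.add(stage)
        for upstream in graph.get(stage, ()):
            visit(upstream)  # dependencies first (post-order)
        visiting.discard(stage)
        done.add(stage)
        order.append(stage)  # appended only after all upstream stages

    for target in targets:
        visit(target)
    return order
```

For example, `plan({"prepare": set(), "train": {"prepare"}, "evaluate": {"train"}}, ["evaluate"])` yields `["prepare", "train", "evaluate"]`: the post-order append guarantees each stage lands after everything it depends on.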
Step 4: Check Stage Freshness
For each stage in execution order, DVC compares the current content hashes of all dependencies and outputs against the values recorded in `dvc.lock`. A stage is considered stale if any dependency hash has changed or any output is missing. Fresh stages are skipped entirely.
Key considerations:
- Parameter dependencies track specific keys within YAML/JSON/TOML files
- The run-cache is consulted to check if an identical computation has been performed before
- The `--force` flag bypasses freshness checks and re-runs all stages
- The `--force-downstream` flag forces re-execution of all stages downstream of any changed stage
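The core freshness rule, stripped of directory handling, parameter keys, and the run-cache, can be sketched like this. The `is_stale` helper and the lock-entry dict shape are assumptions for illustration; DVC does use MD5 for file-level content hashing:

```python
import hashlib
import os

def file_md5(path):
    """Content hash of a single file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def is_stale(stage, lock_entry):
    """A stage is stale if any dependency hash changed or any output is missing."""
    if lock_entry is None:  # never recorded in the lockfile: must run
        return True
    for dep in stage["deps"]:
        if file_md5(dep) != lock_entry["deps"].get(dep):
            return True
    return any(not os.path.exists(out) for out in stage["outs"])
```

A stage whose hashes all match the lockfile and whose outputs are present is skipped, which is what makes repeated `dvc repro` runs on an unchanged workspace near-instant.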
Step 5: Execute Stages
Each stale stage is executed by spawning a subprocess with the stage's command. DVC captures the process exit code and aborts on failure (unless `--ignore-errors` or `--keep-going` is specified). Before execution, dependencies are verified; after execution, outputs are saved to the cache.
Key considerations:
- Stage execution happens in the stage's working directory
- The `--keep-going` error mode continues execution of independent branches after a failure
- The `--ignore-errors` mode continues all execution regardless of failures
- Failed stages cause downstream dependents to be skipped by default
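The failure-handling behavior above can be sketched with a small driver. This is a simplified model, not DVC's executor: the `execute` helper and its arguments are assumptions, and it only distinguishes the default abort-on-failure mode from `--keep-going`-style continuation of independent branches:

```python
import subprocess

def execute(order, graph, commands, keep_going=False):
    """Run stages in topological order; skip dependents of failed stages.

    `order` is a topologically sorted stage list, `graph` maps stage ->
    upstream stages, `commands` maps stage -> shell command. Returns the
    set of stages that failed or were skipped because an upstream failed.
    """
    blocked = set()
    for stage in order:
        if graph.get(stage, set()) & blocked:  # an upstream stage failed
            blocked.add(stage)  # skip this stage; it would use bad inputs
            continue
        result = subprocess.run(commands[stage], shell=True)
        if result.returncode != 0:
            blocked.add(stage)
            if not keep_going:
                break  # default mode: abort the whole reproduction
    return blocked
```

With `keep_going=True`, a failure in one branch still lets unrelated branches finish; in the default mode the first nonzero exit code aborts the run.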
Step 6: Update Lockfile and Cache
After successful execution, DVC updates `dvc.lock` with the new content hashes of all dependencies and outputs. Output files are transferred to the local cache using the configured link type. The updated lockfile and any modified `.gitignore` files are staged for Git commit.
Key considerations:
- The lockfile records exact hash values, ensuring bit-for-bit reproducibility verification
- Run results are stored in the run-cache for future deduplication
- The `pipeline` field in the lockfile is not updated unless the stage definition itself changed
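The cache transfer in this step can be sketched as content-addressed storage. This is a simplified model: the `cache_output` helper is an assumption, the exact cache directory layout varies between DVC versions, and real DVC can reflink, hardlink, or symlink instead of copying, per the configured link type:

```python
import hashlib
import os
import shutil

def cache_output(path, cache_dir):
    """Store a file in a content-addressed cache.

    The first two hex digits of the content hash name a subdirectory and
    the remainder names the file, so identical content is stored exactly
    once. Returns the content hash for recording in the lockfile.
    """
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # deduplicate by content
        shutil.copy2(path, dest)
    return md5
```

Because the cache key is the content hash, re-running a stage that produces byte-identical output adds nothing new to the cache, and the hash written to `dvc.lock` is sufficient to retrieve the artifact later.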