Implementation:Iterative Dvc Build Graph
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Management, Graph_Theory |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for constructing and validating a directed acyclic graph (DAG) from DVC pipeline stage definitions, provided by the DVC library.
Description
The build_graph function in DVC's dvc/repo/graph.py module constructs a networkx.DiGraph from a list of stage objects. Nodes in the graph are stages, and directed edges are created when the output of one stage is consumed as a dependency by another stage. The function performs comprehensive validation during construction: it checks for stages whose definition files reside inside output directories (StagePathAsOutputError), detects cycles using networkx's find_cycle (CyclicGraphError), and relies on a trie data structure (built by build_outs_trie) to efficiently detect overlapping and duplicate output paths.
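The duplicate and overlap checks can be illustrated with a small standard-library sketch. This is not DVC's implementation: the nested-dict trie, the `find_output_conflicts` function, and the `(stage_name, path)` input format are all hypothetical and chosen only to show why a trie keyed on path components makes both checks cheap.

```python
from pathlib import PurePosixPath

_END = object()  # sentinel key: "an output ends at this trie node" (hypothetical)


def _first_output_below(node):
    """Return the path of some output stored at or below this trie node."""
    for key, child in node.items():
        if key is _END:
            return child[1]
        found = _first_output_below(child)
        if found is not None:
            return found
    return None


def find_output_conflicts(outs):
    """Insert (stage_name, path) pairs into a trie and report conflicts.

    Returns a list of (kind, new_path, existing_path) tuples, where kind is
    "duplicate" (same path registered twice) or "overlap" (one output path
    lies inside another output path).
    """
    root = {}
    conflicts = []
    for stage, path in outs:
        node = root
        for part in PurePosixPath(path).parts:
            # An output already registered on an ancestor directory.
            if _END in node:
                conflicts.append(("overlap", path, node[_END][1]))
            node = node.setdefault(part, {})
        if _END in node:
            conflicts.append(("duplicate", path, node[_END][1]))
        elif node:
            # Outputs already registered somewhere below this path.
            conflicts.append(("overlap", path, _first_output_below(node)))
        node[_END] = (stage, path)
    return conflicts


conflicts = find_output_conflicts([
    ("prepare", "data/raw"),
    ("train", "model.pkl"),
    ("extra", "data/raw/part.csv"),  # inside the output directory data/raw
    ("retrain", "model.pkl"),        # same path as an existing output
])
```

Each insertion walks only the components of one path, so ancestor conflicts are found during the walk and descendant conflicts are found by looking at the node where the walk ends.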
The companion method Repo.check_graph in dvc/repo/__init__.py provides the high-level entry point used by commands like dvc add and dvc run. It merges new or modified stages into the repository's existing index and then calls check_graph() on the updated index, which internally delegates to build_graph. This ensures that every stage addition or modification is validated against the complete set of stages in the repository.
Edge construction uses the output trie for efficient matching. For each dependency in each stage, the function queries the trie for both prefixes (outputs that are ancestor directories of the dependency) and subtries (outputs that are descendants of the dependency path). Both relationships create a data flow edge between the stages.
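The matching rule and the cycle check can be sketched without DVC or networkx. The example below is an illustration under stated assumptions: stages are plain dicts (a hypothetical format), a linear scan over outputs stands in for the trie lookup, and a depth-first search stands in for networkx's find_cycle. Edges run from the consuming stage to the producing stage, as in the description above.

```python
from pathlib import PurePosixPath


def _related(dep, out):
    """True if out equals dep, is an ancestor directory of dep, or lies below dep."""
    d, o = PurePosixPath(dep).parts, PurePosixPath(out).parts
    shorter = min(len(d), len(o))
    return d[:shorter] == o[:shorter]


def build_edges(stages):
    """stages: dict of name -> {"deps": [...], "outs": [...]} (hypothetical format).

    Returns a set of (consumer, producer) pairs.
    """
    edges = set()
    for consumer, spec in stages.items():
        for dep in spec["deps"]:
            for producer, other in stages.items():
                if any(_related(dep, out) for out in other["outs"]):
                    edges.add((consumer, producer))
    return edges


def find_cycle(edges):
    """Return a list of nodes forming a cycle, or None if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)
    stack = []

    def visit(node):
        state[node] = in_progress
        stack.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Back edge: the cycle is the stack from nxt onward.
                return stack[stack.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = done
        return None

    for node in graph:
        if state[node] == unvisited:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


stages = {
    "prepare": {"deps": ["raw.csv"], "outs": ["data/clean"]},
    "train": {"deps": ["data/clean/train.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data"], "outs": ["metrics.json"]},
}
edges = build_edges(stages)
```

Here `train` depends on a file inside the `data/clean` output directory (the prefix case), while `evaluate` depends on `data`, which contains that output (the subtrie case); both produce an edge to `prepare`.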
Usage
Use build_graph when you need to construct or validate the pipeline DAG programmatically -- for example, when implementing custom pipeline analysis tools or visualizations, or when testing stage definitions. Use Repo.check_graph when you are adding new stages to a repository and need to verify that the addition does not break the DAG invariants.
Code Reference
Source Location
- Repository: DVC
- File: dvc/repo/graph.py (build_graph), dvc/repo/__init__.py (check_graph)
- Lines: L80-154 (build_graph), L293-300 (check_graph)
Signature
def build_graph(
    stages: list["Stage"],
    outs_trie: Optional["Trie"] = None,
) -> "DiGraph":
    """Generate a graph by using the given stages.

    Nodes are stages. Edges go from a stage to the stage that produces
    its dependency.

    Raises:
        OutputDuplicationError: two outputs with the same path
        StagePathAsOutputError: stage inside an output directory
        OverlappingOutputPathsError: output inside output directory
        CyclicGraphError: resulting graph has cycles
    """
    ...


# In dvc/repo/__init__.py
class Repo:
    def check_graph(
        self,
        stages: Iterable["Stage"],
        callback: Optional[Callable] = None,
    ) -> None:
        ...
Import
from dvc.repo.graph import build_graph
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| stages | list[Stage] | Yes | The complete list of pipeline stages to include in the graph. Each stage has deps (dependencies) and outs (outputs) attributes that define the data flow relationships. |
| outs_trie | Optional[Trie] | No | A pre-built trie of output paths keyed on filesystem path components. If not provided, it is constructed internally by calling build_outs_trie(stages). Providing a pre-built trie avoids redundant computation when validating incrementally. |
| callback | Optional[Callable] | No | An optional callable invoked after the index update in check_graph, typically used to update progress indicators. Only used by Repo.check_graph. |
Outputs
| Name | Type | Description |
|---|---|---|
| (build_graph return) | networkx.DiGraph | A validated directed acyclic graph where nodes are Stage objects and edges represent data flow dependencies. Edges point from a stage to the stage that produces data it depends on. |
| (check_graph return) | None | Returns None on success. Raises CyclicGraphError, OverlappingOutputPathsError, OutputDuplicationError, or StagePathAsOutputError on validation failure. |
Usage Examples
Basic Usage
from dvc.exceptions import DvcException
from dvc.repo import Repo
from dvc.repo.graph import build_graph, get_pipelines

# Open an existing DVC repository
repo = Repo()

# Build the graph from all stages in the repository index
stages = list(repo.index.stages)
graph = build_graph(stages)

# Inspect graph structure
print(f"Stages: {len(graph.nodes)}")
print(f"Dependencies: {len(graph.edges)}")

# Find independent pipelines (weakly connected components)
pipelines = get_pipelines(graph)
print(f"Independent pipelines: {len(pipelines)}")

# Validate new stages before adding them.
# my_new_stage is a placeholder for a Stage object created elsewhere.
new_stages = [my_new_stage]
try:
    repo.check_graph(stages=new_stages)
    print("New stage is valid -- no cycles or conflicts.")
except DvcException as e:
    # The graph validation errors listed above are DVC exceptions.
    print(f"Validation failed: {e}")
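The grouping into independent pipelines treats the graph as undirected and collects connected components. A minimal standard-library sketch of that idea follows; `weak_components` and its list-based inputs are hypothetical and only illustrate the concept, not how get_pipelines is implemented.

```python
from collections import defaultdict, deque


def weak_components(nodes, edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from each node not yet assigned to a component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


nodes = ["prepare", "train", "evaluate", "report"]
edges = [("train", "prepare"), ("evaluate", "train")]
pipelines = weak_components(nodes, edges)
```

In this toy input, `report` shares no data with the other stages, so it forms a pipeline of its own.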