Implementation:Iterative Dvc Build Graph

From Leeroopedia


Knowledge Sources
Domains: Pipeline_Management, Graph_Theory
Last Updated: 2026-02-10 00:00 GMT

Overview

Concrete tool for constructing and validating a directed acyclic graph (DAG) from DVC pipeline stage definitions, provided by the DVC library.

Description

The build_graph function in DVC's dvc/repo/graph.py module constructs a networkx.DiGraph from a list of stage objects. Nodes in the graph are stages, and a directed edge is added whenever one stage consumes another stage's output as a dependency; the edge points from the consuming stage to the producing stage. The function validates the pipeline as it builds the graph: it rejects stages whose definition files reside inside output directories (StagePathAsOutputError), detects cycles using networkx's find_cycle (CyclicGraphError), and relies on a trie of output paths (built by build_outs_trie) to detect duplicate and overlapping outputs (OutputDuplicationError and OverlappingOutputPathsError).
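
To make the cycle check concrete, here is a minimal sketch of depth-first cycle detection on a plain adjacency dict. It is a pure-Python stand-in for illustration only: DVC itself uses networkx's find_cycle, and the function name find_stage_cycle, the string stage names, and the dict representation are all assumptions made for this example.

```python
from typing import Dict, List, Optional


def find_stage_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a node list (start node repeated at the end), or None.

    `graph` maps a stage name to the stages it depends on. This is a
    hypothetical stand-in for networkx.find_cycle, not DVC's code.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for succ in graph.get(node, []):
            state = color.get(succ, WHITE)
            if state == GREY:
                # Back edge: succ is already on the current DFS path.
                return path[path.index(succ):] + [succ]
            if state == WHITE:
                found = visit(succ)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# A linear pipeline has no cycle; a stage that feeds itself indirectly does.
linear = {"train": ["prepare"], "prepare": ["fetch"], "fetch": []}
looped = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(find_stage_cycle(linear))
print(find_stage_cycle(looped))
```

In a validator of this shape, a non-None result would be turned into an error that names the stages on the cycle, which is the role CyclicGraphError plays in build_graph.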

The companion method Repo.check_graph in dvc/repo/__init__.py is the high-level entry point used by commands like dvc add and dvc run. It merges the new or modified stages into the repository's existing index and then calls check_graph() on the resulting index, which in turn delegates to build_graph. Every stage addition or modification is therefore validated against the complete set of stages in the repository, not just the stages being added.
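
The merge-then-validate idea can be sketched with a toy model in which a stage is just a name mapped to its output paths. This is not DVC's index implementation; check_new_stages, check_outputs_unique, and DuplicateOutputError are hypothetical names used only to show why validation has to run on the merged set.

```python
from typing import Dict, List


class DuplicateOutputError(Exception):
    """Hypothetical error for this sketch: two stages claim the same output."""


def check_outputs_unique(stages: Dict[str, List[str]]) -> None:
    """Raise if any output path is produced by more than one stage."""
    owner: Dict[str, str] = {}
    for name, outs in stages.items():
        for out in outs:
            if out in owner:
                raise DuplicateOutputError(
                    f"{out!r} is produced by both {owner[out]!r} and {name!r}"
                )
            owner[out] = name


def check_new_stages(
    existing: Dict[str, List[str]], new: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Merge `new` into `existing`, then validate the merged set as a whole."""
    merged = {**existing, **new}  # a same-named new stage replaces the old one
    check_outputs_unique(merged)
    return merged


existing = {"prepare": ["data/clean.csv"]}
conflicting = {"export": ["data/clean.csv"]}

# Checked in isolation, the new stage looks fine.
check_outputs_unique(conflicting)

# Checked against the merged set, the conflict with "prepare" surfaces.
try:
    check_new_stages(existing, conflicting)
except DuplicateOutputError as exc:
    print(f"Validation failed: {exc}")
```

The same reasoning applies to cycles and overlapping outputs: each can involve one new stage and one existing stage, so none of them can be detected from the new stages alone.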

Edge construction uses the output trie for efficient matching. For each dependency in each stage, the function queries the trie for both prefixes (outputs that are ancestor directories of the dependency) and subtries (outputs that are descendants of the dependency path). Both relationships create a data flow edge between the stages.
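
The two matching rules described above can be illustrated with path tuples and a linear scan. The sketch below deliberately replaces the trie with a plain loop so the matching logic is visible; a trie answers the same two questions without scanning every output. The names build_edges and is_prefix, and the use of stage names as strings, are assumptions for this example.

```python
from typing import Dict, List, Set, Tuple

PathKey = Tuple[str, ...]  # a filesystem path split into components


def is_prefix(a: PathKey, b: PathKey) -> bool:
    """True if path `a` equals `b` or is an ancestor directory of `b`."""
    return len(a) <= len(b) and b[: len(a)] == a


def build_edges(
    outs: Dict[PathKey, str], deps: Dict[str, List[PathKey]]
) -> Set[Tuple[str, str]]:
    """Return (consumer, producer) edges.

    `outs` maps an output path to the stage that produces it; `deps` maps a
    stage to the paths it depends on. Linear scan stands in for the trie.
    """
    edges: Set[Tuple[str, str]] = set()
    for consumer, dep_paths in deps.items():
        for dep in dep_paths:
            for out, producer in outs.items():
                # Prefix case: the output is the dependency or contains it.
                # Subtrie case: the output lies inside the dependency path.
                if is_prefix(out, dep) or is_prefix(dep, out):
                    edges.add((consumer, producer))
    return edges


outs = {
    ("data", "raw"): "fetch",
    ("data", "features", "train.csv"): "featurize",
}
deps = {
    "featurize": [("data", "raw", "part1.csv")],  # file inside an output dir
    "train": [("data", "features")],  # directory containing an output
}
print(sorted(build_edges(outs, deps)))
```

Here "featurize" depends on a file inside the directory that "fetch" produces (prefix match), while "train" depends on a directory that contains a file produced by "featurize" (subtrie match); both cases yield an edge from consumer to producer.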

Usage

Use build_graph when you need to construct or validate the pipeline DAG programmatically -- for example, when implementing custom pipeline analysis or visualization tools, or when testing stage definitions. Use Repo.check_graph when you are adding new stages to a repository and need to verify that the addition does not break the DAG invariants.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/repo/graph.py (build_graph), dvc/repo/__init__.py (check_graph)
  • Lines: L80-154 (build_graph), L293-300 (check_graph)

Signature

def build_graph(
    stages: list["Stage"],
    outs_trie: Optional["Trie"] = None,
) -> "DiGraph":
    """Generate a graph by using the given stages.

    Nodes are stages. Edges go from a stage to the stage that produces
    its dependency.

    Raises:
        OutputDuplicationError: two outputs with the same path
        StagePathAsOutputError: stage inside an output directory
        OverlappingOutputPathsError: output inside output directory
        CyclicGraphError: resulting graph has cycles
    """
    ...


# In dvc/repo/__init__.py
class Repo:
    def check_graph(
        self,
        stages: Iterable["Stage"],
        callback: Optional[Callable] = None,
    ) -> None:
        ...

Import

from dvc.repo.graph import build_graph

I/O Contract

Inputs

Name | Type | Required | Description
stages | list[Stage] | Yes | The complete list of pipeline stages to include in the graph. Each stage has deps (dependencies) and outs (outputs) attributes that define the data flow relationships.
outs_trie | Optional[Trie] | No | A pre-built trie of output paths keyed on filesystem path components. If not provided, it is constructed internally by calling build_outs_trie(stages). Providing a pre-built trie avoids redundant computation when validating incrementally.
callback | Optional[Callable] | No | An optional callable invoked after the index update in check_graph, typically used to update progress indicators. Only used by Repo.check_graph.

Outputs

Name | Type | Description
(build_graph return) | networkx.DiGraph | A validated directed acyclic graph where nodes are Stage objects and edges represent data flow dependencies. Edges point from a stage to the stage that produces data it depends on.
(check_graph return) | None | Returns None on success. Raises CyclicGraphError, OverlappingOutputPathsError, OutputDuplicationError, or StagePathAsOutputError on validation failure.

Usage Examples

Basic Usage

from dvc.exceptions import DvcException
from dvc.repo import Repo
from dvc.repo.graph import build_graph, get_pipelines

# Open an existing DVC repository
repo = Repo()

# Build the graph from all stages in the repository index
stages = list(repo.index.stages)
graph = build_graph(stages)

# Inspect graph structure
print(f"Stages: {len(graph.nodes)}")
print(f"Dependencies: {len(graph.edges)}")

# Find independent pipelines (weakly connected components)
pipelines = get_pipelines(graph)
print(f"Independent pipelines: {len(pipelines)}")

# Validate new stages before adding them.
# `my_new_stage` is a placeholder for a Stage object created elsewhere.
new_stages = [my_new_stage]
try:
    repo.check_graph(stages=new_stages)
    print("New stage is valid -- no cycles or conflicts.")
except DvcException as e:
    # Assumes the graph validation errors derive from DvcException
    print(f"Validation failed: {e}")
