Implementation:Iterative Dvc Index Graph Build

From Leeroopedia
Revision as of 15:19, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Iterative_Dvc_Index_Graph_Build.md)


Knowledge Sources
Domains Pipeline_Management, Graph_Theory
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool for building directed acyclic graphs from pipeline stage interconnections and decomposing them into independent pipelines, provided by the DVC library.

Description

The build_graph() function in DVC's dvc.repo.graph module constructs a networkx.DiGraph from a list of pipeline stages. It indexes all stage outputs in a pygtrie.Trie keyed by filesystem path components, then queries the trie for each stage dependency to find overlapping outputs. Each match adds a directed edge from the dependent stage to the stage that produces the output. According to its docstring, the function raises CyclicGraphError if the resulting graph contains a cycle, OutputDuplicationError if two outputs share the same path, OverlappingOutputPathsError if an output lies inside another output's directory, and StagePathAsOutputError if a stage path lies inside an output directory.
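
The matching step described above can be illustrated with a small standard-library sketch. It is not DVC's implementation: a linear scan over a dict stands in for the pygtrie.Trie, a dict of sets stands in for the networkx.DiGraph, and the toy stage format, the helper names (parts, overlaps, build_edges, has_cycle), and the use of ValueError are all hypothetical. It assumes that "overlapping" means one path is equal to, or nested inside, the other.

```python
from pathlib import PurePosixPath


def parts(path):
    """Split a path into components, the key form used for prefix lookups."""
    return PurePosixPath(path).parts


def overlaps(a, b):
    """True when one key is a prefix of the other (same path, parent, or child)."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]


def build_edges(stages):
    """stages: {name: {"deps": [...], "outs": [...]}} (hypothetical toy format).

    Returns {stage: set of producer stages}, i.e. edges stage -> producer.
    """
    # Index every output by its path parts; a plain dict stands in for the trie.
    producers = {}
    for name, spec in stages.items():
        for out in spec["outs"]:
            key = parts(out)
            if key in producers:
                raise ValueError(f"duplicate output {out!r}")
            producers[key] = name

    # For every dependency, link the stage to each stage with an overlapping output.
    edges = {name: set() for name in stages}
    for name, spec in stages.items():
        for dep in spec["deps"]:
            dep_key = parts(dep)
            for out_key, producer in producers.items():
                if overlaps(dep_key, out_key):
                    edges[name].add(producer)
    return edges


def has_cycle(edges):
    """Depth-first search with three colours; reaching a grey node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(edges, WHITE)

    def visit(node):
        colour[node] = GREY
        for nxt in edges[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in edges)


stages = {
    "prepare": {"deps": ["raw.csv"], "outs": ["data/clean"]},
    "train": {"deps": ["data/clean/train.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data"], "outs": ["metrics.json"]},
}
edges = build_edges(stages)
print(edges, has_cycle(edges))
```

Here train depends on a file inside prepare's output directory, and evaluate depends on a directory that contains it, so both cases produce an edge to prepare. A trie makes the same prefix queries without scanning every output.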

The Index.graph cached property on the Index class provides the primary access point, lazily constructing the graph from Index.stages and the pre-built Index.outs_trie. The get_pipelines() helper decomposes the graph into a list of weakly connected component subgraphs, each representing an independent pipeline.
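
To make "weakly connected components" concrete, the following plain-Python sketch groups stages that are linked by edges in either direction. It is an illustration only: the function name and the dict-of-sets graph format are hypothetical, and DVC itself operates on networkx graphs rather than on this structure.

```python
def weakly_connected_components(edges):
    """edges: {node: set of successor nodes}. Returns a list of node sets."""
    # Ignore edge direction by recording every edge both ways.
    neighbours = {node: set() for node in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Iterative flood fill from a node that has not been visited yet.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Two unrelated pipelines living in the same repository (toy data).
edges = {
    "prepare": set(),
    "train": {"prepare"},
    "evaluate": {"train"},
    "fetch_docs": set(),
    "build_docs": {"fetch_docs"},
}
pipelines = weakly_connected_components(edges)
print(pipelines)
```

Each resulting set corresponds to one independent pipeline: no stage in one set consumes anything produced by a stage in another.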

Usage

Use build_graph() and Index.graph when you need to:

  • Determine the dependency structure of all stages in a DVC repository.
  • Validate that the pipeline has no cycles or conflicting outputs.
  • Extract independent sub-pipelines for targeted execution.
  • Feed the graph into execution-order planning (plan_repro) for pipeline reproduction.
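
As a rough illustration of the last point, a graph whose edges point from each stage to the producers of its dependencies already has the shape that Python's standard graphlib module expects (node mapped to the nodes that must run before it). The sketch below is not DVC's plan_repro; the execution_order helper and the toy stage names are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(edges):
    """edges: {stage: set of stages it depends on}. Producers come first."""
    return list(TopologicalSorter(edges).static_order())


# Toy graph: evaluate needs train and prepare, train needs prepare.
edges = {
    "evaluate": {"train", "prepare"},
    "train": {"prepare"},
    "prepare": set(),
}
print(execution_order(edges))

# A cyclic graph cannot be ordered; graphlib reports the offending cycle.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

This is also why acyclicity is validated when the graph is built: an execution order only exists for a graph without cycles.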

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/repo/graph.py
  • Lines: L80-154 (build_graph), L42-45 (get_pipelines)
  • File: dvc/repo/index.py
  • Lines: L404-408 (Index.graph cached_property)

Signature

def build_graph(stages: list, outs_trie: Optional["Trie"] = None) -> "DiGraph":
    """Generate a graph by using the given stages.

    Nodes are stages. Edges go from a stage to the stage that produces
    its dependency (stage -> dependency_producer).

    Raises:
        OutputDuplicationError: two outputs with the same path
        StagePathAsOutputError: stage inside an output directory
        OverlappingOutputPathsError: output inside output directory
        CyclicGraphError: resulting graph has cycles
    """
    ...


def get_pipelines(graph: "DiGraph") -> list["DiGraph"]:
    """Return list of weakly connected component subgraphs."""
    ...


class Index:
    @cached_property
    def graph(self) -> "DiGraph":
        ...

    @cached_property
    def outs_trie(self) -> "Trie":
        ...

Import

from dvc.repo.graph import build_graph, get_pipelines
from dvc.repo.index import Index

I/O Contract

Inputs

Name | Type | Required | Description
stages | list[Stage] | Yes | List of all pipeline stages collected from the repository index
outs_trie | Optional[pygtrie.Trie] | No | Pre-built trie of output paths keyed by filesystem path parts; built automatically if not provided

Outputs

Name | Type | Description
graph | networkx.DiGraph | Directed acyclic graph with Stage objects as nodes and data-dependency edges
pipelines | list[networkx.DiGraph] | List of weakly connected component subgraphs, each an independent pipeline (from get_pipelines)

Usage Examples

Basic Usage

from dvc.repo import Repo
from dvc.repo.graph import get_pipelines

# Open the DVC repository
repo = Repo(".")

# Access the dependency graph through the index (preferred, uses caching)
graph = repo.index.graph

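# Inspect graph nodes (stages) and edges (dependencies).
# Edges point from a stage to the stages that produce its dependencies,
# so a stage's dependencies are its successors in the graph.
for stage in graph.nodes():
    print(f"Stage: {stage.addressing}")
    dependencies = list(graph.successors(stage))
    if dependencies:
        print(f"  Depends on: {[s.addressing for s in dependencies]}")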

# Decompose into independent pipelines
pipelines = get_pipelines(graph)
print(f"Found {len(pipelines)} independent pipeline(s)")

for i, pipeline in enumerate(pipelines):
    stages = list(pipeline.nodes())
    print(f"Pipeline {i}: {[s.addressing for s in stages]}")

Direct Graph Construction

from dvc.repo import Repo
from dvc.repo.graph import build_graph
from dvc.repo.trie import build_outs_trie

# Open the DVC repository
repo = Repo(".")

# Build from a specific set of stages
stages = list(repo.index.stages)
outs_trie = build_outs_trie(stages)
graph = build_graph(stages, outs_trie=outs_trie)

# Check number of edges (data flow connections)
print(f"Stages: {graph.number_of_nodes()}, Dependencies: {graph.number_of_edges()}")

Related Pages

Implements Principle
