Principle:Iterative Dvc Dependency Graph Construction
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Management, Graph_Theory |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Dependency graph construction is the process of building a directed acyclic graph (DAG) from pipeline stages where edges represent data flow relationships between a stage's outputs and another stage's inputs.
Description
In any multi-stage data pipeline, stages are not independent -- one stage's outputs serve as inputs (dependencies) to other stages. Representing these interconnections as a directed graph is fundamental to pipeline management because it enables automated reasoning about execution order, change propagation, parallelization opportunities, and cycle detection.
The dependency graph construction principle addresses how to derive a directed acyclic graph (DAG) from a set of pipeline stages. Each stage becomes a node in the graph. An edge from stage A to stage B means "A depends on an output of B," encoding the data flow direction. The challenge lies in efficiently matching each dependency's path against the set of all outputs across all stages, since a dependency might match an exact output path, a parent directory of an output, or a subdirectory within an output.
A trie (prefix tree) data structure is used to solve this matching problem efficiently. All output paths are inserted into a trie keyed by path components. For each dependency, the trie is queried for both prefix matches (the dependency is inside an output directory) and subtrie matches (the dependency is a parent of an output), allowing O(k) lookups where k is the path depth rather than O(n) comparisons against all outputs.
After edges are established, the graph must be validated to ensure it is acyclic. A cycle would mean stage A depends on B which depends on A, creating an impossible execution order. Additional validation checks for duplicate outputs (two stages producing the same path) and stages located inside output directories (which would cause filesystem conflicts).
Usage
Dependency graph construction is needed whenever:
- Execution ordering must be determined -- the graph is the prerequisite for topological sorting to find valid execution sequences.
- Change propagation analysis is required -- when a stage changes, the graph reveals which downstream stages are affected.
- Pipeline visualization is desired -- the graph provides a complete picture of data flow relationships.
- Validation of pipeline integrity is necessary -- cycle detection and output overlap checking prevent invalid pipeline configurations.
- Pipeline decomposition -- weakly connected components of the graph identify independent sub-pipelines that can be managed and executed separately.
Theoretical Basis
The graph construction algorithm operates in three phases:
PROCEDURE BuildDependencyGraph(stages):
// Phase 1: Build output trie for efficient path matching
outs_trie = NEW Trie()
FOR EACH stage IN stages:
FOR EACH output IN stage.outs:
path_parts = SPLIT_PATH(output.fs_path)
IF outs_trie HAS path_parts:
RAISE OutputDuplicationError
IF outs_trie HAS_SUBTRIE(path_parts) OR outs_trie HAS_PREFIX(path_parts):
RAISE OverlappingOutputPathsError
outs_trie.INSERT(path_parts, output)
// Phase 2: Validate stage paths are not inside outputs
FOR EACH stage IN stages:
prefix = outs_trie.SHORTEST_PREFIX(SPLIT_PATH(stage.path))
IF prefix EXISTS:
RAISE StagePathAsOutputError
// Phase 3: Build graph edges from dependency-output matches
graph = NEW DiGraph()
graph.ADD_NODES(stages)
FOR EACH stage IN stages:
IF stage IS repo_import OR stage IS db_import:
CONTINUE // skip external imports
FOR EACH dep IN stage.deps:
IF dep IS DatasetDependency:
CONTINUE // skip dataset deps
dep_key = SPLIT_PATH(dep.fs_path)
// Find outputs that overlap with this dependency
overlapping = []
// Case 1: dep is inside an output directory
overlapping.APPEND(ALL_PREFIX_VALUES(outs_trie, dep_key))
// Case 2: dep is a parent dir of some outputs
IF outs_trie HAS_SUBTRIE(dep_key):
overlapping.APPEND(ALL_SUBTRIE_VALUES(outs_trie, dep_key))
FOR EACH matched_output IN overlapping:
graph.ADD_EDGE(stage, matched_output.stage)
// Phase 4: Validate acyclicity
IF graph HAS_CYCLE:
RAISE CyclicGraphError
RETURN graph
The graph also supports decomposition into independent pipelines via weakly connected components. Two stages are in the same pipeline if and only if there exists an undirected path between them:
PROCEDURE GetPipelines(graph):
RETURN [graph.SUBGRAPH(component)
FOR component IN WEAKLY_CONNECTED_COMPONENTS(graph)]
Key theoretical properties:
- Trie-based matching achieves O(k) per dependency lookup where k is the maximum path depth, compared to O(n) for linear scanning.
- DAG guarantee -- the acyclicity check ensures a topological ordering always exists.
- Completeness -- both prefix and subtrie matching ensures all data flow relationships are captured, even when dependencies point to directories containing outputs or vice versa.