Principle:Iterative Dvc Pipeline Definition Loading
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Management, Configuration_Parsing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Pipeline definition loading is the process of turning declarative configuration files into fully executable stage objects by resolving variable interpolations, expanding templates, and merging in lockfile state.
Description
In data pipeline systems, the separation between declarative specification and runtime execution is fundamental. Users define pipelines in human-readable YAML configuration files (such as dvc.yaml), specifying stages with commands, dependencies, outputs, parameters, and working directories. These definitions may contain templating constructs -- variable interpolations (e.g., ${item}), foreach loops that generate multiple stages from a single template, and matrix definitions that produce combinatorial stage expansions.
The pipeline definition loading principle addresses the challenge of transforming these static, potentially parameterized YAML declarations into concrete, executable stage objects. This involves multiple resolution phases: first, global and stage-level variables are loaded from external parameter files (e.g., params.yaml) and merged into a resolution context; second, interpolated strings within stage definitions are resolved against this context; third, foreach and matrix constructs are expanded into individual stage entries; and finally, the resolved stage data is merged with lockfile state (dvc.lock) to hydrate checksums for dependencies and outputs.
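The first two phases can be illustrated with a minimal sketch. It assumes a simplified `${dotted.path}` syntax and plain dictionaries for the variable sources; the helper names `lookup` and `interpolate` are hypothetical and are not taken from any particular implementation.

```python
import re
from typing import Any, Mapping

# Matches ${dotted.path}; this pattern is an illustrative simplification.
_PATTERN = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def lookup(context: Mapping[str, Any], dotted: str) -> Any:
    """Walk a dotted path such as 'train.lr' through nested mappings."""
    node: Any = context
    for part in dotted.split("."):
        node = node[part]  # a KeyError here signals an unresolved variable
    return node


def interpolate(value: Any, context: Mapping[str, Any]) -> Any:
    """Recursively resolve ${...} references in strings, lists, and dicts."""
    if isinstance(value, str):
        whole = _PATTERN.fullmatch(value)
        if whole:  # a lone reference keeps the type of the looked-up value
            return lookup(context, whole.group(1))
        return _PATTERN.sub(lambda m: str(lookup(context, m.group(1))), value)
    if isinstance(value, list):
        return [interpolate(v, context) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, context) for k, v in value.items()}
    return value


# Phase 1: merge variable sources. Later sources win in this sketch;
# a real system may instead reject conflicting keys.
params_file = {"train": {"lr": 0.01, "epochs": 5}}
stage_vars = {"model": "cnn"}
context = {**params_file, **stage_vars}

# Phase 2: resolve the stage definition against the merged context.
stage_def = {
    "cmd": "python train.py --model ${model} --lr ${train.lr}",
    "outs": ["models/${model}.pkl"],
}
resolved = interpolate(stage_def, context)
```

Returning the raw value for a string that consists of a single reference is a design choice in this sketch: it lets numeric or list parameters pass through without being stringified.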
This multi-phase resolution enables a powerful declarative DSL while maintaining reproducibility guarantees. The lockfile merge step is critical: it fills in the exact hash values of dependencies and outputs recorded from the last successful execution, allowing downstream processes to detect whether a stage has changed since its last run.
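As a rough illustration of how hydrated lock data supports change detection, the sketch below compares freshly computed hashes with the values recorded in a lock entry. The lock layout (a `cmd` string plus a `deps` list of `path`/`md5` records) is an assumption modelled loosely on a typical lockfile, and both function names are hypothetical.

```python
from typing import Dict, List, Optional


def changed_entries(current: Dict[str, str], locked: List[dict]) -> List[str]:
    """Return paths whose current hash differs from, or is absent in, the lock."""
    locked_by_path = {entry["path"]: entry.get("md5") for entry in locked}
    return sorted(
        path
        for path, digest in current.items()
        if locked_by_path.get(path) != digest
    )


def stage_changed(
    cmd: str, dep_hashes: Dict[str, str], lock_entry: Optional[dict]
) -> bool:
    """A stage counts as changed if it was never locked, or its command or deps differ."""
    if lock_entry is None:
        return True  # no record of a prior successful run
    if lock_entry.get("cmd") != cmd:
        return True  # the command itself was edited
    return bool(changed_entries(dep_hashes, lock_entry.get("deps", [])))


lock = {
    "cmd": "python train.py",
    "deps": [{"path": "data.csv", "md5": "abc"}],
}
unchanged = stage_changed("python train.py", {"data.csv": "abc"}, lock)
modified = stage_changed("python train.py", {"data.csv": "xyz"}, lock)
```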
Usage
Pipeline definition loading should be employed whenever a pipeline system needs to support:
- Parameterized pipelines -- where the same stage template is instantiated with different parameter values.
- Foreach/matrix expansion -- where a single declaration generates multiple stages over an iterable or a Cartesian product of variable axes.
- Lazy stage resolution -- where individual stages are resolved on-demand rather than eagerly loading the entire pipeline, improving performance for large pipelines.
- Lockfile-based reproducibility -- where the exact state of a prior execution is recorded and used for change detection in subsequent runs.
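The foreach and matrix cases can be sketched as follows. The `name@key` naming, the `-` separator for matrix combinations, and the `item` context key are assumptions made for illustration; `expand_foreach` and `expand_matrix` are hypothetical helpers.

```python
from itertools import product
from typing import Any, Dict, Iterable, List


def expand_foreach(name: str, items: Iterable[Any]) -> Dict[str, Dict[str, Any]]:
    """Map each generated stage name to the per-item context it resolves with."""
    return {f"{name}@{item}": {"item": item} for item in items}


def expand_matrix(name: str, axes: Dict[str, List[Any]]) -> Dict[str, Dict[str, Any]]:
    """Generate one entry per combination in the Cartesian product of the axes."""
    keys = list(axes)
    expanded: Dict[str, Dict[str, Any]] = {}
    for combo in product(*(axes[k] for k in keys)):
        suffix = "-".join(str(v) for v in combo)
        expanded[f"{name}@{suffix}"] = {"item": dict(zip(keys, combo))}
    return expanded


foreach_stages = expand_foreach("clean", ["us", "uk"])
matrix_stages = expand_matrix("train", {"model": ["cnn", "rnn"], "lr": [0.1, 0.01]})
```

With two values on each of two axes, the matrix yields four stages, one per combination, while the foreach yields one stage per list item.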
The design trigger for this principle is the presence of a declarative configuration layer that must be translated into runtime objects, particularly when the configuration supports templating, parameterization, or references to external variable sources.
Theoretical Basis
The core algorithm for pipeline definition loading follows a multi-phase resolution pattern:
PROCEDURE LoadPipelineDefinitions(config_file, lockfile):
    raw_data = YAML_PARSE(config_file)
    global_vars = LOAD_VARS(raw_data["vars"], default="params.yaml")
    context = CREATE_CONTEXT(global_vars)
    definitions = {}
    FOR EACH (name, stage_def) IN raw_data["stages"]:
        IF stage_def HAS "foreach":
            definitions[name] = ForeachDefinition(context, stage_def)
        ELSE IF stage_def HAS "matrix":
            definitions[name] = MatrixDefinition(context, stage_def)
        ELSE:
            definitions[name] = EntryDefinition(context, stage_def)
    RETURN definitions

PROCEDURE ResolveOne(definitions, name):
    // "group@key" addresses a generated stage; a plain name has no key
    group, key = SPLIT(name, "@")
    definition = definitions[group]
    IF definition IS EntryDefinition:
        resolved = INTERPOLATE(definition.template, definition.context)
    ELSE:
        // ForeachDefinition or MatrixDefinition
        item_context = definition.get_item_context(key)
        resolved = INTERPOLATE(definition.template, MERGE(definition.context, item_context))
    RETURN resolved

PROCEDURE HydrateStage(dvcfile, name, resolved_data, lock_data):
    stage = CREATE_PIPELINE_STAGE(dvcfile, name, resolved_data)
    stage.deps = LOAD_DEPENDENCIES(resolved_data)
    stage.outs = LOAD_OUTPUTS(resolved_data)
    IF lock_data EXISTS:
        stage.cmd_changed = (lock_data["cmd"] != stage.cmd)
        FILL_CHECKSUMS(stage.deps, lock_data["deps"])
        FILL_CHECKSUMS(stage.outs, lock_data["outs"])
        FILL_PARAM_VALUES(stage.params, lock_data["params"])
    RETURN stage
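A compact Python rendering of the loading and on-demand resolution procedures is given below as a sketch only. It uses the standard library's `string.Template` as a stand-in for the interpolation engine, handles only string-valued fields, omits the matrix case for brevity, and assumes a `do` key that holds the foreach template; none of these choices is claimed to match a specific implementation.

```python
from string import Template
from typing import Any, Dict, Optional


def _render(template: Dict[str, str], context: Dict[str, Any]) -> Dict[str, str]:
    """Substitute ${name} references in every string field of the template."""
    return {k: Template(v).substitute(context) for k, v in template.items()}


class EntryDefinition:
    """Holds an unresolved stage; nothing is interpolated until resolve()."""

    def __init__(self, context: Dict[str, Any], template: Dict[str, Any]):
        self.context = context
        self.template = template

    def resolve(self, key: Optional[str] = None) -> Dict[str, str]:
        return _render(self.template, self.context)


class ForeachDefinition(EntryDefinition):
    def resolve(self, key: Optional[str] = None) -> Dict[str, str]:
        items = {str(i): i for i in self.template["foreach"]}
        if key not in items:
            raise KeyError(f"no generated stage for key {key!r}")
        # Merge into a fresh dict so the shared context is never mutated.
        item_context = {**self.context, "item": items[key]}
        return _render(self.template["do"], item_context)


def load_definitions(raw: Dict[str, Any], context: Dict[str, Any]):
    """Wrap each raw stage in a typed definition without resolving it."""
    definitions = {}
    for name, stage_def in raw["stages"].items():
        cls = ForeachDefinition if "foreach" in stage_def else EntryDefinition
        definitions[name] = cls(context, stage_def)
    return definitions


def resolve_one(definitions, name: str) -> Dict[str, str]:
    """Resolve 'group' or 'group@key' on demand."""
    group, _, key = name.partition("@")
    return definitions[group].resolve(key or None)


raw = {
    "stages": {
        "prepare": {"cmd": "python prep.py --seed ${seed}"},
        "clean": {
            "foreach": ["us", "uk"],
            "do": {"cmd": "python clean.py ${item} --seed ${seed}"},
        },
    }
}
defs = load_definitions(raw, {"seed": 7})
clean_uk = resolve_one(defs, "clean@uk")
```

Only the requested stage is rendered here: resolving `clean@uk` never touches `prepare` or `clean@us`.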
Key theoretical properties of this approach:
- Lazy evaluation: Definitions are wrapped in typed objects (EntryDefinition, ForeachDefinition, MatrixDefinition) and resolved only when accessed, avoiding unnecessary computation for large pipelines.
- Context isolation: Each stage resolution operates with a cloned or temporarily modified context, preventing side effects between stage resolutions.
- Deterministic interpolation: Interpolation is a pure function of its inputs -- the same context and template always produce the same resolved output.
- Syntax error short-circuiting: For foreach/matrix templates, syntax errors are checked once on the template rather than for each generated stage, providing O(1) instead of O(n) validation cost.
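Two of these properties, context isolation and one-time syntax checking, can be sketched together. The helpers `set_temporarily`, `check_syntax`, and `resolve_group` are hypothetical, and the syntax check is a deliberately crude stand-in (it only verifies that every `${` opens a well-formed reference).

```python
import re
from contextlib import contextmanager

_REF = re.compile(r"\$\{[^{}]+\}")


@contextmanager
def set_temporarily(context: dict, **overrides):
    """Apply overrides for the duration of the block, then restore the context."""
    missing = object()
    saved = {k: context.get(k, missing) for k in overrides}
    context.update(overrides)
    try:
        yield context
    finally:
        for k, old in saved.items():
            if old is missing:
                context.pop(k, None)  # the key did not exist before
            else:
                context[k] = old


def check_syntax(template: str) -> None:
    """Reject templates in which some '${' does not open a complete reference."""
    if template.count("${") != len(_REF.findall(template)):
        raise ValueError(f"malformed interpolation in {template!r}")


def resolve_group(template: str, items, context: dict) -> list:
    check_syntax(template)  # validated once, not once per generated stage
    results = []
    for item in items:
        with set_temporarily(context, item=item) as ctx:
            results.append(
                _REF.sub(lambda m: str(ctx[m.group(0)[2:-1]]), template)
            )
    return results


shared = {"seed": 7}
rendered = resolve_group("run ${item} --seed ${seed}", ["a", "b"], shared)
```

After the call, `shared` contains only its original `seed` key, since the `item` override is removed when each `with` block exits.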