Principle:Iterative Dvc Pipeline Definition Loading

Knowledge Sources	DVC Documentation
Domains	Pipeline_Management, Configuration_Parsing
Last Updated	2026-02-10 00:00 GMT

Overview

Pipeline definition loading is the process of turning declarative configuration files into fully executable stage objects by resolving variable interpolations, expanding stage templates, and merging in lockfile state.

Description

In data pipeline systems, the separation between declarative specification and runtime execution is fundamental. Users define pipelines in human-readable YAML configuration files (such as dvc.yaml), specifying stages with commands, dependencies, outputs, parameters, and working directories. These definitions may contain templating constructs: variable interpolations (e.g., ${item}), foreach loops that generate multiple stages from a single template, and matrix definitions that produce combinatorial stage expansions.

The pipeline definition loading principle addresses the challenge of transforming these static, potentially parameterized YAML declarations into concrete, executable stage objects. This involves multiple resolution phases: first, global and stage-level variables are loaded from external parameter files (e.g., params.yaml) and merged into a resolution context; second, interpolated strings within stage definitions are resolved against this context; third, foreach and matrix constructs are expanded into individual stage entries; and finally, the resolved stage data is merged with lockfile state (dvc.lock) to hydrate checksums for dependencies and outputs.
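The interpolation phase described above can be illustrated with a minimal sketch. It assumes a simplified `${dotted.key}` syntax and a plain nested dictionary as the resolution context. The names `lookup` and `interpolate`, and the sample values standing in for params.yaml, are hypothetical and are not DVC's actual API.

```python
import re
from typing import Any, Mapping

# Matches ${dotted.key}; a simplification of real templating syntax.
_PATTERN = re.compile(r"\$\{\s*([\w.]+)\s*\}")


def lookup(context: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping using a dotted key such as 'train.lr'."""
    value: Any = context
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError for unknown variables
    return value


def interpolate(value: Any, context: Mapping[str, Any]) -> Any:
    """Recursively resolve ${...} references in strings, lists, and dicts."""
    if isinstance(value, str):
        return _PATTERN.sub(lambda m: str(lookup(context, m.group(1))), value)
    if isinstance(value, list):
        return [interpolate(v, context) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, context) for k, v in value.items()}
    return value


# Hypothetical values standing in for the contents of params.yaml.
context = {"train": {"lr": 0.01, "epochs": 5}, "data_dir": "data"}
stage_def = {
    "cmd": "python train.py --lr ${train.lr} --epochs ${train.epochs}",
    "deps": ["${data_dir}/train.csv"],
}
resolved = interpolate(stage_def, context)
print(resolved["cmd"])
```

Because the function returns new structures instead of mutating `stage_def`, the same template can be resolved again against a different context.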

This multi-phase resolution enables a powerful declarative DSL while maintaining reproducibility guarantees. The lockfile merge step is critical: it fills in the exact hash values of dependencies and outputs recorded from the last successful execution, allowing downstream processes to detect whether a stage has changed since its last run.
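As a rough illustration of the lockfile merge and the change detection it enables, the sketch below copies recorded hashes onto dependency entries and compares them with current hashes. The `Entry` class, the `stage_changed` function, the `path`/`md5` field names, and the hash values are assumptions made for this example, not the real lockfile schema or DVC's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    """A dependency or output; `md5` stays None until hydrated."""
    path: str
    md5: Optional[str] = None


def fill_checksums(entries: List[Entry], locked: List[Dict[str, str]]) -> None:
    """Copy recorded hashes from lockfile entries onto matching paths."""
    by_path = {item["path"]: item for item in locked}
    for entry in entries:
        if entry.path in by_path:
            entry.md5 = by_path[entry.path].get("md5")


def stage_changed(cmd: str, deps: List[Entry], lock: Optional[dict],
                  current_hashes: Dict[str, str]) -> bool:
    """Changed if there is no lock entry, the command differs,
    or any dependency hash differs from the recorded one."""
    if lock is None or lock.get("cmd") != cmd:
        return True
    fill_checksums(deps, lock.get("deps", []))
    return any(d.md5 is None or d.md5 != current_hashes.get(d.path) for d in deps)


# Hypothetical lock data and hashes, for illustration only.
lock = {"cmd": "python train.py", "deps": [{"path": "data.csv", "md5": "abc123"}]}
unchanged = stage_changed("python train.py", [Entry("data.csv")], lock,
                          {"data.csv": "abc123"})
modified = stage_changed("python train.py", [Entry("data.csv")], lock,
                         {"data.csv": "def456"})
```

A stage with no lock entry is treated as changed, which matches the intuition that a stage that has never run must be executed.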

Usage

Pipeline definition loading should be employed whenever a pipeline system needs to support:

Parameterized pipelines -- where the same stage template is instantiated with different parameter values.
Foreach/matrix expansion -- where a single declaration generates multiple stages over an iterable or a Cartesian product of variable axes.
Lazy stage resolution -- where individual stages are resolved on demand instead of the entire pipeline being loaded eagerly, which improves performance for large pipelines.
Lockfile-based reproducibility -- where the exact state of a prior execution is recorded and used for change detection in subsequent runs.
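The foreach and matrix expansions listed above can be sketched as follows. The `name@key` naming mirrors the "@" separator used in the pseudocode in this page; joining matrix values with "-" and the shape of the per-stage context (`item`, `key`) are assumptions made for this example.

```python
from itertools import product
from typing import Any, Dict, List, Union


def expand_foreach(name: str, items: Union[List[Any], Dict[str, Any]]) -> Dict[str, dict]:
    """Map each generated stage name to the context used to resolve it."""
    if isinstance(items, dict):
        return {f"{name}@{k}": {"key": k, "item": v} for k, v in items.items()}
    return {f"{name}@{v}": {"item": v} for v in items}


def expand_matrix(name: str, axes: Dict[str, List[Any]]) -> Dict[str, dict]:
    """Generate one stage per element of the Cartesian product of the axes."""
    keys = list(axes)
    stages = {}
    for combo in product(*(axes[k] for k in keys)):
        suffix = "-".join(str(v) for v in combo)  # assumed naming scheme
        stages[f"{name}@{suffix}"] = {"item": dict(zip(keys, combo))}
    return stages


# Two axes with two values each yield four stages.
stages = expand_matrix("train", {"model": ["cnn", "rnn"], "lr": [0.1, 0.01]})
for stage_name, ctx in stages.items():
    print(stage_name, ctx["item"])
```

The number of generated stages is the product of the axis lengths, so wide matrices grow quickly. This is one reason resolving stages on demand matters.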

The design trigger for this principle is the presence of a declarative configuration layer that must be translated into runtime objects, particularly when the configuration supports templating, parameterization, or references to external variable sources.

Theoretical Basis

The core algorithm for pipeline definition loading follows a multi-phase resolution pattern:

PROCEDURE LoadPipelineDefinitions(config_file, lockfile):
    raw_data = YAML_PARSE(config_file)
    global_vars = LOAD_VARS(raw_data["vars"], default="params.yaml")
    context = CREATE_CONTEXT(global_vars)

    definitions = {}
    FOR EACH (name, stage_def) IN raw_data["stages"]:
        IF stage_def HAS "foreach":
            definitions[name] = ForeachDefinition(context, stage_def)
        ELSE IF stage_def HAS "matrix":
            definitions[name] = MatrixDefinition(context, stage_def)
        ELSE:
            definitions[name] = EntryDefinition(context, stage_def)

    RETURN definitions

PROCEDURE ResolveOne(definitions, name):
    // e.g. "train@cnn" splits into group "train" and key "cnn";
    // a plain stage name has no "@" and yields an empty key
    group, key = SPLIT(name, "@")
    definition = definitions[group]

    IF definition IS EntryDefinition:
        resolved = INTERPOLATE(definition.template, definition.context)
    ELSE:
        // ForeachDefinition or MatrixDefinition
        item_context = definition.get_item_context(key)
        resolved = INTERPOLATE(definition.template, MERGE(definition.context, item_context))

    RETURN resolved

PROCEDURE HydrateStage(dvcfile, name, resolved_data, lock_data):
    stage = CREATE_PIPELINE_STAGE(dvcfile, name, resolved_data)
    stage.deps = LOAD_DEPENDENCIES(resolved_data)
    stage.outs = LOAD_OUTPUTS(resolved_data)

    IF lock_data EXISTS:
        stage.cmd_changed = (lock_data["cmd"] != stage.cmd)
        FILL_CHECKSUMS(stage.deps, lock_data["deps"])
        FILL_CHECKSUMS(stage.outs, lock_data["outs"])
        FILL_PARAM_VALUES(stage.params, lock_data["params"])

    RETURN stage

Key theoretical properties of this approach:

Lazy evaluation: Definitions are wrapped in typed objects (EntryDefinition, ForeachDefinition, MatrixDefinition) and resolved only when accessed, avoiding unnecessary computation for large pipelines.
Context isolation: Each stage resolution operates with a cloned or temporarily modified context, preventing side effects between stage resolutions.
Deterministic interpolation: Interpolation is a pure function of its inputs -- the same context and template always produce the same resolved output.
Syntax error short-circuiting: For foreach/matrix templates, syntax errors are checked once on the template rather than once per generated stage, so validation costs O(1) instead of O(n) in the number n of generated stages.
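The properties above can be seen together in a small sketch of a lazily resolved foreach definition. This `ForeachDefinition` class is a simplified stand-in written for this example, not the class from the DVC code base. The regex-based syntax check, the `resolve_calls` counter, and the `resolve` helper are assumptions added for illustration.

```python
import re
from typing import Any, Dict, List

_REF = re.compile(r"\$\{([^}]*)\}")


class ForeachDefinition:
    """Holds an unresolved template; stages are resolved only on request."""

    def __init__(self, name: str, context: Dict[str, Any],
                 template: str, items: List[Any]) -> None:
        self.name = name
        self.context = context
        self.template = template
        self.items = {str(item): item for item in items}
        self.resolve_calls = 0
        self._check_syntax()  # validated once, not once per generated stage

    def _check_syntax(self) -> None:
        for ref in _REF.findall(self.template):
            if not re.fullmatch(r"\w+", ref.strip()):
                raise ValueError(f"bad interpolation in {self.name!r}: ${{{ref}}}")

    def resolve_one(self, key: str) -> str:
        self.resolve_calls += 1
        # Build a per-stage context so the shared context is never mutated.
        local = {**self.context, "item": self.items[key]}
        return _REF.sub(lambda m: str(local[m.group(1).strip()]), self.template)


def resolve(definitions: Dict[str, ForeachDefinition], stage_name: str) -> str:
    """Split 'group@key' and resolve only the requested stage."""
    group, _, key = stage_name.partition("@")
    return definitions[group].resolve_one(key)


shared = {"out_dir": "clean"}
definitions = {
    "clean": ForeachDefinition(
        "clean", shared, "python clean.py ${item} -o ${out_dir}/${item}", ["us", "uk"]
    )
}
cmd = resolve(definitions, "clean@uk")
print(cmd)
```

Only the requested stage is interpolated, and `shared` is left untouched afterwards because each resolution works on its own merged copy.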

Related Pages

Implemented By

Implementation:Iterative_Dvc_DataResolver_Resolve_One
