
Implementation:Iterative Dvc ProjectFile Dump

From Leeroopedia


Knowledge Sources
Domains Pipeline_Management, Reproducibility
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the DVC library, for persisting pipeline execution state to lockfiles and run-caches after successful stage execution.

Description

The ProjectFile.dump() method in DVC's dvc.dvcfile module is the primary entry point for persisting a pipeline stage's state after execution. It coordinates two operations: optionally updating the dvc.yaml pipeline file with the stage definition, and updating the dvc.lock lockfile with the stage's current checksums.

For pipeline file updates, _dump_pipeline_file() serializes the stage using serialize.to_pipeline_file() and applies the result to the existing dvc.yaml using apply_diff(), which preserves existing YAML structure, comments, and ordering. If the stage entry already exists, it is updated in place; otherwise, a new entry is added. Parametrized stages (those generated from foreach/matrix templates) cannot be dumped, and attempting to do so raises ParametrizedDumpError.
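The diff-apply idea can be illustrated with a minimal sketch. This is not DVC's actual apply_diff() (which operates on ruamel.yaml round-trip objects so that comments and ordering survive); it only shows the in-place update semantics: existing entries are updated where they are, absent keys are dropped, and new keys are added.

```python
def apply_diff(src, dest):
    """Recursively sync ``dest`` to match ``src``, mutating ``dest`` in
    place so the surrounding container structure is preserved.
    Illustrative sketch only, not DVC's implementation."""
    if isinstance(src, dict) and isinstance(dest, dict):
        for key, value in src.items():
            if key in dest and isinstance(value, (dict, list)):
                apply_diff(value, dest[key])  # update nested entry in place
            else:
                dest[key] = value  # add new key or overwrite scalar
        for key in [k for k in dest if k not in src]:
            del dest[key]  # drop keys removed from the new data
    elif isinstance(src, list) and isinstance(dest, list):
        dest[:] = src  # replace list contents wholesale (simplified)

# Updating an existing stage entry in place:
pipeline = {"stages": {"train": {"cmd": "python train.py",
                                 "deps": ["old.csv"],
                                 "outs": ["model.pkl"]}}}
new_data = {"stages": {"train": {"cmd": "python train.py --seed 1",
                                 "deps": ["data.csv"],
                                 "outs": ["model.pkl"]}}}
apply_diff(new_data, pipeline)
```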

For lockfile updates, the method delegates to Lockfile.dump_stages(), which serializes each stage using serialize.to_lockfile() and merges the result into the existing dvc.lock file using modify_yaml(). The lockfile uses a "2.0" schema format and is created automatically if it does not exist. Only truly modified stage entries trigger a write and Git tracking update, avoiding unnecessary file system operations.
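The merge-and-write-only-if-changed behavior can be sketched as follows. Names and structure here are illustrative, not DVC's code; the point is that an identical re-dump of a stage reports no modification, so the file write and Git-tracking update can be skipped.

```python
def dump_stages(lock_data, stage_entries):
    """Merge serialized stage entries into in-memory lockfile data.
    Returns True only when an entry actually changed, so unchanged
    stages trigger no write. Illustrative sketch, not DVC's code."""
    lock_data.setdefault("schema", "2.0")  # lockfile schema format
    stages = lock_data.setdefault("stages", {})
    modified = False
    for name, entry in stage_entries.items():
        if stages.get(name) != entry:
            stages[name] = entry
            modified = True
    return modified

lock = {}
entry = {"train": {"cmd": "python train.py", "deps": [], "outs": []}}
changed_first = dump_stages(lock, entry)   # creates the entry
changed_again = dump_stages(lock, entry)   # identical re-dump
```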

The StageCache.save() method in dvc/stage/cache.py provides the complementary run-cache persistence. It first checks if the stage is cacheable (has command, dependencies, and outputs; is not a callback or always-changed stage). The cache key is computed from a SHA-256 hash of the stage's dependency state (command + dependency checksums + output paths), and the cache value from a hash of the complete lockfile data. The entry is written to a directory structure <cache_dir>/runs/<key[:2]>/<key>/<value> as a YAML file after schema validation. Before writing, any outputs that have use_cache=False are committed to the DVC cache using copy links to ensure they are preserved.
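The key/value scheme above can be sketched like this. SHA-256 over sorted-key JSON stands in for DVC's own dict-hashing helpers, and the field selection is a simplification; only the directory layout `<cache_dir>/runs/<key[:2]>/<key>/<value>` is taken directly from the description above.

```python
import hashlib
import json
import os

def _sha256_of(obj):
    # Stable digest of a JSON-serializable structure (illustrative;
    # DVC uses its own dict-hashing utilities).
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def run_cache_path(cache_dir, lock_entry):
    # Cache key: hash of the stage's dependency state
    # (command + dependency checksums + output paths).
    key = _sha256_of({
        "cmd": lock_entry["cmd"],
        "deps": lock_entry.get("deps", []),
        "outs": [out["path"] for out in lock_entry.get("outs", [])],
    })
    # Cache value: hash of the complete lockfile data for the stage.
    value = _sha256_of(lock_entry)
    return os.path.join(cache_dir, "runs", key[:2], key, value)

entry = {
    "cmd": "python train.py",
    "deps": [{"path": "data.csv", "md5": "11ff22aa"}],
    "outs": [{"path": "model.pkl", "md5": "33aa44bb"}],
}
path = run_cache_path(os.path.join(".dvc", "cache"), entry)
```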

Usage

Use ProjectFile.dump() and StageCache.save() when you need to:

  • Persist stage execution results to the lockfile after successful reproduction.
  • Update the pipeline file (dvc.yaml) with modified stage definitions (e.g., after dvc run).
  • Save computation results to the run-cache for future deduplication.
  • Track lockfile and pipeline file changes in Git via the SCM context.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/dvcfile.py
  • Lines: L239-261 (ProjectFile.dump and dump_stages), L283-284 (_dump_lockfile), L291-315 (_dump_pipeline_file)
  • File: dvc/dvcfile.py
  • Lines: L427-453 (Lockfile.dump_stages)
  • File: dvc/stage/cache.py
  • Lines: L157-190 (StageCache.save)

Signature

class ProjectFile(FileMixin):
    def dump(
        self,
        stage: "Stage",
        update_pipeline: bool = True,
        update_lock: bool = True,
        **kwargs,
    ) -> None:
        ...

    def dump_stages(
        self,
        stages: list,
        update_pipeline: bool = True,
        update_lock: bool = True,
        **kwargs,
    ) -> None:
        ...


class Lockfile(FileMixin):
    def dump_stages(self, stages: list, **kwargs) -> None:
        ...

    def dump(self, stage, **kwargs) -> None:
        ...


class StageCache:
    def __init__(self, repo) -> None:
        ...

    def save(self, stage) -> None:
        ...

Import

from dvc.dvcfile import ProjectFile, Lockfile
from dvc.stage.cache import StageCache

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| stage | Stage (PipelineStage) | Yes | The executed stage with fresh hash_info on all deps and outs |
| update_pipeline | bool | No | If True (default), update dvc.yaml with the stage definition |
| update_lock | bool | No | If True (default), update dvc.lock with current checksums |
| kwargs | dict | No | Additional keyword arguments passed to lockfile serialization |

Outputs

| Name | Type | Description |
|------|------|-------------|
| dvc.lock | file (side effect) | Updated lockfile containing cmd, deps checksums, outs checksums, and params values for the stage |
| dvc.yaml | file (side effect) | Updated pipeline file with stage definition (if update_pipeline=True) |
| run-cache entry | file (side effect) | YAML file at <cache>/runs/<key[:2]>/<key>/<value> containing cached lockfile data (from StageCache.save) |
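For orientation, a dvc.lock entry produced by these calls has roughly the following shape (stage name, paths, checksums, and sizes here are illustrative):

```yaml
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data.csv
      md5: 1a2b3c4d5e6f
      size: 10485760
    params:
      params.yaml:
        lr: 0.01
    outs:
    - path: model.pkl
      md5: 6f5e4d3c2b1a
      size: 524288
```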

Usage Examples

Basic Usage

from dvc.repo import Repo

repo = Repo(".")

# After reproducing a stage, dump its state
stage = repo.stage.collect("train")[0]

# Dump to both dvc.yaml and dvc.lock
stage.dump(update_pipeline=True, update_lock=True)

# Or dump only lockfile (common during `dvc repro`)
stage.dump(update_pipeline=False, update_lock=True)

Direct Lockfile Update

from dvc.dvcfile import ProjectFile

# Access the project file and dump a stage
project_file = stage.dvcfile
project_file.dump(stage, update_pipeline=False, update_lock=True)
# This updates dvc.lock with the stage's current checksums
# and tracks the lockfile in Git

Run-Cache Save

# The run-cache is saved automatically during Stage.save()
# but can also be invoked directly:
repo.stage_cache.save(stage)
# Computes cache_key from deps + cmd, cache_value from complete state
# Writes to: .dvc/cache/runs/<key[:2]>/<key>/<value>

Reproduction Flow (How Dump Is Called)

# During dvc repro, the flow is:
# 1. stage.run() is called, which internally calls:
#    - stage.save() -> saves deps, outs, md5, and run-cache
#    - stage.commit() -> transfers outputs to DVC cache
# 2. _reproduce_stage() then calls:
#    - stage.dump(update_pipeline=False) -> updates dvc.lock only

# This is the code in dvc/repo/reproduce.py:
def _reproduce_stage(stage, **kwargs):
    ret = stage.reproduce(**kwargs)
    if ret and not kwargs.get("dry", False):
        stage.dump(update_pipeline=False)
    return ret
