Implementation: Iterative DVC ProjectFile Dump
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Management, Reproducibility |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for persisting pipeline execution state to lockfiles and run-caches after successful stage execution, provided by the DVC library.
Description
The ProjectFile.dump() method in DVC's dvc.dvcfile module is the primary entry point for persisting a pipeline stage's state after execution. It coordinates two operations: optionally updating the dvc.yaml pipeline file with the stage definition, and updating the dvc.lock lockfile with the stage's current checksums.
For pipeline file updates, _dump_pipeline_file() serializes the stage using serialize.to_pipeline_file() and applies the result to the existing dvc.yaml using apply_diff(), which preserves existing YAML structure, comments, and ordering. If the stage entry already exists, it is updated in place; otherwise, a new entry is added. Parametrized stages (those generated from foreach/matrix templates) cannot be dumped, and attempting to do so raises ParametrizedDumpError.
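The in-place merge semantics described above can be sketched with a minimal helper. This is an illustration of the behavior (update existing entries in place so ordering is preserved, append new ones), not DVC's actual apply_diff() implementation, which additionally preserves YAML comments via round-trip parsing; the helper name is hypothetical.

```python
# Sketch of apply_diff-style merge semantics (illustrative, not DVC's
# actual implementation): update an existing "stages" entry in place,
# preserving the position of untouched entries, or append a new one.
def merge_stage(pipeline: dict, name: str, definition: dict) -> dict:
    stages = pipeline.setdefault("stages", {})
    if name in stages:
        # Mutate the existing mapping so unrelated keys keep their order.
        stages[name].clear()
        stages[name].update(definition)
    else:
        stages[name] = definition
    return pipeline

doc = {"stages": {"prepare": {"cmd": "python prepare.py"}}}
merge_stage(doc, "train", {"cmd": "python train.py", "deps": ["data"]})
merge_stage(doc, "prepare", {"cmd": "python prepare.py --fast"})
print(list(doc["stages"]))  # order preserved: ['prepare', 'train']
```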
For lockfile updates, the method delegates to Lockfile.dump_stages(), which serializes each stage using serialize.to_lockfile() and merges the result into the existing dvc.lock file using modify_yaml(). The lockfile uses a "2.0" schema format and is created automatically if it does not exist. Only truly modified stage entries trigger a write and Git tracking update, avoiding unnecessary file system operations.
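The create-if-missing and write-only-on-change behavior can be sketched as follows. The function name and return-value convention are assumptions for illustration; DVC's modify_yaml() works as a context manager over the actual file.

```python
# Hedged sketch of the lockfile merge: seed the "2.0" schema if the
# data is new, merge serialized stage entries, and report whether
# anything actually changed so the caller can skip the write and the
# Git tracking update when nothing did.
import copy

def dump_stages_to_lock(lock_data: dict, serialized: dict) -> bool:
    """Merge stage entries; return True only if the data changed."""
    lock_data.setdefault("schema", "2.0")
    stages = lock_data.setdefault("stages", {})
    before = copy.deepcopy(stages)
    stages.update(serialized)
    return stages != before  # caller writes to disk only on True

lock = {}
print(dump_stages_to_lock(lock, {"train": {"cmd": "python train.py"}}))  # True
print(dump_stages_to_lock(lock, {"train": {"cmd": "python train.py"}}))  # False
```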
The StageCache.save() method in dvc/stage/cache.py provides the complementary run-cache persistence. It first checks if the stage is cacheable (has command, dependencies, and outputs; is not a callback or always-changed stage). The cache key is computed from a SHA-256 hash of the stage's dependency state (command + dependency checksums + output paths), and the cache value from a hash of the complete lockfile data. The entry is written to a directory structure <cache_dir>/runs/<key[:2]>/<key>/<value> as a YAML file after schema validation. Before writing, any outputs that have use_cache=False are committed to the DVC cache using copy links to ensure they are preserved.
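The key/value scheme above can be made concrete with a short sketch. The exact field layout and serialization that DVC hashes are internal details, so the dictionaries below are assumptions; only the overall shape (key from command + dependency checksums + output paths, value from the full lockfile entry, path `runs/<key[:2]>/<key>/<value>`) follows the description.

```python
# Illustrative sketch of the run-cache addressing scheme: the cache key
# covers the dependency-side state, the cache value covers the complete
# lockfile entry. Field names here are assumed, not DVC's wire format.
import hashlib
import json

def _sha256(data: dict) -> str:
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

lock_entry = {
    "cmd": "python train.py",
    "deps": [{"path": "data.csv", "md5": "abc123"}],
    "outs": [{"path": "model.pkl", "md5": "def456"}],
}
cache_key = _sha256({
    "cmd": lock_entry["cmd"],
    "deps": lock_entry["deps"],
    "outs": [o["path"] for o in lock_entry["outs"]],  # paths only, no checksums
})
cache_value = _sha256(lock_entry)
print(f"runs/{cache_key[:2]}/{cache_key}/{cache_value}")
```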
Usage
Use ProjectFile.dump() and StageCache.save() when you need to:
- Persist stage execution results to the lockfile after successful reproduction.
- Update the pipeline file (dvc.yaml) with modified stage definitions (e.g., after dvc run).
- Save computation results to the run-cache for future deduplication.
- Track lockfile and pipeline file changes in Git via the SCM context.
Code Reference
Source Location
- Repository: DVC
- File: dvc/dvcfile.py - Lines: L239-261 (ProjectFile.dump and dump_stages), L283-284 (_dump_lockfile), L291-315 (_dump_pipeline_file)
- File: dvc/dvcfile.py - Lines: L427-453 (Lockfile.dump_stages)
- File: dvc/stage/cache.py - Lines: L157-190 (StageCache.save)
Signature
class ProjectFile(FileMixin):
    def dump(
        self,
        stage: "Stage",
        update_pipeline: bool = True,
        update_lock: bool = True,
        **kwargs,
    ) -> None:
        ...

    def dump_stages(
        self,
        stages: list,
        update_pipeline: bool = True,
        update_lock: bool = True,
        **kwargs,
    ) -> None:
        ...

class Lockfile(FileMixin):
    def dump_stages(self, stages: list, **kwargs) -> None:
        ...

    def dump(self, stage, **kwargs) -> None:
        ...

class StageCache:
    def __init__(self, repo) -> None:
        ...

    def save(self, stage) -> None:
        ...
Import
from dvc.dvcfile import ProjectFile, Lockfile
from dvc.stage.cache import StageCache
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| stage | Stage (PipelineStage) | Yes | The executed stage with fresh hash_info on all deps and outs |
| update_pipeline | bool | No | If True (default), update dvc.yaml with the stage definition |
| update_lock | bool | No | If True (default), update dvc.lock with current checksums |
| kwargs | dict | No | Additional keyword arguments passed to lockfile serialization |
Outputs
| Name | Type | Description |
|---|---|---|
| dvc.lock | file (side effect) | Updated lockfile containing cmd, deps checksums, outs checksums, and params values for the stage |
| dvc.yaml | file (side effect) | Updated pipeline file with stage definition (if update_pipeline=True) |
| run-cache entry | file (side effect) | YAML file at <cache>/runs/<key[:2]>/<key>/<value> containing cached lockfile data (from StageCache.save) |
Usage Examples
Basic Usage
from dvc.repo import Repo
repo = Repo(".")
# After reproducing a stage, dump its state
stage = repo.stage.collect("train")[0]
# Dump to both dvc.yaml and dvc.lock
stage.dump(update_pipeline=True, update_lock=True)
# Or dump only lockfile (common during `dvc repro`)
stage.dump(update_pipeline=False, update_lock=True)
Direct Lockfile Update
from dvc.dvcfile import ProjectFile
# `stage` obtained as in the previous example; stage.dvcfile is a ProjectFile
project_file = stage.dvcfile
project_file.dump(stage, update_pipeline=False, update_lock=True)
# This updates dvc.lock with the stage's current checksums
# and tracks the lockfile in Git
Run-Cache Save
# The run-cache is saved automatically during Stage.save()
# but can also be invoked directly:
repo.stage_cache.save(stage)
# Computes cache_key from deps + cmd, cache_value from complete state
# Writes to: .dvc/cache/runs/<key[:2]>/<key>/<value>
Reproduction Flow (How Dump Is Called)
# During dvc repro, the flow is:
# 1. stage.run() is called, which internally calls:
# - stage.save() -> saves deps, outs, md5, and run-cache
# - stage.commit() -> transfers outputs to DVC cache
# 2. _reproduce_stage() then calls:
# - stage.dump(update_pipeline=False) -> updates dvc.lock only
# This is the code in dvc/repo/reproduce.py:
def _reproduce_stage(stage, **kwargs):
    ret = stage.reproduce(**kwargs)
    if ret and not kwargs.get("dry", False):
        stage.dump(update_pipeline=False)
    return ret