Principle:Iterative Dvc Lockfile And Cache Update

From Leeroopedia


Knowledge Sources
Domains: Pipeline_Management, Reproducibility
Last Updated: 2026-02-10 00:00 GMT

Overview

Lockfile and cache update is the process of persisting the exact checksums of all dependencies and outputs after successful stage execution into lockfiles and run-caches, enabling future change detection and computation deduplication.

Description

After a pipeline stage executes successfully, two persistence operations must occur to maintain the reproducibility contract. The lockfile update records the exact state of the stage -- its command, dependency checksums, output checksums, and parameter values -- into a dvc.lock file. This lockfile serves as the ground truth for future change detection: when the pipeline is next evaluated, the freshness detection system compares current state against the lockfile to determine which stages need re-execution.
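
The comparison against the lockfile can be sketched as follows. This is a minimal illustration, not DVC's implementation: the function name `stage_is_stale`, the helper `_checksums`, and the in-memory `current` mapping are assumptions made for this example. Only the entry layout (cmd, deps, outs, params) comes from the description above.

```python
from typing import Optional


def _checksums(entries: list) -> dict:
    """Map each path to its recorded md5 checksum."""
    return {entry["path"]: entry["md5"] for entry in entries}


def stage_is_stale(lock_entry: Optional[dict], current: dict) -> bool:
    """Return True if the stage must be re-executed.

    `lock_entry` is the stage's entry from the lockfile (None if absent);
    `current` holds the same fields computed from the workspace right now.
    Both names are hypothetical and exist only for this sketch.
    """
    if lock_entry is None:
        return True  # no recorded state to compare against
    if lock_entry.get("cmd") != current.get("cmd"):
        return True  # the command itself changed
    if lock_entry.get("params", {}) != current.get("params", {}):
        return True  # a tracked parameter value changed
    for field in ("deps", "outs"):
        if _checksums(lock_entry.get(field, [])) != _checksums(current.get(field, [])):
            return True  # a dependency or output no longer matches
    return False


# Sample values only; the checksums are placeholders, not real hashes.
locked = {
    "cmd": "python train.py",
    "deps": [{"path": "data.csv", "md5": "abc123", "size": 1024}],
    "outs": [{"path": "model.pkl", "md5": "def456", "size": 50000}],
}
edited = {**locked, "deps": [{"path": "data.csv", "md5": "fff999", "size": 1030}]}
```

With these sample values, `stage_is_stale(locked, locked)` is false, while `stage_is_stale(locked, edited)` is true because the checksum recorded for data.csv no longer matches.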

The run-cache update provides a complementary form of persistence oriented toward computation deduplication. It saves the stage's lockfile data (command, dependency checksums, output checksums) into a content-addressed directory structure within the DVC cache. The key for this cache is a SHA-256 hash of the stage's inputs (the command and dependency checksums, with outputs reduced to their paths), while the value is a SHA-256 hash of the complete stage data (including output checksums). This enables a lookup pattern: given the current command and dependency checksums, find a previously computed result and the output checksums it recorded.
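
The key and value derivation can be illustrated with a short sketch. The page does not say how stage data is serialized before hashing, so the canonical JSON encoding used here is an assumption, as are the function names `run_cache_key` and `run_cache_value`. Reducing outputs to their paths for the key follows the SaveRunCache procedure in the Theoretical Basis section.

```python
import hashlib
import json


def _sha256_of(data: dict) -> str:
    """Hash a dictionary through a canonical JSON encoding (assumed format)."""
    encoded = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_cache_key(stage_data: dict) -> str:
    """Key: command, dependency checksums, and output paths only."""
    inputs = {
        "cmd": stage_data["cmd"],
        "deps": stage_data.get("deps", []),
        "outs": [out["path"] for out in stage_data.get("outs", [])],
    }
    return _sha256_of(inputs)


def run_cache_value(stage_data: dict) -> str:
    """Value: the complete stage data, including output checksums."""
    return _sha256_of(stage_data)


# Two runs with identical inputs whose outputs differ (placeholder checksums),
# as could happen with a non-deterministic command.
run_a = {
    "cmd": "python train.py",
    "deps": [{"path": "data.csv", "md5": "abc123"}],
    "outs": [{"path": "model.pkl", "md5": "def456"}],
}
run_b = {**run_a, "outs": [{"path": "model.pkl", "md5": "0a1b2c"}]}
```

Under these assumptions `run_a` and `run_b` share one key but produce different values, so both results can be stored side by side under the same key.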

The lockfile format follows a structured schema (version "2.0") with a top-level stages dictionary. Each stage entry contains cmd (the command), deps (list of dependency paths and their checksums), outs (list of output paths and their checksums), and optionally params (parameter file paths and their values). The lockfile is updated incrementally -- only stage entries whose serialized data changed are replaced, unrelated entries are preserved, and the file is rewritten only when at least one entry was modified.
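
As an illustration of that structure, the sketch below holds a lockfile's contents as a Python dictionary and checks its overall shape. It is not DVC's schema validator: `check_lockfile_shape` and the stage name "train" are hypothetical, and the sample values are taken from the comments in the DumpLockfile procedure on this page.

```python
LOCKFILE_EXAMPLE = {
    "schema": "2.0",
    "stages": {
        "train": {  # hypothetical stage name
            "cmd": "python train.py",
            "deps": [{"path": "data.csv", "md5": "abc123", "size": 1024}],
            "params": {"params.yaml": {"lr": 0.001, "epochs": 10}},
            "outs": [{"path": "model.pkl", "md5": "def456", "size": 50000}],
        }
    },
}


def check_lockfile_shape(data: dict) -> list:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    if data.get("schema") != "2.0":
        problems.append("schema must be '2.0'")
    stages = data.get("stages")
    if not isinstance(stages, dict):
        return problems + ["stages must be a dictionary"]
    for name, entry in stages.items():
        if "cmd" not in entry:
            problems.append(f"{name}: missing cmd")
        for field in ("deps", "outs"):
            for item in entry.get(field, []):
                if "path" not in item:
                    problems.append(f"{name}: {field} entry without path")
        if not isinstance(entry.get("params", {}), dict):
            problems.append(f"{name}: params must be a dictionary")
    return problems
```

Calling `check_lockfile_shape(LOCKFILE_EXAMPLE)` returns an empty list. A real validator would also check checksum formats and field types, which this sketch leaves out.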

The lockfile and the run-cache are kept separate because they serve different purposes. The lockfile is version-controlled (committed to Git) and tracks the most recent execution state for change detection. The run-cache is a local optimization (stored in the DVC cache directory) that enables skipping execution even when the lockfile state does not match (e.g., after reverting to a previous Git commit and re-running).
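
A run-cache lookup of this kind might look like the sketch below, which searches the directory layout used by the SaveRunCache procedure in the Theoretical Basis section (key prefix, key, value). The function names are hypothetical. This page states only that the newest result is preferred during restoration; judging recency by file modification time is an assumption of this example.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_cached_run(cache_dir: Path, cache_key: str) -> Optional[Path]:
    """Return the newest cached entry for `cache_key`, or None if there is none."""
    key_dir = cache_dir / cache_key[:2] / cache_key
    if not key_dir.is_dir():
        return None
    entries = [path for path in key_dir.iterdir() if path.is_file()]
    if not entries:
        return None
    # Assumption: "newest" means the most recent modification time.
    return max(entries, key=lambda path: path.stat().st_mtime)


def demo() -> str:
    """Store two results under one key and return the name of the one selected."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp)
        key = "ab" + "0" * 62  # placeholder 64-character key
        key_dir = cache_dir / key[:2] / key
        key_dir.mkdir(parents=True)
        for age_seconds, value in ((200, "older_value"), (100, "newer_value")):
            entry = key_dir / value
            entry.write_text("cmd: python train.py\n")
            stamp = 1_000_000 - age_seconds
            os.utime(entry, (stamp, stamp))  # set access and modification times
        hit = find_cached_run(cache_dir, key)
        return hit.name if hit else ""
```

A cache hit means the stage's outputs can be restored from the stored entry instead of being recomputed; a miss (`None`) means the stage has to run.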

Usage

Lockfile and cache update should be employed whenever:

  • Reproducibility guarantees are required -- the lockfile captures the precise state needed to reproduce a pipeline run.
  • Change detection is needed for incremental pipeline reproduction -- comparing current state against lockfile state.
  • Computation deduplication can save resources -- the run-cache allows skipping expensive computations that have been performed before with identical inputs.
  • Pipeline state persistence must survive across sessions -- lockfiles are committed to version control, allowing any team member to detect changes relative to the last known good state.
  • Remote sharing of run-cache is desired -- run-cache entries can be pushed to and pulled from remote storage, allowing team members to skip computations that others have already performed.

Theoretical Basis

The lockfile update and run-cache save follow complementary persistence patterns:

PROCEDURE DumpLockfile(stages, lockfile_path):
    existing_data = YAML_LOAD(lockfile_path) OR {"schema": "2.0", "stages": {}}
    is_modified = False

    FOR EACH stage IN stages:
        stage_data = SERIALIZE_TO_LOCKFILE(stage)
        // stage_data contains:
        //   cmd: "python train.py"
        //   deps: [{path: "data.csv", md5: "abc123", size: 1024}, ...]
        //   params: {params.yaml: {lr: 0.001, epochs: 10}}
        //   outs: [{path: "model.pkl", md5: "def456", size: 50000}, ...]

        // A stage without an existing entry also counts as modified
        IF stage.name NOT IN existing_data.stages
           OR stage_data != existing_data.stages[stage.name]:
            existing_data.stages[stage.name] = stage_data
            is_modified = True

    // Rewrite the file only when at least one entry changed
    IF is_modified:
        YAML_DUMP(lockfile_path, existing_data)
        GIT_TRACK(lockfile_path)


PROCEDURE SaveRunCache(stage, cache_dir):
    // Check if this stage type can be cached
    IF stage.is_callback OR stage.always_changed:
        RETURN  // not cacheable
    IF NOT ALL(stage.cmd, stage.deps, stage.outs):
        RETURN  // incomplete stage

    // Compute cache key from inputs (command + dependency hashes)
    lockfile_data = SERIALIZE_TO_LOCKFILE(stage)
    cache_key = SHA256(lockfile_data with outs reduced to paths only)
    cache_value = SHA256(lockfile_data complete)

    // Check for existing cache entry
    existing = LOAD_CACHE(cache_key, cache_value)
    IF existing:
        RETURN  // already cached, nothing to do

    // Handle uncached outputs (outputs with use_cache=False)
    FOR EACH out IN stage.outs WHERE NOT out.use_cache:
        COMMIT_TO_CACHE(out)  // using copy link type

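    // Save cache entry (write to a temporary file, then move it into place)
    cache_path = cache_dir / cache_key[:2] / cache_key / cache_value
    VALIDATE_SCHEMA(lockfile_data)
    tmp_path = TEMP_FILE_IN(cache_dir)
    YAML_DUMP(tmp_path, lockfile_data)
    MOVE(tmp_path, cache_path)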


PROCEDURE ProjectFileDump(stage, update_pipeline, update_lock):
    IF update_pipeline:
        // Update dvc.yaml with stage definition
        WITH MODIFY_YAML(dvc.yaml):
            data["stages"][stage.name] = SERIALIZE_TO_PIPELINE(stage)
        GIT_TRACK(dvc.yaml)

    IF update_lock:
        DumpLockfile([stage], dvc.lock)
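
The merge step of DumpLockfile can be written as runnable Python. This is a sketch under stated assumptions: it works on plain dictionaries, leaves YAML loading, dumping, and Git tracking to the caller, and the name `merge_stage_entries` is hypothetical.

```python
import copy


def merge_stage_entries(existing: dict, new_entries: dict) -> tuple:
    """Return (updated_data, is_modified) without mutating `existing`.

    `existing` is the parsed lockfile (or an empty dict if there is none);
    `new_entries` maps stage names to their freshly serialized stage data.
    """
    data = copy.deepcopy(existing) if existing else {"schema": "2.0"}
    stages = data.setdefault("stages", {})
    is_modified = False
    for name, stage_data in new_entries.items():
        # A missing entry and a differing entry are both treated as changes.
        if stages.get(name) != stage_data:
            stages[name] = stage_data
            is_modified = True
    return data, is_modified


# Hypothetical stage names and commands, for illustration only.
base = {"schema": "2.0", "stages": {"prepare": {"cmd": "python prep.py"}}}
train = {"cmd": "python train.py"}
updated, changed = merge_stage_entries(base, {"train": train})
```

Here `changed` is true and the unrelated "prepare" entry is carried over untouched. Merging the same "train" entry into `updated` a second time reports no modification, so a caller would skip the write.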

Key theoretical properties:

  • Incremental updates: Only modified stage entries in the lockfile are updated, preserving existing entries and minimizing unnecessary file writes.
  • Content-addressed caching: The run-cache uses content-based addressing (SHA-256 of stage data), so identical stage data always maps to the same cache location and lookup is an exact match.
  • Two-level key scheme: The run-cache key is derived from inputs (deps + cmd) while the value is derived from the complete state (deps + outs + cmd). This allows multiple cached results for the same inputs to coexist (e.g., from non-deterministic commands), with the newest result preferred during restoration.
  • Schema validation: Both lockfile and run-cache entries are validated against a schema before writing, ensuring structural integrity.
  • Atomic writes: Run-cache entries are written to a temporary file and then moved to the final path, preventing partial writes from corrupting the cache.
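
The atomic-write property can be sketched with the standard library. This is a generic temp-file-then-rename pattern and not necessarily how DVC implements it; the function name `atomic_write_text` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(final_path: Path, text: str) -> None:
    """Write `text` to `final_path` so readers never observe a partial file."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file in the destination directory so that the
    # final rename stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        # os.replace overwrites any existing file; the rename is atomic on POSIX.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process is interrupted before the rename, the final path is either absent or still holds its previous contents, so a reader never sees a truncated cache entry.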

Related Pages

Implemented By

Uses Heuristic
