Heuristic:Iterative Dvc Run Cache Restoration Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Pipeline_Execution |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Pipeline optimization that skips stage execution by restoring outputs from a run cache when inputs have not changed.
Description
DVC maintains a run cache that maps the hash of a stage's dependencies (inputs, parameters, command) to the hash of its outputs. Before executing a stage command, DVC checks if this exact combination of inputs has been seen before. If a match is found, the cached outputs are restored directly without re-running the command. This is implemented in `dvc/stage/cache.py` and used as a "try restore first, then run" pattern in `dvc/stage/run.py`.
Usage
This heuristic is applied automatically during pipeline reproduction (`dvc repro`). It is most beneficial for expensive stages (model training, data processing) that are re-run with the same inputs. The run cache works alongside the regular DVC cache and can be transferred to and from remote storage with the `--run-cache` option of `dvc push` and `dvc pull`.
The Insight (Rule of Thumb)
- Action: Before running a stage, attempt `stage.repo.stage_cache.restore(stage)`. Run the command only if `RunCacheNotFoundError` is raised, or if `force` is set, in which case the lookup is skipped entirely.
- Value: Eliminates redundant computation for stages with unchanged inputs.
- Trade-off: Run cache lookup adds a small overhead per stage. The cache itself uses disk space proportional to the number of unique input-output combinations seen.
- Compatibility: Uses a temporary stage object to avoid accidentally modifying the original stage state during restoration.
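The rule above can be sketched with a toy, in-memory cache. This is a minimal illustration, not DVC's implementation: `CacheMiss`, `ToyStageCache`, and `run_with_cache` are hypothetical names, the key is an opaque string, and the "outputs" are plain dictionaries rather than files.

```python
class CacheMiss(Exception):
    """Hypothetical stand-in for DVC's RunCacheNotFoundError."""


class ToyStageCache:
    """In-memory stand-in for a run cache: input key -> recorded outputs."""

    def __init__(self):
        self._entries = {}

    def save(self, key, outputs):
        self._entries[key] = outputs

    def restore(self, key):
        try:
            return self._entries[key]
        except KeyError:
            raise CacheMiss(key) from None


def run_with_cache(cache, key, command, force=False):
    """Return (outputs, executed) for one toy stage."""
    if not force:
        try:
            # Cache hit: reuse the recorded outputs, skip the command.
            return cache.restore(key), False
        except CacheMiss:
            pass  # Cache miss: fall through and execute.
    outputs = command()
    cache.save(key, outputs)
    return outputs, True


calls = []


def train():
    # Stands in for an expensive stage command.
    calls.append("train")
    return {"model.pkl": "hash-of-model"}


cache = ToyStageCache()
first = run_with_cache(cache, "inputs-v1", train)               # miss: executes
second = run_with_cache(cache, "inputs-v1", train)              # hit: restored
forced = run_with_cache(cache, "inputs-v1", train, force=True)  # lookup skipped
```

In this sketch the command runs twice across the three calls: once on the initial miss and once because `force=True` bypasses the lookup.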
Reasoning
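Pipeline reproduction frequently revisits input combinations that were already computed, for example after switching branches or reverting a parameter to an earlier value. Without the run cache, DVC would re-execute any stage whose inputs differ from those recorded in the current `dvc.lock`, even if the identical computation was performed previously. The run cache acts as a memoization layer at the pipeline stage level.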
The implementation uses a temporary stage copy during restoration to work around a limitation where `commit/checkout` operations do not work correctly for uncached outputs. This is documented in the source code as a known workaround.
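The idea of restoring through a temporary stage can be illustrated as follows. This is a sketch under assumptions: `ToyStage` and `restore_into_copy` are hypothetical, and `copy.deepcopy` is used here only to show the isolation; how DVC actually constructs its temporary stage is not shown in the excerpt.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    cmd: str
    outs: dict = field(default_factory=dict)  # output path -> hash (or None)


def restore_into_copy(stage, cached_outs, checkout):
    """Apply cached output hashes to a copy and check it out.

    The original `stage` is never mutated, so a failed or partial
    restore cannot leave it in an inconsistent state.
    """
    tmp = copy.deepcopy(stage)
    tmp.outs.update(cached_outs)
    checkout(tmp)  # hypothetical callback that materializes the outputs
    return tmp


original = ToyStage(cmd="python train.py", outs={"model.pkl": None})
checked_out = []
restored = restore_into_copy(original, {"model.pkl": "abc123"}, checked_out.append)
```

After the call, `original.outs` still holds `None` for `model.pkl`, while the temporary copy carries the restored hash.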
Code Evidence
Restore-before-run pattern from `dvc/stage/run.py:166-182`:
def run_stage(stage, dry=False, force=False, run_env=None, **kwargs):
    if not force:
        if kwargs.get("pull") and not dry:
            _pull_missing_deps(stage)
        from .cache import RunCacheNotFoundError
        try:
            stage.repo.stage_cache.restore(stage, dry=dry, **kwargs)
            if not dry:
                return
        except RunCacheNotFoundError:
            if not dry:
                stage.save_deps()
    run = cmd_run if dry else unlocked_repo(cmd_run)
    run(stage, dry=dry, run_env=run_env)
Temporary stage workaround from `dvc/stage/cache.py:144-152`:
# NOTE: using temporary stage to avoid accidentally modifying
# original stage and to workaround 'commit/checkout' not working
# for uncached outputs.
# NOTE: using copy link to make it look like a git-tracked file
Run cache key computation from `dvc/stage/cache.py` involves hashing all stage dependencies:
# The cache key is derived from:
# - stage command
# - dependency hashes (inputs)
# - parameter values
# This creates a unique identifier for each input combination
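A key of this kind can be sketched by hashing a deterministic serialization of the inputs. The function below is an illustration only: `toy_cache_key` is a hypothetical name, and the JSON-plus-SHA-256 scheme is an assumption chosen for clarity, not a claim about the exact serialization DVC uses.

```python
import hashlib
import json


def toy_cache_key(cmd, dep_hashes, params):
    """Derive a hex digest from the command, dependency hashes, and parameters."""
    payload = {"cmd": cmd, "deps": dep_hashes, "params": params}
    # sort_keys makes the serialization independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


base = toy_cache_key("python train.py", {"data.csv": "d41d8c"}, {"lr": 0.01})
same = toy_cache_key("python train.py", {"data.csv": "d41d8c"}, {"lr": 0.01})
changed = toy_cache_key("python train.py", {"data.csv": "d41d8c"}, {"lr": 0.02})
```

Identical inputs yield the same key (`base == same`), while changing a single parameter value yields a different one, which is what forces a cache miss and a fresh run.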