Principle:Iterative Dvc Stage Execution
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Management, Process_Execution |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Stage execution is the process of running a user-defined pipeline command as an isolated subprocess with controlled environment, working directory, and signal handling, optionally skipping redundant computations via run-cache lookup.
Description
Once a pipeline stage has been determined to need re-execution (via freshness detection), the stage execution principle governs how the command is actually run. This involves several responsibilities: subprocess lifecycle management, environment isolation, run-cache deduplication, and output persistence.
The execution begins with a run-cache lookup. Before invoking the user's command, the system checks whether the exact same computation (same command, same dependency checksums) has been executed before. If a cached result exists and the cached outputs are available, the outputs are checked out from cache and execution is skipped entirely. This memoization can save significant time for expensive computations that are being re-run with identical inputs. Only when no valid cache entry exists does the system proceed to actual command execution.
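The lookup described above can be sketched as plain memoization keyed on the command and the dependency checksums. This is a minimal illustration, not DVC's implementation: `run_cache_key`, `RunCache`, and `run_or_restore` are hypothetical names, the SHA-256-over-JSON key is an assumed scheme, and the check that cached outputs are still available is left out for brevity.

```python
import hashlib
import json


def run_cache_key(cmd, dep_checksums):
    """Derive a stable key from the command and its dependency checksums."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps({"cmd": cmd, "deps": dep_checksums}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Hypothetical in-memory stand-in for a run-cache."""

    def __init__(self):
        self._entries = {}

    def lookup(self, cmd, dep_checksums):
        # Returns the recorded output checksums, or None on a miss.
        return self._entries.get(run_cache_key(cmd, dep_checksums))

    def save(self, cmd, dep_checksums, out_checksums):
        self._entries[run_cache_key(cmd, dep_checksums)] = dict(out_checksums)


def run_or_restore(cache, cmd, dep_checksums, execute):
    """Return (out_checksums, executed); `execute` is skipped on a cache hit."""
    cached = cache.lookup(cmd, dep_checksums)
    if cached is not None:
        return cached, False
    outs = execute()
    cache.save(cmd, dep_checksums, outs)
    return outs, True
```

Changing either the command string or any dependency checksum produces a different key, so the expensive `execute` callable runs again only when the inputs really differ.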
The subprocess management layer handles the mechanics of running the user's command. The command is split into individual lines (multi-line commands are supported), and each line is executed as a separate subprocess invocation. The shell executable is detected from the environment: on Unix it is taken from the SHELL variable, falling back to /bin/sh when SHELL is unset; on Windows it is cmd.exe. Special flags are applied to prevent the shell from loading configuration files that could alter the execution environment (--noprofile --norc for bash, --no-rcs for zsh, --no-config for fish >= 3.3.0).
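As a rough sketch of how one command line might be turned into an argument list, the helper below applies the flags listed above. The names `get_shell`, `make_cmd`, and `split_lines` are illustrative, the fish version check is reduced to a boolean parameter, and Windows handling is omitted; none of this is claimed to match DVC's internal signatures.

```python
import os

# Flags that stop each shell from reading its startup files (taken from the text above).
NO_CONFIG_FLAGS = {
    "bash": ["--noprofile", "--norc"],
    "zsh": ["--no-rcs"],
}


def get_shell(environ=None):
    """Return the shell named by SHELL, falling back to /bin/sh (Unix-only sketch)."""
    environ = os.environ if environ is None else environ
    return environ.get("SHELL") or "/bin/sh"


def make_cmd(executable, cmd_line, fish_supports_no_config=False):
    """Build the argv list that runs one command line through the shell."""
    name = os.path.basename(executable).lower()
    flags = list(NO_CONFIG_FLAGS.get(name, []))
    # Assumption: the caller has already checked that fish is >= 3.3.0.
    if name == "fish" and fish_supports_no_config:
        flags.append("--no-config")
    return [executable, *flags, "-c", cmd_line]


def split_lines(cmd):
    """Split a multi-line command string into one entry per line."""
    return cmd.splitlines()
```

For example, `make_cmd("/bin/bash", "echo hi")` yields `["/bin/bash", "--noprofile", "--norc", "-c", "echo hi"]`, while an unrecognized shell such as `/bin/sh` gets only `-c`.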
Signal handling is critical for proper cleanup. During subprocess execution on the main thread, SIGINT is temporarily ignored in the parent process so that the child process receives the interrupt signal directly, allowing it to handle cleanup. The original signal handler is restored after the subprocess completes. This prevents partial execution state from corrupting the pipeline.
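The ignore-then-restore pattern described above can be illustrated with Python's standard `signal` and `subprocess` modules. This is a simplified sketch under the assumption that the only goal is to leave SIGINT to the child; `run_ignoring_sigint` is a hypothetical helper name.

```python
import signal
import subprocess
import sys
import threading


def run_ignoring_sigint(argv, **popen_kwargs):
    """Run argv, ignoring SIGINT in the parent while the child is running."""
    # signal.signal() may only be called from the main thread.
    on_main_thread = threading.current_thread() is threading.main_thread()
    process = subprocess.Popen(argv, **popen_kwargs)
    old_handler = None
    if on_main_thread:
        old_handler = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        process.communicate()  # wait for the child to finish
    finally:
        # old_handler is None if the previous handler was not set from Python.
        if on_main_thread and old_handler is not None:
            signal.signal(signal.SIGINT, old_handler)
    return process.returncode


if __name__ == "__main__":
    code = run_ignoring_sigint([sys.executable, "-c", "print('child ran')"])
    print("exit code:", code)
```

Restoring the handler in a `finally` block means the parent regains its normal Ctrl+C behavior even if waiting on the child raises.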
The execution is wrapped with a repository unlock mechanism. Since the DVC repository lock is held during reproduction, the actual command execution temporarily unlocks the repository so that DVC commands invoked by the user's script (e.g., dvc pull within a training script) can acquire the lock. After execution, the repository lock is re-acquired.
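The release-and-re-acquire wrapper can be sketched as a context manager. Here a `threading.Lock` stands in for the repository lock purely for illustration (the real lock is not an in-process thread lock), and `unlocked` is a hypothetical name.

```python
import threading
from contextlib import contextmanager


@contextmanager
def unlocked(lock):
    """Release an already-held lock for the duration of the block, then re-acquire it."""
    lock.release()
    try:
        yield
    finally:
        # Re-acquire even if the wrapped command raised.
        lock.acquire()


if __name__ == "__main__":
    repo_lock = threading.Lock()
    repo_lock.acquire()  # held during reproduction
    with unlocked(repo_lock):
        # A nested command could take the lock here.
        print("locked inside block:", repo_lock.locked())
    print("locked after block:", repo_lock.locked())
    repo_lock.release()
```

The `finally` clause matters: if the user's command fails, the caller still holds the lock again when the error propagates.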
After successful command execution, output saving records the current state (checksums, metadata) of all outputs, and output committing transfers output data to the DVC cache. These steps are skipped in dry-run mode.
Usage
Stage execution is the appropriate approach when:
- Pipeline commands need to run in a controlled, reproducible environment with specific working directory and environment variables.
- Run-cache deduplication can save time by skipping expensive computations whose inputs have not changed.
- Multi-line commands need sequential execution with proper error handling on each line.
- Subprocess isolation is required so that an interrupt (such as Ctrl+C) is handled by the child process while the pipeline runner stays alive to observe the result.
- Dry-run mode must be supported for users to preview what would be executed without side effects.
Theoretical Basis
The stage execution algorithm follows a try-cache-then-execute pattern:
PROCEDURE RunStage(stage, dry, force, run_env):
    // Phase 0: Clean up previous outputs
    IF stage HAS cmd AND NOT frozen AND NOT dry:
        REMOVE_OUTPUTS(stage)

    // Phase 1: Try run-cache (memoization)
    IF NOT force:
        IF pull_mode AND NOT dry:
            PULL_MISSING_DEPS(stage)
        cache_entry = RUN_CACHE.LOOKUP(stage)
        IF cache_entry EXISTS:
            IF NOT dry:
                CHECKOUT_OUTPUTS_FROM_CACHE(cache_entry)
                RETURN  // skip execution entirely
        ELSE:
            IF NOT dry:
                SAVE_DEPENDENCY_STATE(stage)  // record current dep hashes

    // Phase 2: Execute command via subprocess
    executable = GET_SHELL()  // SHELL env var or /bin/sh
    commands = SPLIT_LINES(stage.cmd)
    env = FIX_ENV(os.environ)
    env["DVC_ROOT"] = stage.repo.root_dir
    env["DVC_STAGE"] = stage.addressing
    IF run_env:
        env.UPDATE(run_env)
    FOR EACH cmd_line IN commands:
        LOG("> " + cmd_line)
        IF dry:
            CONTINUE
        exec_cmd = MAKE_CMD(executable, cmd_line)
        // e.g., ["/bin/bash", "--noprofile", "--norc", "-c", cmd_line]
        process = SUBPROCESS.POPEN(exec_cmd, cwd=stage.wdir, env=env)
        IF MAIN_THREAD:
            old_handler = SIGNAL.SET(SIGINT, IGNORE)
        process.COMMUNICATE()  // wait for completion
        IF MAIN_THREAD:
            SIGNAL.SET(SIGINT, old_handler)
        IF process.returncode != 0:
            RAISE StageCmdFailedError(cmd_line, process.returncode)

    // Phase 3: Save outputs and commit to cache
    IF NOT dry:
        SAVE_DEPS(stage)
        SAVE_OUTS(stage)
        stage.md5 = COMPUTE_MD5(stage)
        RUN_CACHE.SAVE(stage)  // persist for future memoization
        COMMIT_OUTPUTS_TO_CACHE(stage)
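Phase 2 of the procedure above (environment setup, dry-run logging, and fail-fast execution) might look roughly like this in Python. This is a sketch under stated assumptions: `build_env` and `run_commands` are hypothetical helpers, `StageCmdFailedError` is redefined locally as a stand-in, and `shell=True` replaces the explicit shell detection and flag handling for brevity.

```python
import os
import subprocess
import sys


class StageCmdFailedError(Exception):
    """Local stand-in: raised when a command line exits with a non-zero status."""

    def __init__(self, cmd_line, returncode):
        super().__init__(f"failed to run: {cmd_line}, exited with {returncode}")
        self.cmd_line = cmd_line
        self.returncode = returncode


def build_env(root_dir, stage_name, run_env=None, base_env=None):
    """Copy the environment and add the DVC_ROOT / DVC_STAGE variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["DVC_ROOT"] = root_dir
    env["DVC_STAGE"] = stage_name
    if run_env:
        env.update(run_env)  # caller-supplied values win
    return env


def run_commands(cmd_lines, wdir, env, dry=False):
    """Run each line in order; stop at the first non-zero exit code."""
    executed = []
    for cmd_line in cmd_lines:
        print("> " + cmd_line)
        if dry:
            continue  # dry run: show the command, run nothing
        # Assumption: shell=True (the platform default shell) is used for brevity.
        returncode = subprocess.call(cmd_line, shell=True, cwd=wdir, env=env)
        executed.append(cmd_line)
        if returncode != 0:
            raise StageCmdFailedError(cmd_line, returncode)
    return executed


if __name__ == "__main__":
    demo_env = build_env(os.getcwd(), "demo")
    run_commands([f'"{sys.executable}" -c "pass"'], os.getcwd(), demo_env)
```

Because the exception is raised inside the loop, later lines of a multi-line command never run once an earlier line has failed.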
Key design principles:
- Fail-fast on subprocess errors: If any command line returns a non-zero exit code, execution stops immediately with StageCmdFailedError.
- Environment purity: Shell configuration files are suppressed (for example, --noprofile --norc for bash) to prevent the user's shell customizations from affecting reproducibility.
- Consistent state saving: Dependency and output states are saved in the same step, and the stage MD5 is computed over the combined state, so the recorded checksum reflects both.
- Cache-before-execute: The run-cache check happens before the user's command is invoked, so a cache hit costs only an output checkout instead of a full run. Previous outputs are still removed first (Phase 0).