Implementation:Iterative Dvc Stage Utils
Domains
Pipeline_Management, Validation
Overview
Concrete tool for validating, processing, and serializing DVC stage configurations. The module dvc/stage/utils.py provides a collection of utility functions used throughout the stage lifecycle, including path validation, output/dependency population, integrity checks, MD5 computation, and stage serialization.
Description
This module contains stateless utility functions that operate on DVC Stage and PipelineStage objects. The functions fall into several categories:
Path validation:
- check_stage_path() -- validates that a stage path exists, is a directory, and is within the DVC project root.
- resolve_wdir() -- computes the relative working directory for serialization.
- resolve_paths() -- resolves absolute paths for a stage file and its working directory.
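The three checks attributed to check_stage_path() (exists, is a directory, lies within the project root) can be illustrated with the standard library alone. This is a minimal sketch, not DVC's code: the exception classes are local stand-ins that borrow the DVC names, `root_dir` replaces the `repo` argument, and the containment test via os.path.commonpath is an assumption about how "within the project root" might be decided.

```python
import os
import tempfile


class StagePathNotFoundError(Exception):
    """Local stand-in for the DVC exception of the same name."""


class StagePathNotDirectoryError(Exception):
    """Local stand-in for the DVC exception of the same name."""


class StagePathOutsideError(Exception):
    """Local stand-in for the DVC exception of the same name."""


def check_path_sketch(root_dir, path):
    """Raise if `path` is missing, not a directory, or outside `root_dir`."""
    real_path = os.path.realpath(path)
    root = os.path.realpath(root_dir)
    if not os.path.exists(real_path):
        raise StagePathNotFoundError(path)
    if not os.path.isdir(real_path):
        raise StagePathNotDirectoryError(path)
    # The root itself is accepted; anything that does not share the root
    # as its common prefix is rejected.
    if os.path.commonpath([real_path, root]) != root:
        raise StagePathOutsideError(path)


with tempfile.TemporaryDirectory() as base:
    project = os.path.join(base, "project")
    os.makedirs(os.path.join(project, "src"))
    # A subdirectory of the project passes silently.
    check_path_sketch(project, os.path.join(project, "src"))
    # The parent of the project is outside the root and is rejected.
    try:
        check_path_sketch(project, base)
        outside_rejected = False
    except StagePathOutsideError:
        outside_rejected = True
```

The order of the checks matters for error reporting: a missing path is reported as missing rather than as "outside", which is usually the more helpful message.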
Output and dependency population:
- fill_stage_outputs() -- populates stage.outs from keyword arguments covering all output types (outs, metrics, plots, with persist/no_cache variants).
- fill_stage_dependencies() -- populates stage.deps from deps, erepo, params, fs_config, and db arguments.
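One way to cover every output variant without a separate code path per type is to derive the flags from the keyword name itself. The sketch below illustrates that idea only; `OutputSpec` and `fill_outputs_sketch` are hypothetical names, and the real function builds DVC output objects and assigns them to stage.outs rather than returning plain records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputSpec:
    """Hypothetical record standing in for a DVC output object."""
    path: str
    use_cache: bool
    persist: bool
    metric: bool
    plot: bool


def fill_outputs_sketch(**kwargs):
    """Turn keyword arguments such as `metrics_no_cache=[...]` into specs."""
    specs = []
    for key, paths in kwargs.items():
        for path in paths:
            specs.append(
                OutputSpec(
                    path=path,
                    use_cache="no_cache" not in key,  # *_no_cache disables caching
                    persist="persist" in key,         # *_persist keeps the output across runs
                    metric=key.startswith("metrics"),
                    plot=key.startswith("plots"),
                )
            )
    return specs


specs = fill_outputs_sketch(
    outs=["model.pkl"],
    metrics_no_cache=["scores.json"],
    plots_persist=["loss.csv"],
)
```

Encoding the variant in the key name keeps the call site declarative: adding a path to a different keyword changes its flags without any extra arguments.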
Integrity checks:
- check_circular_dependency() -- raises CircularDependencyError if any dependency path also appears as an output path.
- check_duplicated_arguments() -- raises ArgumentDuplicationError if any path appears more than once across deps and outs.
- check_missing_outputs() -- raises MissingDataSource if any output does not exist on the filesystem.
- check_no_externals() -- raises StageExternalOutputsError if any cached output is outside the DVC repo.
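The first two checks reduce to simple set and counting operations over paths. The following sketch works on plain path strings instead of DVC dependency and output objects; the exception classes are local stand-ins that borrow the DVC names, and the function names are hypothetical.

```python
from collections import Counter


class CircularDependencyError(Exception):
    """Local stand-in for the DVC exception of the same name."""


class ArgumentDuplicationError(Exception):
    """Local stand-in for the DVC exception of the same name."""


def check_circular_sketch(dep_paths, out_paths):
    """Raise if any path is both a dependency and an output."""
    overlap = set(dep_paths) & set(out_paths)
    if overlap:
        raise CircularDependencyError(sorted(overlap)[0])


def check_duplicates_sketch(dep_paths, out_paths):
    """Raise if any path occurs more than once across deps and outs."""
    counts = Counter([*dep_paths, *out_paths])
    for path, occurrences in counts.items():
        if occurrences > 1:
            raise ArgumentDuplicationError(path)


# A clean stage passes both checks.
check_circular_sketch(["data/train.csv"], ["model.pkl"])
check_duplicates_sketch(["data/train.csv"], ["model.pkl"])

# A path listed as both input and output is circular.
try:
    check_circular_sketch(["model.pkl"], ["model.pkl"])
    circular_caught = False
except CircularDependencyError:
    circular_caught = True

# The same dependency listed twice is a duplicate, but not circular.
try:
    check_duplicates_sketch(["data/train.csv", "data/train.csv"], ["model.pkl"])
    duplicate_caught = False
except ArgumentDuplicationError:
    duplicate_caught = True
```

Note that a path appearing in both deps and outs would also trip the duplicate check, so the order in which the two checks run determines which error the user sees.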
Serialization and computation:
- compute_md5() -- computes an MD5 hash for a stage, excluding metadata, annotations, and certain output fields for backward compatibility.
- get_dump() -- serializes a stage to a dictionary suitable for writing to dvc.yaml/dvc.lock.
- validate_kwargs() -- validates CLI-provided keyword arguments for stage creation.
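Excluding fields before hashing is what keeps a stage's checksum stable when only descriptive data changes. The sketch below shows the general technique with hashlib and json; it is not DVC's hashing routine, the field names `desc` and `persist` are illustrative assumptions, and `dict_md5_sketch` and `drop_falsy` are hypothetical helpers. The second helper mirrors the falsy-value filtering that get_dump() is described as applying.

```python
import hashlib
import json


def dict_md5_sketch(data, exclude=()):
    """Hash a nested dict after dropping every key listed in `exclude`."""

    def strip(value):
        if isinstance(value, dict):
            return {k: strip(v) for k, v in value.items() if k not in exclude}
        if isinstance(value, list):
            return [strip(v) for v in value]
        return value

    # sort_keys makes the digest independent of key insertion order.
    payload = json.dumps(strip(data), sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def drop_falsy(data):
    """Keep only entries whose value is truthy."""
    return {key: value for key, value in data.items() if value}


stage_a = {
    "cmd": "python train.py",
    "desc": "first draft",
    "outs": [{"path": "model.pkl", "persist": True}],
}
stage_b = {
    "cmd": "python train.py",
    "desc": "reworded description",
    "outs": [{"path": "model.pkl", "persist": False}],
}

excluded = ("desc", "persist")
same_digest = dict_md5_sketch(stage_a, excluded) == dict_md5_sketch(stage_b, excluded)

dumped = drop_falsy(
    {"cmd": "python train.py", "wdir": None, "deps": [], "frozen": False}
)
```

With the exclusions applied the two stages hash identically; without them the digests differ, which is the behavior change detection relies on.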
Other utilities:
- split_params_deps() -- splits stage dependencies into ParamsDependency and regular Dependency lists.
- is_valid_name() -- checks that a stage name contains no invalid characters.
- prepare_file_path() -- derives a .dvc file path from the first output name.
- check_stage_exists() -- checks for duplicate stage names or existing .dvc files.
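The first two helpers are small enough to sketch directly. In the illustration below, `Dependency` and `ParamsDependency` are bare local stand-ins for the DVC classes, and `INVALID_CHARS` is an illustrative character set chosen for the example, not DVC's actual definition.

```python
class Dependency:
    """Local stand-in for a regular DVC dependency."""

    def __init__(self, path):
        self.path = path


class ParamsDependency(Dependency):
    """Local stand-in for a parameters dependency."""


def split_params_deps_sketch(deps):
    """Partition `deps` into (params dependencies, regular dependencies)."""
    params = [dep for dep in deps if isinstance(dep, ParamsDependency)]
    regular = [dep for dep in deps if not isinstance(dep, ParamsDependency)]
    return params, regular


# Illustrative only: the real set of forbidden characters lives in DVC.
INVALID_CHARS = frozenset("\\/@:")


def is_valid_name_sketch(name):
    """A name is valid when it shares no character with INVALID_CHARS."""
    return not (INVALID_CHARS & set(name))


deps = [
    Dependency("data/train.csv"),
    ParamsDependency("params.yaml"),
    Dependency("src/train.py"),
]
params, regular = split_params_deps_sketch(deps)
```

The partition preserves the original order within each list, which keeps serialized output stable between runs.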
Signature
def check_stage_path(repo, path, is_wdir=False): ...
def fill_stage_outputs(stage, **kwargs): ...
def fill_stage_dependencies(
    stage, deps=None, erepo=None, params=None, fs_config=None, db=None
): ...
def check_circular_dependency(stage): ...
def check_duplicated_arguments(stage): ...
def check_missing_outputs(stage): ...
def check_no_externals(stage): ...
def compute_md5(stage): ...
def get_dump(stage: "Stage", **kwargs): ...
def validate_kwargs(
    single_stage: bool = False, fname: Optional[str] = None, **kwargs
) -> dict[str, Any]: ...
def split_params_deps(
    stage: "Stage",
) -> tuple[list["ParamsDependency"], list["Dependency"]]: ...
def is_valid_name(name: str) -> bool: ...
def prepare_file_path(kwargs) -> str: ...
def check_stage_exists(
    repo: "Repo", stage: Union["Stage", "PipelineStage"], path: str
): ...
def resolve_wdir(wdir, path): ...
def resolve_paths(fs, path, wdir=None): ...
Import
from dvc.stage.utils import (
    compute_md5,
    validate_kwargs,
    check_stage_path,
    fill_stage_outputs,
    fill_stage_dependencies,
    check_circular_dependency,
    check_duplicated_arguments,
    check_missing_outputs,
    get_dump,
)
Input/Output
| Function | Input | Output |
|---|---|---|
| check_stage_path() | repo -- Repo instance; path -- filesystem path to check; is_wdir -- whether the path is a working directory | None (raises StagePathNotFoundError, StagePathNotDirectoryError, or StagePathOutsideError) |
| fill_stage_outputs() | stage -- Stage object; **kwargs with keys like outs, metrics, plots, and their persist/no_cache variants | None (mutates stage.outs in place) |
| fill_stage_dependencies() | stage -- Stage object; deps, erepo, params, fs_config, db | None (mutates stage.deps in place) |
| compute_md5() | stage -- Stage object | str -- MD5 hex digest of the stage configuration |
| get_dump() | stage: Stage; **kwargs passed to dep.dumpd() and out.dumpd() | dict -- serialized stage data with falsy values filtered out |
| validate_kwargs() | single_stage: bool; fname: Optional[str]; **kwargs including cmd, name | dict[str, Any] -- validated and cleaned kwargs (raises InvalidArgumentError on validation failure) |
| split_params_deps() | stage: Stage | tuple[list[ParamsDependency], list[Dependency]] |
| check_stage_exists() | repo: Repo; stage: Stage or PipelineStage; path: str | None (raises StageFileAlreadyExistsError or DuplicateStageName) |
Example
from dvc.repo import Repo
from dvc.stage.utils import compute_md5, validate_kwargs, check_stage_path

# Assumes /path/to/project is the root of an initialized DVC project
repo = Repo("/path/to/project")

# Validate CLI arguments before creating a pipeline stage
kwargs = validate_kwargs(
    single_stage=False,
    fname=None,
    name="train",
    cmd="python train.py",
    deps=["data/train.csv"],
    outs=["model.pkl"],
)

# Verify the working directory is valid
check_stage_path(repo, "/path/to/project", is_wdir=True)

# Compute the MD5 of an existing stage for change detection.
# `stage` is assumed to be a Stage object that was loaded elsewhere.
md5 = compute_md5(stage)
print(f"Stage MD5: {md5}")
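To make the validation step in the example more concrete, the sketch below shows the kind of cross-argument checks such a function might perform. The specific rules (a command and a name are required for pipeline stages, a name cannot be combined with single-stage mode or with a custom file name) are assumptions made for illustration, and `validate_kwargs_sketch` and the local `InvalidArgumentError` are stand-ins rather than DVC code.

```python
class InvalidArgumentError(ValueError):
    """Local stand-in for the DVC exception of the same name."""


def validate_kwargs_sketch(single_stage=False, fname=None, **kwargs):
    """Check argument combinations and return the cleaned kwargs."""
    name = kwargs.get("name")
    if not kwargs.get("cmd") and not single_stage:
        raise InvalidArgumentError("command is not specified")
    if name and single_stage:
        raise InvalidArgumentError("a name cannot be used in single-stage mode")
    if name and fname:
        raise InvalidArgumentError("a name cannot be combined with a file name")
    if not name and not single_stage:
        raise InvalidArgumentError("a stage name is required")
    if single_stage:
        # Single-stage files carry no stage name.
        kwargs.pop("name", None)
    return kwargs


cleaned = validate_kwargs_sketch(
    name="train",
    cmd="python train.py",
    deps=["data/train.csv"],
    outs=["model.pkl"],
)

try:
    validate_kwargs_sketch(name="train")  # no command given
    missing_cmd_rejected = False
except InvalidArgumentError:
    missing_cmd_rejected = True
```

Returning the kwargs rather than a boolean lets the caller pass the cleaned dictionary straight on to stage creation.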