Implementation:Iterative Dvc Stage Utils

From Leeroopedia
Revision as of 15:20, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Iterative_Dvc_Stage_Utils.md)

Domains

Pipeline_Management, Validation

Overview

Concrete tool for validating, processing, and serializing DVC stage configurations. The module dvc/stage/utils.py provides a collection of utility functions used throughout the stage lifecycle, including path validation, output/dependency population, integrity checks, MD5 computation, and stage serialization.

Description

This module contains stateless utility functions that operate on DVC Stage and PipelineStage objects. The functions fall into several categories:

Path validation:

  • check_stage_path() -- validates that a stage path exists, is a directory, and is within the DVC project root.
  • resolve_wdir() -- computes the relative working directory for serialization.
  • resolve_paths() -- resolves absolute paths for a stage file and its working directory.
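
The three conditions that check_stage_path() enforces can be illustrated with a standard-library sketch. This is not DVC's implementation: check_path_sketch and StagePathError are hypothetical names, and a single exception class stands in for the three separate errors the real function raises.

```python
import os
import tempfile


class StagePathError(Exception):
    """Hypothetical stand-in for DVC's stage path exceptions."""


def check_path_sketch(root_dir: str, path: str) -> None:
    """Raise StagePathError unless path is an existing directory inside root_dir."""
    real_path = os.path.realpath(path)
    root = os.path.realpath(root_dir)

    if not os.path.exists(real_path):
        raise StagePathError(f"'{path}' does not exist")
    if not os.path.isdir(real_path):
        raise StagePathError(f"'{path}' is not a directory")
    # The common path equals the root only when real_path is the root
    # itself or lies somewhere below it.
    if os.path.commonpath([root, real_path]) != root:
        raise StagePathError(f"'{path}' is outside of the project root")


# Usage: the project root itself passes all three checks.
with tempfile.TemporaryDirectory() as project_root:
    check_path_sketch(project_root, project_root)
```

The checks run in order from cheapest to most specific, so a missing path is reported as missing rather than as being outside the project.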

Output and dependency population:

  • fill_stage_outputs() -- populates stage.outs from keyword arguments covering all output types (outs, metrics, plots, with persist/no_cache variants).
  • fill_stage_dependencies() -- populates stage.deps from deps, erepo, params, fs_config, and db arguments.
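
One way to picture how the persist/no_cache variants map onto output flags is to derive the flags from the keyword name. The sketch below is illustrative only: OutputSpec and collect_outputs are hypothetical names, and the exact key spellings (for example metrics_persist_no_cache) are assumed rather than taken from this page.

```python
from dataclasses import dataclass


@dataclass
class OutputSpec:
    """Hypothetical stand-in for a DVC output entry."""
    path: str
    use_cache: bool
    persist: bool
    metric: bool
    plot: bool


def collect_outputs(**kwargs) -> list[OutputSpec]:
    """Build output specs, deriving each flag from the keyword name."""
    outs = []
    for kind in ("outs", "metrics", "plots"):
        for suffix in ("", "_persist", "_no_cache", "_persist_no_cache"):
            key = kind + suffix  # assumed key spelling, e.g. "plots_no_cache"
            for path in kwargs.get(key, []):
                outs.append(
                    OutputSpec(
                        path=path,
                        use_cache="no_cache" not in key,
                        persist="persist" in key,
                        metric=kind == "metrics",
                        plot=kind == "plots",
                    )
                )
    return outs


specs = collect_outputs(outs=["model.pkl"], metrics_no_cache=["metrics.json"])
for spec in specs:
    print(spec)
```

Encoding the flags in the key name keeps the calling convention flat: a caller passes plain lists of paths and never constructs output objects directly.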

Integrity checks:

  • check_circular_dependency() -- raises CircularDependencyError if any dependency path also appears as an output path.
  • check_duplicated_arguments() -- raises ArgumentDuplicationError if any path appears more than once across deps and outs.
  • check_missing_outputs() -- raises MissingDataSource if any output does not exist on the filesystem.
  • check_no_externals() -- raises StageExternalOutputsError if any cached output is outside the DVC repo.
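
The first two checks reduce to simple operations on collections of paths: a set intersection for circular dependencies and an occurrence count for duplicates. The following is a minimal sketch over plain path strings, not DVC's code; the exception classes are local stand-ins that only share names with the DVC exceptions mentioned above.

```python
from collections import Counter


class CircularDependencyError(Exception):
    """Local stand-in for the DVC exception of the same name."""


class ArgumentDuplicationError(Exception):
    """Local stand-in for the DVC exception of the same name."""


def check_circular(deps: list[str], outs: list[str]) -> None:
    """Fail if any path is both a dependency and an output."""
    overlap = set(deps) & set(outs)
    if overlap:
        raise CircularDependencyError(sorted(overlap)[0])


def check_duplicates(deps: list[str], outs: list[str]) -> None:
    """Fail if any path appears more than once across deps and outs."""
    counts = Counter([*deps, *outs])
    for path, occurrences in counts.items():
        if occurrences > 1:
            raise ArgumentDuplicationError(path)


# A stage that reads data.csv and writes model.pkl passes both checks.
check_circular(["data.csv"], ["model.pkl"])
check_duplicates(["data.csv"], ["model.pkl"])
```

Note that a path listed as both a dependency and an output fails both checks in this sketch, so the order in which the checks run determines which error a user sees.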

Serialization and computation:

  • compute_md5() -- computes an MD5 hash for a stage, excluding metadata, annotations, and certain output fields for backward compatibility.
  • get_dump() -- serializes a stage to a dictionary, omitting entries whose values are falsy.
  • validate_kwargs() -- validates CLI-provided keyword arguments for stage creation.
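
The idea of hashing a stage while ignoring selected fields, and of dropping falsy values on serialization, can be sketched with the standard library. This is an assumption-laden illustration: dict_md5_sketch and drop_falsy are hypothetical helpers, the JSON-based hashing scheme is chosen for simplicity and will not reproduce DVC's digests, and exclusion is applied to top-level keys only.

```python
import hashlib
import json


def dict_md5_sketch(data: dict, exclude: tuple = ()) -> str:
    """Hash a dict deterministically, ignoring the excluded top-level keys."""
    filtered = {k: v for k, v in data.items() if k not in exclude}
    # sort_keys makes the digest independent of insertion order.
    payload = json.dumps(filtered, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def drop_falsy(data: dict) -> dict:
    """Keep only entries whose values are truthy."""
    return {k: v for k, v in data.items() if v}


stage_a = {"cmd": "python train.py", "desc": "first draft", "deps": []}
stage_b = {"cmd": "python train.py", "desc": "reworded", "deps": []}

# Changing only an excluded field leaves the digest unchanged.
same = dict_md5_sketch(stage_a, exclude=("desc",)) == dict_md5_sketch(
    stage_b, exclude=("desc",)
)
print(same)
print(drop_falsy(stage_a))
```

Excluding descriptive fields from the hash means that editing a description does not mark the stage as changed, which is the motivation the description above gives for the exclusions.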

Other utilities:

  • split_params_deps() -- splits stage dependencies into ParamsDependency and regular Dependency lists.
  • is_valid_name() -- checks that a stage name contains no invalid characters.
  • prepare_file_path() -- derives a .dvc file path from the first output name.
  • check_stage_exists() -- checks for duplicate stage names or existing .dvc files.
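
Two of these helpers are simple enough to sketch directly. The code below is illustrative rather than DVC's implementation: the set of invalid characters (all ASCII punctuation except "_" and "-") is an assumption, and Dep and ParamsDep are hypothetical stand-ins for Dependency and ParamsDependency.

```python
import string

# Assumption: every ASCII punctuation character except "_" and "-" is invalid.
INVALID_CHARS = set(string.punctuation) - {"_", "-"}


def is_valid_name_sketch(name: str) -> bool:
    """A name is valid when it shares no character with the invalid set."""
    return not (INVALID_CHARS & set(name))


class Dep:
    """Hypothetical stand-in for a regular dependency."""


class ParamsDep(Dep):
    """Hypothetical stand-in for a parameters dependency."""


def split_by_type(deps: list) -> tuple[list, list]:
    """Split dependencies into (params, regular), preserving order."""
    params = [d for d in deps if isinstance(d, ParamsDep)]
    regular = [d for d in deps if not isinstance(d, ParamsDep)]
    return params, regular


print(is_valid_name_sketch("train-model_1"))
print(is_valid_name_sketch("train@model"))
```

Because ParamsDep subclasses Dep in this sketch, the split has to test for the more specific type; testing for Dep would put every dependency in the same list.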

Signature

def check_stage_path(repo, path, is_wdir=False): ...

def fill_stage_outputs(stage, **kwargs): ...

def fill_stage_dependencies(
    stage, deps=None, erepo=None, params=None, fs_config=None, db=None
): ...

def check_circular_dependency(stage): ...

def check_duplicated_arguments(stage): ...

def check_missing_outputs(stage): ...

def check_no_externals(stage): ...

def compute_md5(stage): ...

def get_dump(stage: "Stage", **kwargs): ...

def validate_kwargs(
    single_stage: bool = False, fname: Optional[str] = None, **kwargs
) -> dict[str, Any]: ...

def split_params_deps(
    stage: "Stage",
) -> tuple[list["ParamsDependency"], list["Dependency"]]: ...

def is_valid_name(name: str) -> bool: ...

def prepare_file_path(kwargs) -> str: ...

def check_stage_exists(
    repo: "Repo", stage: Union["Stage", "PipelineStage"], path: str
): ...

def resolve_wdir(wdir, path): ...

def resolve_paths(fs, path, wdir=None): ...

Import

from dvc.stage.utils import (
    compute_md5,
    validate_kwargs,
    check_stage_path,
    fill_stage_outputs,
    fill_stage_dependencies,
    check_circular_dependency,
    check_duplicated_arguments,
    check_missing_outputs,
    get_dump,
)

Input/Output

| Function | Input | Output |
| --- | --- | --- |
| check_stage_path() | repo -- Repo instance; path -- filesystem path to check; is_wdir -- whether the path is a working directory | None (raises StagePathNotFoundError, StagePathNotDirectoryError, or StagePathOutsideError) |
| fill_stage_outputs() | stage -- Stage object; **kwargs with keys like outs, metrics, plots, and their persist/no_cache variants | None (mutates stage.outs in place) |
| fill_stage_dependencies() | stage -- Stage object; deps, erepo, params, fs_config, db | None (mutates stage.deps in place) |
| compute_md5() | stage -- Stage object | str -- MD5 hex digest of the stage configuration |
| get_dump() | stage: Stage; **kwargs passed to dep.dumpd() and out.dumpd() | dict -- serialized stage data with falsy values filtered out |
| validate_kwargs() | single_stage: bool; fname: Optional[str]; **kwargs including cmd, name | dict[str, Any] -- validated and cleaned kwargs (raises InvalidArgumentError on validation failure) |
| split_params_deps() | stage: Stage | tuple[list[ParamsDependency], list[Dependency]] |
| check_stage_exists() | repo: Repo; stage: Stage or PipelineStage; path: str | None (raises StageFileAlreadyExistsError or DuplicateStageName) |
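
The validate-then-return contract of validate_kwargs() can be sketched as follows. The specific rules here (a command and a name are required for pipeline stages, and a name cannot be combined with single_stage or fname) are assumptions made for illustration. This page only states that InvalidArgumentError is raised on validation failure. validate_kwargs_sketch is a hypothetical name, and the exception class is a local stand-in.

```python
class InvalidArgumentError(ValueError):
    """Local stand-in for the DVC exception of the same name."""


def validate_kwargs_sketch(single_stage=False, fname=None, **kwargs):
    """Check argument combinations, then return the cleaned kwargs."""
    cmd = kwargs.get("cmd")
    name = kwargs.get("name")

    # Assumed rules; the real function's checks may differ.
    if not cmd and not single_stage:
        raise InvalidArgumentError("command is not specified")
    if name and single_stage:
        raise InvalidArgumentError("a name cannot be used with single_stage")
    if name and fname:
        raise InvalidArgumentError("a name cannot be used with fname")
    if not name and not single_stage:
        raise InvalidArgumentError("a stage name is required")

    if single_stage:
        kwargs.pop("name", None)  # drop an empty name entry if present
    return kwargs


cleaned = validate_kwargs_sketch(name="train", cmd="python train.py")
print(cleaned)
```

Raising before any stage object is built keeps invalid argument combinations from leaving a partially constructed stage behind.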

Example

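from dvc.repo import Repo
from dvc.stage.utils import check_stage_path, compute_md5, validate_kwargs

# Open the DVC repository that contains the current directory
repo = Repo(".")

# Validate CLI arguments before creating a pipeline stage
kwargs = validate_kwargs(
    single_stage=False,
    fname=None,
    name="train",
    cmd="python train.py",
    deps=["data/train.csv"],
    outs=["model.pkl"],
)

# Verify the working directory is valid: the placeholder path must exist,
# be a directory, and lie inside the repository, otherwise this call raises
check_stage_path(repo, "/path/to/project", is_wdir=True)

# Compute the MD5 of an existing stage for change detection
# (stage is assumed to be a Stage object loaded elsewhere)
md5 = compute_md5(stage)
print(f"Stage MD5: {md5}")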
