Implementation:Iterative Dvc Find Targets And Create Stage


Knowledge Sources
Domains Data_Versioning, File_Management
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool for resolving user-specified file paths into DVC-trackable targets and obtaining or creating the corresponding stage definitions, provided by the DVC library.

Description

The find_targets and get_or_create_stage functions in DVC's dvc/repo/add.py module form the target resolution step of the dvc add workflow. find_targets accepts heterogeneous path inputs (a single str, bytes, or os.PathLike object, or an iterator of them) and normalizes them into a flat list of string paths, optionally expanding glob patterns. get_or_create_stage then takes one resolved path and either locates an existing stage that already tracks it (via repo.find_outs_by_path) or creates a new single-stage definition. The result is returned as a StageInfo named tuple containing the stage object and a boolean indicating whether the output was already tracked.
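The normalization step described above can be illustrated with a short standalone sketch. It is not DVC's code: normalize_targets and expand_globs are hypothetical names, and the glob handling uses only the standard library, so details such as ordering or the treatment of unmatched patterns may differ from what find_targets does.

```python
import glob as globlib
import os
from typing import Iterable, List, Union

# Hypothetical alias for "anything that can name a path".
PathInput = Union[str, bytes, "os.PathLike[str]", "os.PathLike[bytes]"]


def normalize_targets(
    targets: Union[PathInput, Iterable[PathInput]],
    expand_globs: bool = False,
) -> List[str]:
    """Illustrative stand-in for find_targets (assumption, not DVC's code)."""
    # A lone path must be detected first: str and bytes are themselves
    # iterable, so looping over them would yield single characters.
    if isinstance(targets, (str, bytes, os.PathLike)):
        paths = [os.fsdecode(targets)]
    else:
        paths = [os.fsdecode(target) for target in targets]

    if not expand_globs:
        return paths

    # Expand each pattern in input order; in this sketch a pattern
    # with no match simply contributes nothing to the result.
    expanded: List[str] = []
    for pattern in paths:
        expanded.extend(sorted(globlib.glob(pattern, recursive=True)))
    return expanded
```

Checking for a single path before iterating is the key design point: without it, a plain string such as "data/train.csv" would be split into one target per character.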

Together, these two functions implement an idempotent get-or-create (upsert) pattern for data tracking: repeated calls to dvc add on the same file reuse the existing stage rather than creating duplicates. If the target path overlaps with an output of a stage that is not a data source (that is, an output managed by a pipeline stage), a DvcException is raised to prevent unintended modifications.
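The get-or-create behavior can be sketched with a toy in-memory registry. Everything here is hypothetical (Registry, Entry, add_pipeline_output, and the use of ValueError in place of DvcException); the toy also records a new entry immediately, which is a simplification and says nothing about when DVC writes a stage file.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    # Same shape as StageInfo: the tracked object plus an "existed" flag.
    stage_file: str
    output_exists: bool


class Registry:
    """Hypothetical stand-in for a repository's tracked outputs."""

    def __init__(self) -> None:
        self._data_stages: Dict[str, str] = {}       # path -> stage file
        self._pipeline_outputs: Dict[str, str] = {}  # path -> pipeline stage

    def add_pipeline_output(self, path: str, stage_name: str) -> None:
        self._pipeline_outputs[path] = stage_name

    def get_or_create(self, path: str) -> Entry:
        # Refuse to touch outputs that a pipeline stage owns.
        if path in self._pipeline_outputs:
            owner = self._pipeline_outputs[path]
            raise ValueError(f"cannot update {path!r}: output of stage {owner!r}")
        # Reuse the existing entry if the path is already tracked.
        if path in self._data_stages:
            return Entry(self._data_stages[path], output_exists=True)
        # Otherwise create a new entry and report that it did not exist.
        stage_file = path + ".dvc"
        self._data_stages[path] = stage_file
        return Entry(stage_file, output_exists=False)
```

Calling get_or_create twice with the same path returns the same stage file both times, with output_exists flipping from False to True, which is the idempotence property the paragraph above describes.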

Usage

Import and use these functions when programmatically adding files to DVC tracking, or when building tooling that needs to resolve user-provided paths to their corresponding DVC stages. They are called internally by the add() function but can also be used independently for path resolution and stage lookup.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/repo/add.py
  • Lines: L30-75

Signature

class StageInfo(NamedTuple):
    stage: "Stage"
    output_exists: bool


def find_targets(
    targets: Union["StrOrBytesPath", Iterator["StrOrBytesPath"]],
    glob: bool = False,
) -> list[str]:
    ...


def get_or_create_stage(
    repo: "Repo",
    target: str,
    out: Optional[str] = None,
    to_remote: bool = False,
    force: bool = False,
) -> StageInfo:
    ...

Import

from dvc.repo.add import find_targets, get_or_create_stage, StageInfo

I/O Contract

Inputs

Name | Type | Required | Description
targets | Union[StrOrBytesPath, Iterator[StrOrBytesPath]] | Yes | A single file/directory path (as str, bytes, or os.PathLike) or an iterator of such paths to resolve into tracking targets.
glob | bool | No | When True, expands shell-style glob patterns (e.g., *.csv, data/**) in the targets list. Defaults to False.
repo | Repo | Yes | The DVC repository instance, used to search for existing outputs and create new stages. Required by get_or_create_stage only.
target | str | Yes | A single resolved target path string. Required by get_or_create_stage only.
out | Optional[str] | No | An explicit output name/path to use instead of deriving it from the target. When provided, the target is treated as a source and the output is resolved separately.
to_remote | bool | No | If True, indicates the data will be transferred directly to a remote rather than cached locally. Affects path resolution behavior. Defaults to False.
force | bool | No | If True, allows overwriting existing output definitions. Defaults to False.

Outputs

Name | Type | Description
(find_targets return) | list[str] | A flat list of resolved filesystem path strings, with glob patterns expanded if requested. Empty list if no targets match.
(get_or_create_stage return) | StageInfo | A named tuple with two fields: stage (the Stage object that tracks the target) and output_exists (True if an existing stage was found, False if a new stage was created).
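Because the get_or_create_stage return value is a named tuple, callers can read its fields by name or unpack it positionally. The sketch below uses a hypothetical stand-in, StageInfoSketch, whose stage field is a plain string rather than a real Stage object, and a hypothetical helper, split_by_status, to show one way the output_exists flag might be used.

```python
from typing import List, NamedTuple, Tuple


class StageInfoSketch(NamedTuple):
    stage: str           # placeholder: the real field holds a Stage object
    output_exists: bool


def split_by_status(
    infos: List[StageInfoSketch],
) -> Tuple[List[str], List[str]]:
    """Separate newly created stages from reused ones (hypothetical helper)."""
    created = [info.stage for info in infos if not info.output_exists]
    reused = [info.stage for info in infos if info.output_exists]
    return created, reused


infos = [
    StageInfoSketch("data/train.csv.dvc", output_exists=False),
    StageInfoSketch("data/test.csv.dvc", output_exists=True),
]

# Positional unpacking works because a NamedTuple is still a tuple.
stage, existed = infos[0]
created, reused = split_by_status(infos)
```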

Usage Examples

Basic Usage

from dvc.repo import Repo
from dvc.repo.add import find_targets, get_or_create_stage

# Initialize a DVC repository
repo = Repo()

# Resolve a single file target
targets = find_targets("data/train.csv")
# Returns: ["data/train.csv"]

# Resolve multiple targets with glob expansion
targets = find_targets(["data/*.csv", "models/*.pkl"], glob=True)
# Returns the matching paths, e.g. ["data/train.csv", "data/test.csv", "models/model.pkl"]

# Get or create a stage for a resolved target
stage_info = get_or_create_stage(repo, "data/train.csv")
print(stage_info.stage)          # Stage object for data/train.csv.dvc
print(stage_info.output_exists)  # False if newly created, True if already tracked

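# Once a stage for this path has been saved (for example, after dvc add
# has completed), looking up the same target is expected to reuse it
stage_info2 = get_or_create_stage(repo, "data/train.csv")
print(stage_info2.output_exists)  # True only if a saved stage already tracks the path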

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
