Implementation:Iterative Dvc Find Targets And Create Stage
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, File_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for resolving user-specified file paths into DVC-trackable targets and obtaining or creating the corresponding stage definitions, provided by the DVC library.
Description
The find_targets function and get_or_create_stage function in DVC's dvc/repo/add.py module form the target resolution subsystem of the dvc add workflow. find_targets accepts heterogeneous path inputs -- strings, bytes, os.PathLike objects, or iterators thereof -- and normalizes them into a flat list of string paths, optionally expanding glob patterns. get_or_create_stage then takes each resolved path and either locates an existing stage that already tracks it (via repo.find_outs_by_path) or creates a new single-stage definition. The result is returned as a StageInfo named tuple containing the stage object and a boolean indicating whether the output already existed.
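The normalization step described above can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not DVC's actual source: the name `normalize_targets` is hypothetical, and the sketch assumes `os.fsdecode` for decoding path-like values and `glob.iglob` for pattern expansion.

```python
import os
from collections.abc import Iterable
from glob import iglob
from typing import Union

# Assumed alias for illustration; DVC uses its own "StrOrBytesPath" type.
PathInput = Union[str, bytes, "os.PathLike[str]", "os.PathLike[bytes]"]


def normalize_targets(
    targets: Union[PathInput, Iterable[PathInput]],
    glob: bool = False,
) -> list[str]:
    """Hypothetical stand-in for find_targets: flatten inputs to str paths."""
    # A single path-like value is wrapped in a list; anything else is
    # treated as an iterable of path-like values.
    if isinstance(targets, (str, bytes, os.PathLike)):
        paths = [os.fsdecode(targets)]
    else:
        paths = [os.fsdecode(t) for t in targets]

    if not glob:
        return paths

    # Expand each pattern against the filesystem; "**" matches recursively.
    return [m for pattern in paths for m in iglob(pattern, recursive=True)]
```

The single-value check comes first because `str` and `bytes` are themselves iterable; without it, a lone path would be split into characters.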
Together, these two functions implement the idempotent upsert pattern for data tracking: repeated calls to dvc add on the same file reuse the existing stage rather than creating duplicates. If the target path overlaps with a pipeline-managed output (a non-data-source stage), a DvcException is raised to prevent unintended modifications.
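The get-or-create behavior can be illustrated with a toy in-memory registry. This is a sketch only: `ToyStage`, `ToyStageInfo`, `OverlapError`, and `get_or_create` are hypothetical names that stand in for DVC's `Stage`, `StageInfo`, `DvcException`, and `get_or_create_stage`, and the dictionary lookup stands in for `repo.find_outs_by_path`. The toy stores a new stage immediately, which is a simplification.

```python
from typing import NamedTuple


class ToyStage(NamedTuple):
    path: str
    is_data_source: bool  # False for outputs produced by a pipeline stage


class ToyStageInfo(NamedTuple):
    stage: ToyStage
    output_exists: bool


class OverlapError(Exception):
    """Raised when a target is already produced by a pipeline stage."""


def get_or_create(registry: dict[str, ToyStage], target: str) -> ToyStageInfo:
    existing = registry.get(target)
    if existing is not None:
        # Refuse to take over an output that a pipeline stage manages.
        if not existing.is_data_source:
            raise OverlapError(f"cannot update {target!r}: pipeline output")
        return ToyStageInfo(existing, output_exists=True)

    # No stage tracks this path yet, so create and record one.
    stage = ToyStage(path=target, is_data_source=True)
    registry[target] = stage
    return ToyStageInfo(stage, output_exists=False)


registry = {"models/model.pkl": ToyStage("models/model.pkl", is_data_source=False)}
first = get_or_create(registry, "data/train.csv")
second = get_or_create(registry, "data/train.csv")
print(first.output_exists, second.output_exists)  # False True
```

Returning the flag alongside the stage lets the caller distinguish a fresh stage from a reused one without a second lookup.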
Usage
Import and use these functions when programmatically adding files to DVC tracking, or when building tooling that needs to resolve user-provided paths to their corresponding DVC stages. They are called internally by the add() function but can also be used independently for path resolution and stage lookup.
Code Reference
Source Location
- Repository: DVC
- File: dvc/repo/add.py
- Lines: L30-75
Signature
class StageInfo(NamedTuple):
    stage: "Stage"
    output_exists: bool

def find_targets(
    targets: Union["StrOrBytesPath", Iterator["StrOrBytesPath"]],
    glob: bool = False,
) -> list[str]:
    ...

def get_or_create_stage(
    repo: "Repo",
    target: str,
    out: Optional[str] = None,
    to_remote: bool = False,
    force: bool = False,
) -> StageInfo:
    ...
Import
from dvc.repo.add import find_targets, get_or_create_stage, StageInfo
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| targets | Union[StrOrBytesPath, Iterator[StrOrBytesPath]] | Yes | A single file/directory path (as str, bytes, or os.PathLike) or an iterator of such paths to resolve into tracking targets. |
| glob | bool | No | When True, expands shell-style glob patterns (e.g., *.csv, data/**) in the targets list. Defaults to False. |
| repo | Repo | Yes | The DVC repository instance, used to search for existing outputs and create new stages. Required by get_or_create_stage only. |
| target | str | Yes | A single resolved target path string. Required by get_or_create_stage only. |
| out | Optional[str] | No | An explicit output name/path to use instead of deriving it from the target. When provided, the target is treated as a source and the output is resolved separately. |
| to_remote | bool | No | If True, indicates the data will be transferred directly to a remote rather than cached locally. Affects path resolution behavior. Defaults to False. |
| force | bool | No | If True, allows overwriting existing output definitions. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| (find_targets return) | list[str] | A flat list of resolved filesystem path strings, with glob patterns expanded if requested. Empty list if no targets match. |
| (get_or_create_stage return) | StageInfo | A named tuple with two fields: stage (the Stage object that tracks the target) and output_exists (True if an existing stage was found, False if a new stage was created). |
Usage Examples
Basic Usage
from dvc.repo import Repo
from dvc.repo.add import find_targets, get_or_create_stage
# Initialize a DVC repository
repo = Repo()
# Resolve a single file target
targets = find_targets("data/train.csv")
# Returns: ["data/train.csv"]
# Resolve multiple targets with glob expansion
targets = find_targets(["data/*.csv", "models/*.pkl"], glob=True)
# Returns the matching paths, e.g. ["data/train.csv", "data/test.csv", "models/model.pkl"],
# depending on which files exist on disk
# Get or create a stage for a resolved target
stage_info = get_or_create_stage(repo, "data/train.csv")
print(stage_info.stage) # Stage object for data/train.csv.dvc
print(stage_info.output_exists) # False if newly created, True if already tracked
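# Calling again for the same target reuses the existing stage, provided the
# stage from the first call has been saved to the repository in the meantime
stage_info2 = get_or_create_stage(repo, "data/train.csv")
print(stage_info2.output_exists) # True when an existing stage was found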