Principle:Iterative Dvc Target Resolution

From Leeroopedia


Knowledge Sources
Domains: Data_Versioning, File_Management
Last Updated: 2026-02-10 00:00 GMT

Overview

Target resolution is the process by which a data version control system translates user-specified file or directory paths into concrete, trackable entities within its managed workspace.

Description

When a user issues a command such as dvc add data/images, the system receives a raw path string that may refer to a single file, a directory, a glob pattern, or even a path that overlaps with an already-tracked output. Target resolution is the first critical step in the data tracking workflow: it must disambiguate these references and determine whether each resolved path corresponds to an existing tracked entity or requires a brand-new stage definition.

The resolution process involves several distinct concerns. First, the system must normalize the input -- accepting strings, bytes, or path-like objects and converting them into a uniform list of filesystem path strings. If glob expansion is enabled, wildcard patterns such as data/*.csv are expanded against the current working directory to produce a concrete list of matching paths. Second, for each resolved path, the system must consult the repository's existing tracking metadata (its index of outputs and stages) to determine whether the target is already under management. If an existing output is found, its parent stage is reused; otherwise, a new single-stage definition is created with the appropriate working directory and output specification.
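The input-normalization step can be illustrated with a short Python sketch. The function name normalize_targets and its expand_glob parameter are hypothetical, chosen for illustration only; this is not the system's actual API. It assumes that an unmatched pattern simply contributes no paths, whereas a real implementation might report an error instead.

```python
import glob
import os
from typing import List


def normalize_targets(targets, expand_glob: bool = False) -> List[str]:
    """Turn one target, or an iterable of targets, into a list of path strings.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # A single str, bytes, or path-like value is wrapped so the loop is uniform.
    if isinstance(targets, (str, bytes, os.PathLike)):
        targets = [targets]

    paths: List[str] = []
    for target in targets:
        # os.fsdecode accepts str, bytes, and os.PathLike and always returns str.
        path = os.fsdecode(target)
        if expand_glob:
            # recursive=True lets "**" match across directory levels.
            # Assumption: a pattern with no matches contributes nothing.
            paths.extend(sorted(glob.glob(path, recursive=True)))
        else:
            paths.append(path)
    return paths
```

Called as normalize_targets(b"data/x"), the sketch returns ["data/x"]; with expand_glob=True, a pattern such as data/*.csv is replaced by the sorted list of files that currently match it.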

This two-phase design -- find targets followed by get or create stage -- ensures that the system remains idempotent when re-adding files that are already tracked, while also providing clear error messages when a target overlaps with a pipeline-managed output that should not be directly modified.

Usage

Target resolution is invoked whenever a user adds data files to version control. It is the design pattern to apply when:

  • The system must accept heterogeneous path inputs (strings, bytes, iterators of paths, glob patterns) and normalize them into a consistent internal representation.
  • A decision must be made between reusing existing tracking metadata and creating new metadata, based on whether the target is already known to the repository index.
  • Error boundaries must be established early -- for example, rejecting attempts to re-add outputs that belong to pipeline stages, which should only be updated through pipeline execution or explicit commits.

Theoretical Basis

Target resolution draws on three foundational ideas:

Path normalization and glob expansion. Filesystem paths come in many representations across operating systems and programming languages. A robust target resolution system must canonicalize all inputs into a single form (typically POSIX-style strings relative to the repository root) before performing any lookups. Glob expansion follows the well-established rules of shell-style pattern matching, where * matches any sequence of non-separator characters and ** matches across directory boundaries.
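As a minimal sketch of canonicalization, the Python below converts any input into a POSIX-style string relative to the repository root and uses the standard glob module to show the difference between * and **. The helper names canonicalize and expand are hypothetical, and the sketch assumes that relative inputs are interpreted relative to the repository root and that paths outside the root are rejected.

```python
import glob
import os
from pathlib import Path
from typing import List


def canonicalize(path, repo_root) -> str:
    """Return `path` as a POSIX-style string relative to `repo_root`.

    Hypothetical helper; relative inputs are assumed to be relative to the root.
    """
    root = os.path.abspath(repo_root)
    # abspath also collapses "..", "." and duplicate separators.
    absolute = os.path.abspath(os.path.join(root, os.fsdecode(path)))
    relative = os.path.relpath(absolute, root)
    # Assumption: targets outside the repository are an error.
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        raise ValueError(f"{path!r} is outside the repository root")
    return Path(relative).as_posix()


def expand(pattern: str, repo_root) -> List[str]:
    """Expand a shell-style pattern under `repo_root` into canonical paths."""
    # Escape the root so special characters in it are not treated as wildcards.
    full_pattern = os.path.join(glob.escape(os.path.abspath(repo_root)), pattern)
    matches = glob.glob(full_pattern, recursive=True)
    return sorted(canonicalize(match, repo_root) for match in matches)
```

In a repository containing data/a.csv and data/sub/b.csv, expand("data/*.csv", root) yields only data/a.csv, while expand("data/**/*.csv", root) yields both files, because ** is allowed to match zero or more directory levels.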

Idempotent upsert semantics. The get-or-create pattern follows the database concept of an upsert -- if a record (stage) already exists for the given key (output path), return it; otherwise, insert a new one. In pseudocode:

function resolve_target(repo, target_path):
    normalized = normalize_path(target_path)
    existing_output = repo.index.find_output(normalized)
    if existing_output exists:
        if existing_output.stage is a pipeline stage:
            raise Error("cannot directly modify pipeline output")
        return existing_output.stage, output_exists=True
    else:
        new_stage = create_single_stage(repo, normalized)
        return new_stage, output_exists=False

This pattern guarantees that repeated invocations of the add command on the same path do not create duplicate stage definitions, maintaining a clean one-to-one mapping between output paths and their controlling stages.
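The idempotence property can be checked with a small runnable Python sketch built on an in-memory index. Every name here (Stage, Index, PipelineOutputError, get_or_create_stage) is hypothetical and stands in for whatever the real system uses. The sketch also assumes that a newly created stage is registered in the index immediately, which is what makes the second call find it.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class PipelineOutputError(Exception):
    """Raised when a target is already an output of a pipeline stage."""


@dataclass
class Stage:
    out_path: str
    is_pipeline: bool = False  # True for stages that belong to a pipeline


@dataclass
class Index:
    # Maps each canonical output path to the one stage that controls it.
    stages_by_out: Dict[str, Stage] = field(default_factory=dict)


def get_or_create_stage(index: Index, out_path: str) -> Tuple[Stage, bool]:
    """Return (stage, output_exists) for an already-normalized output path."""
    stage = index.stages_by_out.get(out_path)
    if stage is not None:
        # Pipeline outputs must be updated through the pipeline, not re-added.
        if stage.is_pipeline:
            raise PipelineOutputError(
                f"cannot directly modify pipeline output: {out_path}"
            )
        return stage, True
    # Not tracked yet: create a single-stage definition and register it
    # (registration at creation time is an assumption of this sketch).
    stage = Stage(out_path)
    index.stages_by_out[out_path] = stage
    return stage, False
```

Calling get_or_create_stage twice with the same path returns the same Stage object both times, first with output_exists set to False and then True, and the index still holds exactly one entry for that path.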

Trie-based path lookup. To efficiently determine whether a path is already tracked -- especially when outputs may be directories that contain the target, or the target may be a directory containing existing outputs -- the system can employ a trie (prefix tree) keyed on path components. This allows O(k) lookup, where k is the number of components in the path, regardless of the total number of tracked outputs.
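A minimal Python sketch of such a trie follows. The class PathTrie and its methods are hypothetical names used for illustration, and the sketch assumes canonical POSIX-style paths and an index from which outputs are never removed. Each lookup walks at most one node per path component, so its cost depends on the path's depth rather than on how many outputs are stored.

```python
from typing import Dict, List, Optional, Tuple


class PathTrie:
    """Prefix tree keyed on path components (hypothetical illustration)."""

    def __init__(self) -> None:
        self.children: Dict[str, "PathTrie"] = {}
        self.is_output = False  # True if a tracked output ends at this node

    @staticmethod
    def _parts(path: str) -> Tuple[str, ...]:
        # Assumes POSIX-style separators; empty and "." components are skipped.
        return tuple(p for p in path.split("/") if p and p != ".")

    def add(self, path: str) -> None:
        """Register `path` as a tracked output."""
        node = self
        for part in self._parts(path):
            node = node.children.setdefault(part, PathTrie())
        node.is_output = True

    def covering_output(self, path: str) -> Optional[str]:
        """Return the tracked output equal to `path` or containing it, if any."""
        node = self
        seen: List[str] = []
        for part in self._parts(path):
            child = node.children.get(part)
            if child is None:
                return None
            node = child
            seen.append(part)
            if node.is_output:
                return "/".join(seen)
        return None

    def has_outputs_under(self, path: str) -> bool:
        """True if tracked outputs exist strictly below `path`."""
        node = self
        for part in self._parts(path):
            child = node.children.get(part)
            if child is None:
                return False
            node = child
        # Nodes are only created by add(), so any child leads to an output.
        return bool(node.children)
```

After adding data/images as an output, covering_output("data/images/cat.png") returns data/images (the target lies inside a tracked directory), while has_outputs_under answers the reverse question of whether a target directory already contains tracked outputs.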

Related Pages

Implemented By
