
Principle:Iterative Dvc Transfer Target Collection

From Leeroopedia


Knowledge Sources
Domains Data_Synchronization, Version_Control
Last Updated 2026-02-10 00:00 GMT

Overview

Transfer target collection is the process of building filtered views of a repository's data index across one or more revisions to identify precisely which data objects require synchronization with a remote.

Description

Before any data transfer (push, fetch, or pull) can begin, the system must answer the question: "Which data files need to move, and from which revision?" In a version-controlled data management system, the answer is not always straightforward. A single repository may contain data tracked across multiple branches, tags, and commits. Users may want to synchronize data for just the current workspace, for all branches, or for a specific set of revision identifiers.

Transfer target collection solves this by iterating over the requested revisions (using a "brancher" that checks out each revision in turn), collecting the repository's data index at each revision, and then applying filters. Filters can exclude stages (e.g., repository imports during push), exclude outputs that do not match a specified remote, and limit by file size or type (metrics, plots, params). The result is a dictionary mapping each revision identifier to a filtered IndexView: a read-only projection of the full data index containing only the relevant outputs and their storage mappings.

This approach enables cross-revision data synchronization, where a single command can ensure that data from all branches is available on the remote. It also supports targeted synchronization, where only specific DVC-tracked files or directories are included, and type-based filtering, where only metrics or plots are transferred.

Usage

Use transfer target collection when:

  • Preparing for a push or fetch operation that must determine which objects to transfer.
  • Synchronizing data across multiple branches or tags in a single operation (all_branches, all_tags).
  • Targeting specific files or directories for transfer rather than the entire data index.
  • Filtering transfers to specific output types (metrics, params, plots) or size limits.
  • Building a pre-transfer manifest that can be inspected or logged before actual transfer begins.
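
The type- and size-based filtering mentioned above can be illustrated with a small predicate builder. This is a toy sketch, not DVC's API: the `Out` record and its field names (`path`, `size`, `kind`) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical output record; field names are illustrative, not DVC's API.
@dataclass
class Out:
    path: str
    size: int
    kind: str  # e.g. "metric", "plot", "param", or "data"

def make_outs_filter(types=None, max_size=None):
    """Build a predicate keeping only outputs that match the requested
    types and, optionally, a size ceiling."""
    def outs_filter(out):
        if types is not None and out.kind not in types:
            return False
        if max_size is not None and out.size > max_size:
            return False
        return True
    return outs_filter

outs = [
    Out("metrics.json", 120, "metric"),
    Out("model.pt", 500_000_000, "data"),
    Out("loss.csv", 2_048, "plot"),
]
keep_types = make_outs_filter(types={"metric", "plot"})
keep = [o.path for o in outs if keep_types(o)]
# keep == ["metrics.json", "loss.csv"]; the large model file is excluded
```

Composing several such predicates (one for types, one for size, one for remote membership) keeps each filter small and independently testable.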

Theoretical Basis

The collection algorithm follows a revision-iteration-with-filtering pattern:

COLLECT_INDEXES(repo, targets, remote, revisions, filters):
    indexes = {}

    for rev in brancher(repo, revisions):
        # Build a filtered index view for this revision
        idx = index_from_targets(repo, targets, filters)

        # Attach a per-revision error handler so a failure in this
        # revision's collection is reported without aborting the rest
        idx.data["repo"].onerror = error_handler(rev)

        indexes[rev] = idx

    return indexes
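
The loop above can be sketched in runnable Python. Everything here is a simplified stand-in: the brancher is modeled as a plain generator over revision names (a real one would also check each revision out), and each per-revision index is just a dict rather than an Index object.

```python
# Toy repository: revision -> {output path: size in bytes}.
REPO = {
    "main": {"data/train.csv": 10_000, "model.pt": 9_000_000},
    "v1.0": {"data/train.csv": 8_000},
}

def brancher(repo, revisions):
    # A real brancher would check each revision out in turn;
    # here we simply yield the requested revision names.
    for rev in revisions:
        yield rev

def collect_indexes(repo, revisions, filters):
    """Map each revision to its filtered view of the data index."""
    indexes = {}
    for rev in brancher(repo, revisions):
        # Stand-in for index_from_targets: keep entries passing all filters.
        indexes[rev] = {path: size for path, size in repo[rev].items()
                        if all(f(path, size) for f in filters)}
    return indexes

small_only = lambda path, size: size < 1_000_000
views = collect_indexes(REPO, ["main", "v1.0"], [small_only])
# views == {"main": {"data/train.csv": 10000}, "v1.0": {"data/train.csv": 8000}}
```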

The index_from_targets function itself performs a two-phase operation:

INDEX_FROM_TARGETS(repo, targets, filters):
    # Phase 1: Collect stages matching targets
    if targets are specific DVC files:
        index = Index.from_file(repo, each_target)
    else:
        index = repo.index  # full repository index

    # Phase 2: Apply filters to produce a read-only view
    view = index.targets_view(
        targets,
        stage_filter,    # e.g., exclude repo imports during push
        outs_filter,     # e.g., exclude non-pushable outputs
        max_size,        # optional size ceiling
        types            # optional type restriction
    )
    return view
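
The two phases can also be sketched concretely. In this hedged toy version, the repository index and per-file indexes are plain dicts, and `index_from_targets` is an illustrative function, not DVC's implementation.

```python
def index_from_targets(repo_index, file_indexes, targets,
                       stage_filter, outs_filter):
    # Phase 1: choose the stage source.
    if targets and all(t.endswith(".dvc") for t in targets):
        # Merge the stand-alone index of each targeted .dvc file
        # (analogue of Index.from_file per target).
        index = {}
        for t in targets:
            index.update(file_indexes[t])
    else:
        index = dict(repo_index)  # full repository index
    # Phase 2: apply filters to produce the (read-only) view.
    return {path: out for path, out in index.items()
            if stage_filter(path) and outs_filter(out)}

repo_index = {"model.pt": {"pushable": False},
              "metrics.json": {"pushable": True}}
file_indexes = {"model.pt.dvc": {"model.pt": {"pushable": False}}}
view = index_from_targets(repo_index, file_indexes, targets=[],
                          stage_filter=lambda p: True,
                          outs_filter=lambda o: o["pushable"])
# view == {"metrics.json": {"pushable": True}}
```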

The resulting IndexView objects are lightweight projections. They do not copy data; instead, they filter iteration over the parent index's data structures. Each view carries a storage_map that associates data keys with their cache and remote storage locations, enabling downstream transfer code to resolve both the source and destination of every object.
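
A minimal sketch of such a projection, assuming the parent index is a plain dict (the class and attribute names here are illustrative, not DVC's IndexView):

```python
class IndexView:
    """Read-only, filtered projection over a parent index.
    Holds no copies: iteration re-filters the parent on demand."""

    def __init__(self, parent, predicate, storage_map):
        self._parent = parent           # {key: metadata}
        self._predicate = predicate     # key -> bool
        self.storage_map = storage_map  # key -> (cache path, remote URL)

    def __iter__(self):
        # Lazily yield only the keys that pass the filter.
        return (k for k in self._parent if self._predicate(k))

parent = {"data/a.bin": {"size": 10}, "metrics.json": {"size": 1}}
storage = {k: (f".dvc/cache/{k}", f"s3://bucket/{k}") for k in parent}
view = IndexView(parent, lambda k: k.endswith(".json"), storage)
# Iterating the view yields only "metrics.json"; storage_map still
# resolves its cache and remote locations for the transfer code.
```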

Key design properties:

  • Lazy evaluation: Index views are computed on demand, avoiding expensive full-index materialization when only a subset of data is needed.
  • Composability: Multiple views from different revisions can be combined into a single transfer operation.
  • Isolation: Each revision's view is independent; a failure in one revision's collection does not prevent others from succeeding (with appropriate error handling).
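
The isolation property can be made concrete with per-revision error handling: a failing revision is recorded and skipped, rather than aborting the whole collection. This is a hedged sketch over toy data, with `KeyError` standing in for any per-revision collection failure.

```python
repo_data = {"main": {"data.csv": 1}, "dev": {"data.csv": 2}}

def collect_with_isolation(repo, revisions):
    """Each revision is processed independently; one failure does
    not prevent the others from producing a view."""
    indexes, errors = {}, {}
    for rev in revisions:
        try:
            indexes[rev] = dict(repo[rev])
        except KeyError:
            errors[rev] = "revision not found"  # record and continue
    return indexes, errors

ok, bad = collect_with_isolation(repo_data, ["main", "missing", "dev"])
# ok has views for "main" and "dev"; bad records the "missing" failure
```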
