Principle: Iterative DVC Transfer Target Collection
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Synchronization, Version_Control |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Transfer target collection is the process of building filtered views of a repository's data index across one or more revisions to identify precisely which data objects require synchronization with a remote.
Description
Before any data transfer (push, fetch, or pull) can begin, the system must answer the question: "Which data files need to move, and from which revision?" In a version-controlled data management system, the answer is not always straightforward. A single repository may contain data tracked across multiple branches, tags, and commits. Users may want to synchronize data for just the current workspace, for all branches, or for a specific set of revision identifiers.
Transfer target collection solves this by iterating over the requested revisions (using a "brancher" that checks out each revision in turn), collecting the repository's data index at each revision, and then applying filters. Filters can exclude stages (e.g., repository imports during push), exclude outputs that do not match a specified remote, and limit by file size or type (metrics, plots, params). The result is a dictionary mapping each revision identifier to a filtered IndexView -- a read-only projection of the full data index containing only the relevant outputs and their storage mappings.
This approach enables cross-revision data synchronization, where a single command can ensure that data from all branches is available on the remote. It also supports targeted synchronization, where only specific DVC-tracked files or directories are included, and type-based filtering, where only metrics or plots are transferred.
Usage
Use transfer target collection when:
- Preparing for a push or fetch operation that must determine which objects to transfer.
- Synchronizing data across multiple branches or tags in a single operation (all_branches, all_tags).
- Targeting specific files or directories for transfer rather than the entire data index.
- Filtering transfers to specific output types (metrics, params, plots) or size limits.
- Building a pre-transfer manifest that can be inspected or logged before actual transfer begins.
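As a rough illustration of the kind of output filtering these options imply, here is a minimal, self-contained sketch. The `Output` dataclass and `make_outs_filter` helper are invented for illustration and are not DVC's API; they simply combine a type restriction, a size ceiling, and a pushability check into one predicate:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class Output:
    path: str
    out_type: str          # e.g. "data", "metrics", "plots"
    size: int              # bytes
    pushable: bool = True  # whether this output may go to the remote

def make_outs_filter(
    types: Optional[Set[str]] = None,
    max_size: Optional[int] = None,
) -> Callable[[Output], bool]:
    """Build a predicate combining type and size restrictions."""
    def predicate(out: Output) -> bool:
        if not out.pushable:
            return False
        if types is not None and out.out_type not in types:
            return False
        if max_size is not None and out.size > max_size:
            return False
        return True
    return predicate

outs = [
    Output("model.pkl", "data", 50_000_000),
    Output("metrics.json", "metrics", 1_024),
    Output("plot.csv", "plots", 2_048, pushable=False),
]

# Restrict the transfer to metric outputs only
metrics_only = [o.path for o in outs if make_outs_filter(types={"metrics"})(o)]
```

Composing independent checks into a single predicate keeps the filtering logic decoupled from how the index is iterated.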
Theoretical Basis
The collection algorithm follows a revision-iteration-with-filtering pattern:
```
COLLECT_INDEXES(repo, targets, remote, revisions, filters):
    indexes = {}
    for rev in brancher(repo, revisions):
        # Build a filtered index view for this revision
        idx = index_from_targets(repo, targets, filters)
        # Attach a per-revision error handler so a failure in one
        # revision does not abort collection for the others
        idx.data["repo"].onerror = error_handler(rev)
        indexes[rev] = idx
    return indexes
```
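The revision-iteration pattern above can be sketched in plain Python. This is a simplified, in-memory model, not DVC's implementation: the `brancher` here just yields revision names from a dictionary, whereas a real brancher checks out each revision in turn:

```python
from typing import Callable, Dict, Iterable, List

def brancher(repo: Dict[str, List[str]], revisions: Iterable[str]):
    """Yield each requested revision (a stand-in for checking it out)."""
    for rev in revisions:
        if rev in repo:
            yield rev

def collect_indexes(
    repo: Dict[str, List[str]],
    revisions: Iterable[str],
    outs_filter: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Map each revision identifier to its filtered view of outputs."""
    indexes = {}
    for rev in brancher(repo, revisions):
        # Per-revision filtering: each view is built independently
        indexes[rev] = [out for out in repo[rev] if outs_filter(out)]
    return indexes

repo = {
    "main": ["data.csv", "model.pkl", "metrics.json"],
    "exp-1": ["data.csv", "model-v2.pkl"],
}
result = collect_indexes(repo, ["main", "exp-1"], lambda o: o.endswith(".pkl"))
```

The returned dictionary mirrors the `{revision: filtered view}` shape described above.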
The index_from_targets function itself performs a two-phase operation:
```
INDEX_FROM_TARGETS(repo, targets, filters):
    # Phase 1: Collect stages matching targets
    if targets are specific DVC files:
        index = Index.from_file(repo, each_target)
    else:
        index = repo.index  # full repository index
    # Phase 2: Apply filters to produce a read-only view
    view = index.targets_view(
        targets,
        stage_filter,  # e.g., exclude repo imports during push
        outs_filter,   # e.g., exclude non-pushable outputs
        max_size,      # optional size ceiling
        types,         # optional type restriction
    )
    return view
```
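A toy version of the two-phase flow, under invented names (these classes and dictionary shapes are illustrative only, not DVC's actual `Index` API):

```python
from typing import Callable, Dict, List, Optional

class Index:
    """A flat index mapping each output path to the stage producing it."""
    def __init__(self, entries: Dict[str, str]):
        self.entries = entries  # {out_path: stage_name}

    def targets_view(self, stage_filter=None, outs_filter=None):
        """Phase 2: apply filters, producing a restricted mapping."""
        return {
            out: stage
            for out, stage in self.entries.items()
            if (stage_filter is None or stage_filter(stage))
            and (outs_filter is None or outs_filter(out))
        }

def index_from_targets(
    repo: dict,
    targets: Optional[List[str]] = None,
    outs_filter: Optional[Callable[[str], bool]] = None,
):
    # Phase 1: build per-target indexes if specific DVC files are
    # named, otherwise start from the full repository index
    if targets:
        entries = {}
        for t in targets:
            entries.update({out: t for out in repo["files"][t]})
        index = Index(entries)
    else:
        index = Index(repo["index"])
    # Phase 2: filter down to a read-only view
    return index.targets_view(outs_filter=outs_filter)

repo = {
    "files": {"train.dvc": ["model.pkl"], "eval.dvc": ["metrics.json"]},
    "index": {"model.pkl": "train.dvc", "metrics.json": "eval.dvc"},
}
view = index_from_targets(repo, targets=["train.dvc"])
full = index_from_targets(repo, outs_filter=lambda o: o.endswith(".json"))
```

The two entry points converge on the same filtering step, which is what lets targeted and whole-repository collection share one code path.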
The resulting IndexView objects are lightweight projections. They do not copy data; instead, they filter iteration over the parent index's data structures. Each view carries a storage_map that associates data keys with their cache and remote storage locations, enabling downstream transfer code to resolve both the source and destination of every object.
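The "filter iteration rather than copy data" idea can be modeled with a generator-backed view plus a storage map. The names below are illustrative stand-ins, not DVC's classes:

```python
from typing import Callable, Dict, Tuple

class IndexView:
    """Read-only projection over a parent index; filters on iteration."""
    def __init__(
        self,
        parent: Dict[str, str],                       # {key: obj_hash}
        keep: Callable[[str], bool],                  # membership predicate
        storage_map: Dict[str, Tuple[str, str]],      # key -> (cache, remote)
    ):
        self._parent = parent
        self._keep = keep
        self.storage_map = storage_map

    def __iter__(self):
        # No data is copied: filtering happens lazily on each iteration,
        # so the view stays cheap even over a large parent index
        return (k for k in self._parent if self._keep(k))

parent = {"data/a.csv": "ab12", "data/b.csv": "cd34", "plots/p.csv": "ef56"}
storage = {k: (f".cache/{h}", f"s3://bucket/{h}") for k, h in parent.items()}

view = IndexView(parent, lambda k: k.startswith("data/"), storage)
keys = sorted(view)
```

Transfer code can then iterate the view for membership and consult `storage_map` to resolve each object's source (cache) and destination (remote).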
Key design properties:
- Lazy evaluation: Index views are computed on demand, avoiding expensive full-index materialization when only a subset of data is needed.
- Composability: Multiple views from different revisions can be combined into a single transfer operation.
- Isolation: Each revision's view is independent; a failure in one revision's collection does not prevent others from succeeding (with appropriate error handling).
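The isolation property amounts to per-revision error handling: a failure while collecting one revision is recorded and the loop continues. A minimal sketch, with made-up names:

```python
from typing import Callable, Dict, Iterable, List, Tuple

def collect_with_isolation(
    revisions: Iterable[str],
    collect_one: Callable[[str], List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Collect per-revision views; record failures without aborting."""
    indexes: Dict[str, List[str]] = {}
    failed: Dict[str, str] = {}
    for rev in revisions:
        try:
            indexes[rev] = collect_one(rev)
        except Exception as exc:  # isolate each revision's failure
            failed[rev] = str(exc)
    return indexes, failed

def collect_one(rev: str) -> List[str]:
    # Hypothetical collector that fails for one malformed revision
    if rev == "broken-branch":
        raise ValueError("missing .dvc file")
    return [f"{rev}/data.csv"]

ok, failed = collect_with_isolation(
    ["main", "broken-branch", "v1.0"], collect_one
)
```

The caller can then decide whether partial results are acceptable or whether any recorded failure should abort the transfer.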