Principle: Iterative DVC Data Transfer Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, Distributed_Storage |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Data transfer execution is the process of performing parallel, fault-tolerant file transfers between a local content-addressed cache and remote storage backends, with real-time progress tracking and structured error reporting.
Description
Once the system has determined which objects need to move (via target collection and state comparison), the actual transfer must execute efficiently across potentially thousands of files and terabytes of data. Data transfer execution encompasses the full lifecycle of moving data between local and remote stores: collecting transferable entries from data indexes, dispatching parallel workers to perform the actual file operations, tracking progress for user feedback, and reporting both successes and failures in a structured manner.
The transfer operates in two distinct modes depending on the remote type. For traditional object-store remotes, transfers move content-addressed blobs identified by their hash. The local cache stores files under hash-derived paths, and the remote object database mirrors this structure. For worktree remotes (cloud-native version-aware storage), transfers operate on data index entries that carry version IDs and filesystem metadata, enabling direct synchronization of file trees rather than individual hash objects.
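The hash-derived path scheme for object-store remotes can be sketched as follows. This is a minimal illustration, not DVC's implementation: the `files/md5` prefix and two-character sharding follow the layout used by recent DVC caches, but treat the exact scheme as illustrative.

```python
import hashlib


def cache_path(cache_dir: str, content: bytes) -> str:
    """Derive a content-addressed cache path for a blob.

    The cache shards objects by the first two hex digits of the
    hash so that no single directory accumulates millions of files.
    """
    digest = hashlib.md5(content).hexdigest()
    return f"{cache_dir}/files/md5/{digest[:2]}/{digest[2:]}"
```

Because the remote object database mirrors this structure, deciding whether a blob needs to transfer reduces to checking for the existence of a single, deterministic path on the other side.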
Both modes support partial failure recovery. If some files fail to transfer (due to network errors, permission issues, or missing source files), the operation completes what it can and then raises a structured error (UploadError or DownloadError) that reports the exact count of failures. This allows CI pipelines to distinguish between total failures and partial successes, and lets users retry only the failed subset.
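The partial-failure contract can be sketched as below. The exception shape is modeled on the behavior described above (an error carrying the failure count); `push_all` and its arguments are hypothetical names for illustration.

```python
class UploadError(Exception):
    """Raised after a push in which some transfers failed.

    Carries the failure count so callers (e.g. CI pipelines) can
    distinguish partial success from total failure.
    """

    def __init__(self, amount: int):
        self.amount = amount
        super().__init__(f"{amount} files failed to upload")


def push_all(files, upload):
    """Attempt every transfer, then report failures in aggregate."""
    transferred, failed = 0, 0
    for f in files:
        try:
            upload(f)
            transferred += 1
        except OSError:
            failed += 1  # keep going; other files may still succeed
    if failed:
        raise UploadError(failed)
    return transferred
```

The key design point is that one failing file never aborts the batch: every transfer is attempted before the structured error is raised.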
Progress tracking uses a callback-based system where the transfer engine reports each completed file to a progress bar. This is essential for large transfers where users need visibility into the operation's progress and remaining time.
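A minimal sketch of that callback contract, with hypothetical names: the engine calls `callback(1)` after each completed file, and the caller decides what to do with the tick (advance a progress bar, log, etc.).

```python
from typing import Callable, Iterable


def transfer_files(files: Iterable[str], copy, callback: Callable[[int], None]) -> int:
    """Copy each file, invoking callback(1) after every completed
    transfer so a progress display can advance in real time."""
    done = 0
    for f in files:
        copy(f)
        done += 1
        callback(1)  # one more file finished
    return done


# A trivial consumer that just records ticks; a real one would
# update a progress bar with total count and ETA.
ticks = []
transfer_files(["a", "b"], lambda f: None, ticks.append)
```

Decoupling progress reporting from the transfer loop keeps the I/O code free of any display logic.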
Usage
Use data transfer execution when:
- Pushing locally cached data to a remote storage backend for sharing or backup.
- Fetching remotely stored data into the local cache for consumption by the workspace.
- Running automated CI/CD pipelines that must synchronize data before or after pipeline stages.
- Performing bulk data migration between storage backends.
- Building custom sync tools that need reliable, resumable, parallel file transfer.
Theoretical Basis
The transfer execution follows a collect-then-dispatch architecture:
```
PUSH(repo, targets, remote, jobs):
    # Phase 1: Collect what needs to transfer
    indexes = collect_indexes(repo, targets, remote, push=True)
    data = collect(indexes, direction="remote", cache_index=repo.data_index)

    # Phase 2: Dispatch parallel transfers
    transferred, failed = ipush(data, jobs=jobs, callback=progress_bar)

    # Phase 3: Post-transfer bookkeeping
    if transferred:
        update_meta(indexes)   # merge remote version IDs back
        drop_data_index()      # invalidate cached index

    # Phase 4: Error reporting
    if failed:
        raise UploadError(failed)
    return transferred
```
```
FETCH(repo, targets, remote, jobs):
    # Phase 1: Collect what needs to transfer
    indexes = collect_indexes(repo, targets, remote)
    data = collect(indexes, direction="remote", cache_index=repo.data_index)

    # Phase 2: Filter unversioned entries (worktree remotes)
    data = log_unversioned(data)

    # Phase 3: Dispatch parallel transfers
    transferred, failed = ifetch(data, jobs=jobs, callback=progress_bar)

    # Phase 4: Cache invalidation
    if transferred:
        drop_data_index()

    # Phase 5: Error reporting
    if failed:
        raise DownloadError(failed)
    return transferred
```
The collect step (from dvc_data.index.fetch/push) transforms IndexView objects into flat lists of DataIndex entries annotated with source and destination storage information. It also consults a cache_index to skip entries that are already known to be present at the destination, using a tokenized cache key derived from the data tree hashes to detect when the cache is stale.
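The shape of that transformation can be sketched as below. The `TransferEntry` dataclass and `collect` signature are hypothetical simplifications, not the real `dvc_data.index` API: they show only the flattening and the cache-index skip.

```python
from dataclasses import dataclass


@dataclass
class TransferEntry:
    key: str  # content hash (or path, for worktree remotes)
    src: str  # source storage name
    dst: str  # destination storage name


def collect(index_keys, src, dst, cache_index):
    """Flatten index keys into transfer entries annotated with
    source and destination, skipping keys the cache index already
    records as present at the destination."""
    return [
        TransferEntry(k, src, dst)
        for k in index_keys
        if k not in cache_index
    ]
```

In the real system the cache index is itself guarded by a tokenized key derived from the data tree hashes, so a stale cache is detected and ignored rather than causing transfers to be skipped incorrectly.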
The ipush/ifetch functions (from dvc_data.index.push/fetch) perform the actual parallel I/O. They use thread pools sized by the jobs parameter, with each worker handling one file at a time. The callback mechanism allows the caller to update a progress bar after each completed transfer.
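A minimal sketch of that dispatch pattern using a standard-library thread pool; `ipush_sketch` and `transfer_one` are hypothetical names, and the real implementation differs in detail.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ipush_sketch(entries, transfer_one, jobs=4, callback=None):
    """Dispatch one transfer per entry across a thread pool sized
    by `jobs`, counting successes and failures and invoking the
    callback after each completed file."""
    transferred, failed = 0, 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(transfer_one, e) for e in entries]
        for fut in as_completed(futures):
            if fut.exception() is None:
                transferred += 1
            else:
                failed += 1  # recorded, not raised: the batch continues
            if callback:
                callback(1)
    return transferred, failed
```

Threads suit this workload because each worker spends most of its time blocked on network or disk I/O, so the GIL is not a bottleneck.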
Key design properties:
- Atomicity per file: Each file transfer is atomic -- it either succeeds completely or is counted as failed. No partial files are left on the remote.
- Idempotency: Re-running a push or fetch with the same inputs produces the same result. Already-transferred files are detected via the cache index and skipped.
- Separation of concerns: The DVC layer handles index collection, error reporting, and metadata updates; the dvc-data layer handles the actual I/O.
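The per-file atomicity property is commonly achieved with a write-to-temporary-then-rename pattern; the sketch below shows that general technique, not DVC's actual code.

```python
import os
import tempfile


def atomic_write(dst_path: str, data: bytes) -> None:
    """Write to a temporary file in the destination directory,
    then rename it into place. Readers never observe a partial
    file: the rename either happens fully or not at all."""
    directory = os.path.dirname(dst_path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before rename
        os.replace(tmp, dst_path)  # atomic within a single filesystem
    except BaseException:
        os.unlink(tmp)  # a failed transfer leaves no partial file behind
        raise
```

The temporary file must live on the same filesystem as the destination, since a cross-filesystem rename degrades to a non-atomic copy.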