
Principle:Iterative Dvc Data Transfer Execution

From Leeroopedia


Knowledge Sources
Domains Data_Synchronization, Distributed_Storage
Last Updated 2026-02-10 00:00 GMT

Overview

Data transfer execution is the process of performing parallel, fault-tolerant file transfers between a local content-addressed cache and remote storage backends, with real-time progress tracking and structured error reporting.

Description

Once the system has determined which objects need to move (via target collection and state comparison), the actual transfer must execute efficiently across potentially thousands of files and terabytes of data. Data transfer execution encompasses the full lifecycle of moving data between local and remote stores: collecting transferable entries from data indexes, dispatching parallel workers to perform the actual file operations, tracking progress for user feedback, and reporting both successes and failures in a structured manner.

The transfer operates in two distinct modes depending on the remote type. For traditional object-store remotes, transfers move content-addressed blobs identified by their hash. The local cache stores files under hash-derived paths, and the remote object database mirrors this structure. For worktree remotes (cloud-native version-aware storage), transfers operate on data index entries that carry version IDs and filesystem metadata, enabling direct synchronization of file trees rather than individual hash objects.
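The object-store mode's hash-derived layout can be sketched as follows. The `files/md5` two-character sharding mirrors DVC 3.x's cache directory structure; the helper functions themselves are illustrative, not DVC's actual API:

```python
import hashlib
from pathlib import Path

def hash_file(path):
    """Content hash that identifies a blob independently of its filename."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def cache_path(cache_dir, md5):
    """Shard by the first two hex characters to keep directories small."""
    return Path(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]
```

Because both the local cache and the remote object database use this same hash-derived structure, transferring a blob reduces to copying one file between two deterministic paths.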

Both modes support partial failure recovery. If some files fail to transfer (due to network errors, permission issues, or missing source files), the operation completes what it can and then raises a structured error (UploadError or DownloadError) that reports the exact count of failures. This allows CI pipelines to distinguish between total failures and partial successes, and lets users retry only the failed subset.
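A minimal sketch of this partial-failure pattern. The exception name mirrors DVC's `UploadError`, but the transfer loop and `upload_one` hook are hypothetical simplifications:

```python
class UploadError(Exception):
    """Raised after a push completes, reporting how many files failed."""
    def __init__(self, amount):
        self.amount = amount
        super().__init__(f"{amount} files failed to upload")


def push_all(files, upload_one):
    """Attempt every transfer; report failures in aggregate at the end."""
    transferred = failed = 0
    for path in files:
        try:
            upload_one(path)
            transferred += 1
        except OSError:
            failed += 1  # keep going: partial success is still useful
    if failed:
        raise UploadError(failed)
    return transferred
```

A CI pipeline can catch the exception, inspect `amount`, and decide whether to retry or fail the job, rather than treating any single network hiccup as a total failure.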

Progress tracking uses a callback-based system where the transfer engine reports each completed file to a progress bar. This is essential for large transfers where users need visibility into the operation's progress and remaining time.
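A minimal sketch of that callback contract. The `relative_update` method name follows fsspec's `Callback` interface, which dvc-data builds on; everything else here is illustrative:

```python
class ProgressCallback:
    """Counts completed files; a real implementation would redraw a progress bar."""
    def __init__(self, total):
        self.total = total
        self.done = 0

    def relative_update(self, n=1):
        self.done += n
        print(f"{self.done}/{self.total} files transferred")


def copy_with_progress(files, copy_one, callback):
    """Invoke the callback exactly once per completed file."""
    for path in files:
        copy_one(path)
        callback.relative_update()
    return callback.done
```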

Usage

Use data transfer execution when:

  • Pushing locally cached data to a remote storage backend for sharing or backup.
  • Fetching remotely stored data into the local cache for consumption by the workspace.
  • Running automated CI/CD pipelines that must synchronize data before or after pipeline stages.
  • Performing bulk data migration between storage backends.
  • Building custom sync tools that need reliable, resumable, parallel file transfer.

Theoretical Basis

The transfer execution follows a collect-then-dispatch architecture:

PUSH(repo, targets, remote, jobs):
    # Phase 1: Collect what needs to transfer
    indexes = collect_indexes(repo, targets, remote, push=True)
    data = collect(indexes, direction="remote", cache_index=repo.data_index)

    # Phase 2: Dispatch parallel transfers
    transferred, failed = ipush(data, jobs=jobs, callback=progress_bar)

    # Phase 3: Post-transfer bookkeeping
    if transferred:
        update_meta(indexes)      # merge remote version IDs back
        drop_data_index()         # invalidate cached index

    # Phase 4: Error reporting
    if failed:
        raise UploadError(failed)
    return transferred


FETCH(repo, targets, remote, jobs):
    # Phase 1: Collect what needs to transfer
    indexes = collect_indexes(repo, targets, remote)
    data = collect(indexes, direction="remote", cache_index=repo.data_index)

    # Phase 2: Filter unversioned entries (worktree remotes)
    data = log_unversioned(data)

    # Phase 3: Dispatch parallel transfers
    transferred, failed = ifetch(data, jobs=jobs, callback=progress_bar)

    # Phase 4: Cache invalidation
    if transferred:
        drop_data_index()

    # Phase 5: Error reporting
    if failed:
        raise DownloadError(failed)
    return transferred

The collect step (from dvc_data.index.fetch/push) transforms IndexView objects into flat lists of DataIndex entries annotated with source and destination storage information. It also consults a cache_index to skip entries that are already known to be present at the destination, using a tokenized cache key derived from the data tree hashes to detect when the cache is stale.
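One way to derive such a staleness token is to fold the tree hashes into a single digest; any change to a tree hash invalidates every "already present" record keyed on the old token. The helpers below are hypothetical, standing in for dvc-data's actual key construction:

```python
import hashlib

def cache_key(tree_hashes):
    """Fold all data-tree hashes into one order-independent token."""
    h = hashlib.sha256()
    for th in sorted(tree_hashes):
        h.update(th.encode())
    return h.hexdigest()

def needs_transfer(entry, known_at_destination, token):
    """Skip entries recorded as present under the *current* token only.

    A record made under an older token no longer matches, so the entry
    is re-checked rather than silently skipped.
    """
    return known_at_destination.get(entry) != token
```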

The ipush/ifetch functions (from dvc_data.index.push/fetch) perform the actual parallel I/O. They use thread pools sized by the jobs parameter, with each worker handling one file at a time. The callback mechanism allows the caller to update a progress bar after each completed transfer.
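Putting pool sizing, per-file workers, and the callback together, a simplified ifetch-style dispatcher might look like the following. The names and signature are illustrative, not dvc-data's actual API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_transfer(entries, transfer_one, jobs=4, callback=None):
    """Run one worker per file; count successes and failures separately.

    transfer_one(entry) performs a single file's I/O; any exception it
    raises marks that entry as failed without aborting the rest.
    """
    transferred = failed = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(transfer_one, e): e for e in entries}
        for fut in as_completed(futures):
            try:
                fut.result()
                transferred += 1
                if callback:
                    callback()  # tick the progress bar per completed file
            except Exception:
                failed += 1
    return transferred, failed
```

The `(transferred, failed)` pair feeds directly into the error-reporting phase of the pseudocode above: a nonzero failure count becomes an `UploadError` or `DownloadError`.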

Key design properties:

  • Atomicity per file: Each file transfer is atomic: it either succeeds completely or is counted as failed. No partial files are left on the remote.
  • Idempotency: Re-running a push or fetch with the same inputs produces the same result. Already-transferred files are detected via the cache index and skipped.
  • Separation of concerns: The DVC layer handles index collection, error reporting, and metadata updates; the dvc-data layer handles the actual I/O.
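Per-file atomicity on a local filesystem is typically achieved with a write-to-temp-then-rename pattern, sketched below; object-store remotes get the same guarantee from their backends' atomic PUT semantics. This is a generic illustration of the property, not DVC's exact implementation:

```python
import os
import shutil
import tempfile

def atomic_copy(src, dst):
    """Copy via a temp file in dst's directory, then atomically rename.

    os.replace is atomic on POSIX when source and target are on the
    same filesystem, so dst is never observed half-written; on any
    failure the temp file is removed and dst is left untouched.
    """
    dst_dir = os.path.dirname(dst) or "."
    fd, tmp = tempfile.mkstemp(dir=dst_dir)
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)
    except BaseException:
        os.unlink(tmp)
        raise
```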

Related Pages

Implemented By
