Workflow:Iterative Dvc Remote Data Sync

From Leeroopedia


Knowledge Sources
Domains: Data_Versioning, Cloud_Storage, MLOps
Last Updated: 2026-02-10 10:30 GMT

Overview

End-to-end process for synchronizing DVC-tracked data between the local cache and remote storage backends (S3, GCS, Azure, SSH, HDFS, and others), enabling team collaboration and data backup.

Description

This workflow covers the three core data transfer operations in DVC: push (upload objects from the local cache to a remote), fetch (download objects from a remote into the local cache without touching the workspace), and pull (fetch, then check out the data to the workspace). All three use content-addressed transfer: DVC compares the hashes required by the repository against what the local cache and the remote already hold, and transfers only the missing objects. Remote storage backends are accessed through the fsspec filesystem abstraction layer, and both the legacy hash format (`md5-dos2unix`, used by DVC 2.x) and the current format (plain `md5`, used since DVC 3.0) are handled transparently.

Goal: Synchronized data between local cache, workspace, and one or more remote storage backends.

Scope: From local cache state through remote comparison to bidirectional data transfer.

Strategy: Content-hash comparison with parallel transfer and progress reporting via the DataCloud abstraction.

Usage

Execute this workflow when:

  • You need to share tracked data with team members by uploading to a shared remote (push)
  • You are setting up a new workspace and need to download data tracked in the repository (pull)
  • You want to pre-fetch data to the local cache without modifying the workspace (fetch)
  • You are backing up data artifacts to cloud storage
  • A CI/CD pipeline needs access to DVC-tracked data

Execution Steps

Step 1: Configure Remote Storage

Before any transfer can occur, at least one remote storage backend must be configured. DVC uses the remote named by the `--remote` flag if one is given; otherwise it falls back to the default remote set in the DVC configuration (`core.remote`). The remote configuration specifies the storage URL, authentication credentials, and transport options. Multiple remotes can be configured, with one designated as the default.

Key considerations:

  • Remote configuration is stored in layered DVC config files: system, global, repo (`.dvc/config`), and local (`.dvc/config.local`), with later levels overriding earlier ones
  • Each remote maps to a storage URL (e.g., `s3://bucket/path`, `gs://bucket`, `azure://container`)
  • Authentication is handled via storage-specific mechanisms (AWS profiles, GCP service accounts, Azure SAS tokens)
  • Remotes configured as `version_aware` or `worktree` rely on the cloud provider's object versioning and are handled separately from regular content-addressed remotes
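
The resolution order described above can be illustrated with a minimal sketch. It is not DVC's implementation: the config layers are modeled as plain dictionaries, and `merge_layers`, `resolve_remote`, the remote names, and the `profile` option are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional

Config = Dict[str, dict]


def merge_layers(layers: List[Config]) -> Config:
    """Merge config layers; later layers override earlier ones key by key."""
    merged: Config = {"core": {}, "remote": {}}
    for layer in layers:
        merged["core"].update(layer.get("core", {}))
        for name, opts in layer.get("remote", {}).items():
            merged["remote"].setdefault(name, {}).update(opts)
    return merged


def resolve_remote(config: Config, flag: Optional[str] = None) -> dict:
    """Use the remote named by the flag, else the configured default."""
    name = flag or config["core"].get("remote")
    if not name:
        raise ValueError("no remote given and no default remote configured")
    if name not in config["remote"]:
        raise KeyError(f"remote {name!r} is not configured")
    return {"name": name, **config["remote"][name]}


# Layers ordered from lowest to highest precedence (illustrative values).
layers = [
    {},                                                   # system
    {"remote": {"backup": {"url": "gs://bucket"}}},       # global
    {"core": {"remote": "shared"},                        # repo
     "remote": {"shared": {"url": "s3://bucket/path"}}},
    {"remote": {"shared": {"profile": "dev"}}},           # local
]

config = merge_layers(layers)
default_remote = resolve_remote(config)                 # no flag: default
explicit_remote = resolve_remote(config, flag="backup")  # flag wins
```

Keeping credentials in the highest-precedence (local) layer, as in the sketch, lets the shared repo-level file hold only the storage URL.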

Step 2: Collect Transfer Targets

DVC determines which data objects need to be transferred by scanning the repository index. It collects hash information from all tracked outputs across the specified revisions (current workspace, branches, tags, or all commits). The collection can be restricted to specific target paths and, optionally, extended to the pipeline stages those targets depend on.

Key considerations:

  • The `--all-branches`, `--all-tags`, and `--all-commits` flags expand the collection scope
  • The `--with-deps` flag includes outputs from upstream pipeline stages
  • The `--recursive` flag searches each target directory and its subdirectories for tracked outputs to include
  • Hash information is split into legacy and default formats for backward compatibility
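
As a rough illustration of the collection step, the sketch below gathers hash records across a chosen set of revisions and then splits them into legacy and default groups. The `HashInfo` tuple, the `collect` and `split_legacy` helpers, the revision names, and the hash values are all hypothetical; the set of legacy hash names is an assumption made for the example.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class HashInfo(NamedTuple):
    name: str   # hash algorithm label, e.g. "md5"
    value: str  # hex digest (shortened here for readability)


# Assumption for this sketch: which hash names count as legacy.
LEGACY_HASH_NAMES = {"md5-dos2unix"}


def collect(outputs_by_rev: Dict[str, Iterable[HashInfo]],
            revs: Iterable[str]) -> Set[HashInfo]:
    """Union the hash records of all tracked outputs in the given revisions."""
    collected: Set[HashInfo] = set()
    for rev in revs:
        collected.update(outputs_by_rev.get(rev, ()))
    return collected


def split_legacy(hash_infos: Iterable[HashInfo]
                 ) -> Tuple[Set[HashInfo], Set[HashInfo]]:
    """Separate legacy-format records from default-format ones."""
    legacy: Set[HashInfo] = set()
    default: Set[HashInfo] = set()
    for hi in hash_infos:
        (legacy if hi.name in LEGACY_HASH_NAMES else default).add(hi)
    return legacy, default


outputs_by_rev = {
    "workspace": [HashInfo("md5", "aaa"), HashInfo("md5", "bbb")],
    "v1.0": [HashInfo("md5-dos2unix", "ccc"), HashInfo("md5", "aaa")],
}

workspace_only = collect(outputs_by_rev, ["workspace"])
all_revs = collect(outputs_by_rev, ["workspace", "v1.0"])  # wider scope
legacy, default = split_legacy(all_revs)
```

Because the records are collected into a set, an object referenced by several revisions is transferred only once.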

Step 3: Compare Local and Remote State

DVC compares the collected hash inventories between local cache and remote storage to determine which objects are missing on either side. This comparison uses indexed lookups for efficiency, maintaining a persistent data index that caches remote inventory state.

Key considerations:

  • The comparison yields four categories: ok (present on both), missing (neither), new (local only), deleted (remote only)
  • Push operations identify objects in local cache but not on remote
  • Fetch operations identify objects on remote but not in local cache
  • The data index is cached and invalidated when transfers complete
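
The four categories can be expressed with plain set membership. The sketch below is a simplified model, not DVC's code: `compare` is a hypothetical helper, and the single-letter strings stand in for object hashes.

```python
from typing import Dict, Iterable, Set


def compare(wanted: Iterable[str], local: Set[str],
            remote: Set[str]) -> Dict[str, Set[str]]:
    """Classify each wanted hash by where it is currently present."""
    status: Dict[str, Set[str]] = {
        "ok": set(), "missing": set(), "new": set(), "deleted": set(),
    }
    for h in wanted:
        in_local, in_remote = h in local, h in remote
        if in_local and in_remote:
            status["ok"].add(h)        # present on both sides
        elif in_local:
            status["new"].add(h)       # local only: candidate for push
        elif in_remote:
            status["deleted"].add(h)   # remote only: candidate for fetch
        else:
            status["missing"].add(h)   # present on neither side
    return status


status = compare(
    wanted={"a", "b", "c", "d"},
    local={"a", "c"},
    remote={"a", "d"},
)
to_push = status["new"]
to_fetch = status["deleted"]
```

Objects in the "missing" category cannot be recovered by either operation, which is why they are worth reporting separately.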

Step 4: Execute Transfer

Missing objects are transferred in parallel using configurable concurrency. Push operations upload from local cache to remote; fetch operations download from remote to local cache. Progress is reported via a callback-driven progress bar. Transfer failures are tracked and reported.

Key considerations:

  • The `--jobs` flag controls transfer parallelism
  • Transfer uses the fsspec filesystem abstraction for storage-agnostic operations
  • Failed transfers raise `UploadError` or `DownloadError` with counts of failed items
  • Run-cache entries can optionally be transferred alongside data objects
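
A minimal sketch of parallel transfer with failure counting and a progress callback follows. It uses a thread pool and in-memory dictionaries in place of real storage; `transfer`, `copy_one`, and `TransferError` are hypothetical names, with `TransferError` standing in for the `UploadError` / `DownloadError` exceptions mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional


class TransferError(Exception):
    """Hypothetical stand-in for an upload/download error with a count."""

    def __init__(self, failed: int) -> None:
        super().__init__(f"{failed} object(s) failed to transfer")
        self.failed = failed


def transfer(hashes: Iterable[str],
             copy_one: Callable[[str], None],
             jobs: int = 4,
             on_progress: Optional[Callable[[int], None]] = None) -> int:
    """Copy objects with up to `jobs` workers; raise if any copy fails."""
    done = 0
    failed = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(copy_one, h) for h in hashes]
        for future in as_completed(futures):
            try:
                future.result()
                done += 1
            except OSError:
                failed += 1          # keep going, report the count at the end
            if on_progress is not None:
                on_progress(1)       # one tick per finished object
    if failed:
        raise TransferError(failed)
    return done


# In-memory stand-ins for the local cache and the remote.
source = {"a": b"1", "b": b"2"}
dest = {}
ticks = []


def copy_one(h: str) -> None:
    if h not in source:
        raise FileNotFoundError(h)
    dest[h] = source[h]


transferred = transfer(["a", "b"], copy_one, jobs=2, on_progress=ticks.append)
```

Counting failures instead of aborting on the first one lets the remaining objects finish, so a retry only has to cover what actually failed.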

Step 5: Checkout to Workspace (Pull Only)

When performing a pull operation, after fetching data to the local cache, DVC checks out the files to the workspace. This restores the actual data files from the cache using the configured link type (reflink, hardlink, symlink, or copy). The checkout operation reconciles the workspace state with the `.dvc` and `dvc.lock` file specifications.

Key considerations:

  • Checkout is skipped for fetch-only operations
  • The `--force` flag overwrites modified workspace files
  • The `--allow-missing` flag tolerates missing cache entries without error
  • Checkout reports statistics on added, modified, and deleted files
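
The link-type fallback can be sketched as trying each configured type in order until one succeeds. This is an illustration under simplifying assumptions: `checkout_file` is a hypothetical helper, the file names are made up, and reflink is left out because the Python standard library has no portable call for it.

```python
import os
import shutil
import tempfile
from typing import Sequence


def checkout_file(cache_path: str, workspace_path: str,
                  link_types: Sequence[str] = ("hardlink", "symlink", "copy"),
                  ) -> str:
    """Materialize a cached object in the workspace; return the type used."""
    linkers = {
        "hardlink": os.link,
        "symlink": os.symlink,
        "copy": shutil.copyfile,
    }
    for link_type in link_types:
        try:
            linkers[link_type](cache_path, workspace_path)
            return link_type
        except OSError:
            continue  # e.g. unsupported by the filesystem: try the next type
    raise OSError(f"could not link {cache_path} to {workspace_path}")


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cache_object")
    target = os.path.join(tmp, "data.csv")
    with open(cached, "w") as f:
        f.write("x,y\n1,2\n")

    used = checkout_file(cached, target)
    with open(target) as f:
        restored = f.read()
```

Links avoid duplicating large files on disk, while the final copy fallback keeps checkout working on filesystems that support neither kind of link.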

Step 6: Update Data Index

After successful transfer, DVC updates its persistent data index to reflect the new synchronization state. For version-aware remotes (cloud versioned storage), push operations also update output metadata with version IDs from the remote.

Key considerations:

  • The data index is dropped and rebuilt after transfers to ensure consistency
  • Version-aware remote metadata is written back to stage files
  • The index serves as a cache to avoid redundant remote inventory queries
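
The caching role of the index can be shown with a small model: remote contents are listed once, reused for later lookups, and discarded after a transfer so the next lookup sees fresh state. `RemoteIndexCache` and its methods are hypothetical and much simpler than DVC's persistent data index.

```python
from typing import Callable, Iterable, Optional, Set


class RemoteIndexCache:
    """Cache the set of hashes on a remote to avoid repeated listings."""

    def __init__(self, list_remote: Callable[[], Iterable[str]]) -> None:
        self._list_remote = list_remote   # hypothetical remote listing call
        self._known: Optional[Set[str]] = None
        self.queries = 0                  # how often the remote was listed

    def hashes(self) -> Set[str]:
        if self._known is None:           # first use, or after drop()
            self._known = set(self._list_remote())
            self.queries += 1
        return self._known

    def drop(self) -> None:
        """Invalidate the cache, e.g. after a completed transfer."""
        self._known = None


remote_objects = {"a"}
index = RemoteIndexCache(lambda: remote_objects)

index.hashes()
index.hashes()                            # served from the cache
queries_before_push = index.queries

remote_objects.add("b")                   # simulate a completed push
index.drop()                              # stale state is discarded
has_new_object = "b" in index.hashes()    # rebuilt from the remote
```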
