Workflow:Iterative Dvc Remote Data Sync
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, Cloud_Storage, MLOps |
| Last Updated | 2026-02-10 10:30 GMT |
Overview
End-to-end process for synchronizing DVC-tracked data between the local cache and remote storage backends (S3, GCS, Azure, SSH, HDFS, and others), enabling team collaboration and data backup.
Description
This workflow covers the three core data transfer operations in DVC: push (upload from the local cache to a remote), pull (download from a remote and check out to the workspace), and fetch (download to the local cache without a workspace checkout). These operations use a content-addressable transfer protocol that compares hash inventories between local and remote storage and transfers only the missing objects. DVC supports multiple remote storage backends through the fsspec filesystem abstraction layer and transparently handles both the legacy hash format (`md5-dos2unix`, used before DVC 3.0) and the current one (`md5`).
Goal: Synchronized data between local cache, workspace, and one or more remote storage backends.
Scope: From local cache state through remote comparison to bidirectional data transfer.
Strategy: Content-hash comparison with parallel transfer and progress reporting via the DataCloud abstraction.
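To make the relationship between the three operations concrete, here is a minimal in-memory sketch. It is not DVC code: the dictionaries standing in for the cache, remote, and workspace, and the helper names `md5_of`, `transfer_missing`, `push`, `fetch`, and `pull`, are all hypothetical and chosen only for illustration.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Content address of a blob: the hex MD5 digest of its bytes."""
    return hashlib.md5(data).hexdigest()


def transfer_missing(src: dict, dst: dict, wanted: set) -> int:
    """Copy only the wanted objects that dst lacks; return how many were copied."""
    missing = [h for h in wanted if h in src and h not in dst]
    for h in missing:
        dst[h] = src[h]
    return len(missing)


def push(cache: dict, remote: dict, wanted: set) -> int:
    """Upload: local cache -> remote."""
    return transfer_missing(cache, remote, wanted)


def fetch(cache: dict, remote: dict, wanted: set) -> int:
    """Download: remote -> local cache, workspace untouched."""
    return transfer_missing(remote, cache, wanted)


def pull(cache: dict, remote: dict, workspace: dict, tracked: dict) -> int:
    """Fetch, then check out. `tracked` maps workspace path -> hash."""
    fetched = fetch(cache, remote, set(tracked.values()))
    for path, h in tracked.items():
        # Simplification: a hash absent from both sides raises KeyError here.
        workspace[path] = cache[h]
    return fetched
```

Because objects are keyed by content hash, a repeated `push` of the same data copies nothing, and `pull` is simply `fetch` followed by a checkout step.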
Usage
Execute this workflow when:
- You need to share tracked data with team members by uploading to a shared remote (push)
- You are setting up a new workspace and need to download data tracked in the repository (pull)
- You want to pre-fetch data to the local cache without modifying the workspace (fetch)
- You are backing up data artifacts to cloud storage
- A CI/CD pipeline needs access to DVC-tracked data
Execution Steps
Step 1: Configure Remote Storage
Before any transfer can occur, at least one remote storage backend must be configured. DVC uses the remote named by the `--remote` flag if one is given; otherwise it falls back to the default remote set in the DVC configuration (`core.remote`). The remote configuration specifies the storage URL, authentication credentials, and transport options. Multiple remotes can be configured, with one designated as the default.
Key considerations:
- Remote configuration is read from layered config files in increasing order of precedence: system, global, repo (`.dvc/config`), and local (`.dvc/config.local`)
- Each remote maps to a storage URL (e.g., `s3://bucket/path`, `gs://bucket`, `azure://container`)
- Authentication is handled via storage-specific mechanisms (AWS profiles, GCP service accounts, Azure SAS tokens)
- Remotes configured as version-aware or worktree rely on the cloud provider's native object versioning and are handled separately from ordinary content-addressed remotes
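The resolution order described in this step can be sketched as follows. This is an illustration rather than DVC's implementation: the nested-dictionary layout, the function names `merge_layers` and `resolve_remote`, and the `profile` option in the usage example are assumptions made for the example.

```python
from typing import Optional

# Lowest to highest precedence; later layers override earlier ones.
PRECEDENCE = ("system", "global", "repo", "local")


def merge_layers(layers: dict) -> dict:
    """Merge per-level config dicts into one effective config."""
    merged = {"core": {}, "remotes": {}}
    for level in PRECEDENCE:
        layer = layers.get(level, {})
        merged["core"].update(layer.get("core", {}))
        for name, opts in layer.get("remotes", {}).items():
            merged["remotes"].setdefault(name, {}).update(opts)
    return merged


def resolve_remote(layers: dict, cli_remote: Optional[str] = None):
    """Pick the remote from the CLI flag, else the configured default."""
    config = merge_layers(layers)
    name = cli_remote or config["core"].get("remote")
    if name is None:
        raise ValueError("no remote given and no default remote configured")
    if name not in config["remotes"]:
        raise KeyError(f"remote {name!r} is not configured")
    return name, config["remotes"][name]


# Hypothetical layered configuration for illustration.
layers = {
    "repo": {
        "core": {"remote": "storage"},
        "remotes": {
            "storage": {"url": "s3://bucket/path"},
            "backup": {"url": "gs://bucket"},
        },
    },
    "local": {"remotes": {"storage": {"profile": "dev"}}},
}
default_name, default_opts = resolve_remote(layers)
backup_name, backup_opts = resolve_remote(layers, cli_remote="backup")
```

Keeping credentials in the local layer while the URL lives in the repo layer mirrors the usual split between shared and per-user settings.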
Step 2: Collect Transfer Targets
DVC determines which data objects need to be transferred by scanning the repository index. It collects hash information from all tracked outputs across the specified revisions (current workspace, branches, tags, or all commits). Targets can be filtered by path, and dependency-based filtering is supported.
Key considerations:
- The `--all-branches`, `--all-tags`, and `--all-commits` flags expand the collection scope
- The `--with-deps` flag includes outputs from upstream pipeline stages
- The `--recursive` flag searches each target directory and its subdirectories for tracked outputs
- Hash information is split into legacy and default formats for backward compatibility
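A rough sketch of target collection across revisions, followed by the legacy/default split, might look like this. The `HashInfo` tuple, the shape of the `revisions` mapping, and the contents of `LEGACY_HASH_NAMES` are assumptions made for illustration and are not taken from DVC's source.

```python
from typing import NamedTuple, Optional

# Assumption: hash names treated as legacy in this sketch.
LEGACY_HASH_NAMES = {"md5-dos2unix"}


class HashInfo(NamedTuple):
    name: str   # hash algorithm label, e.g. "md5"
    value: str  # hex digest


def _matches(path: str, targets: Optional[list]) -> bool:
    """True if no filter is given or path equals/lies under a target."""
    if targets is None:
        return True
    return any(path == t or path.startswith(t.rstrip("/") + "/") for t in targets)


def collect_hashes(revisions: dict, revs: list, targets: Optional[list] = None) -> set:
    """Union of hashes for matching outputs across the given revisions.

    `revisions` maps a revision name to {output path: HashInfo}.
    """
    collected = set()
    for rev in revs:
        for path, hash_info in revisions[rev].items():
            if _matches(path, targets):
                collected.add(hash_info)
    return collected


def split_legacy(hash_infos: set):
    """Partition hashes into (legacy, default) sets by hash name."""
    legacy = {hi for hi in hash_infos if hi.name in LEGACY_HASH_NAMES}
    return legacy, set(hash_infos) - legacy
```

Collecting into a set means an object referenced by several branches or tags is only considered once for transfer.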
Step 3: Compare Local and Remote State
DVC compares the collected hash inventories between local cache and remote storage to determine which objects are missing on either side. This comparison uses indexed lookups for efficiency, maintaining a persistent data index that caches remote inventory state.
Key considerations:
- The comparison yields four categories: `ok` (present in both), `missing` (present in neither), `new` (local cache only), and `deleted` (remote only)
- Push operations identify objects in local cache but not on remote
- Fetch operations identify objects on remote but not in local cache
- The data index is cached and invalidated when transfers complete
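The four categories above reduce to set membership tests. The following sketch assumes the local and remote inventories are already available as plain sets of hash strings; the function name `compare` is hypothetical.

```python
def compare(wanted: set, local: set, remote: set) -> dict:
    """Classify each wanted hash by where it is present."""
    status = {"ok": set(), "missing": set(), "new": set(), "deleted": set()}
    for h in wanted:
        in_local, in_remote = h in local, h in remote
        if in_local and in_remote:
            status["ok"].add(h)        # present in both
        elif in_local:
            status["new"].add(h)       # local cache only: push candidate
        elif in_remote:
            status["deleted"].add(h)   # remote only: fetch candidate
        else:
            status["missing"].add(h)   # present in neither
    return status


# Hypothetical inventories for illustration.
status = compare(
    wanted={"a1", "b2", "c3", "d4"},
    local={"a1", "b2"},
    remote={"a1", "c3"},
)
to_push = status["new"]
to_fetch = status["deleted"]
```

Objects in the `missing` category cannot be recovered by either operation, which is why they are reported separately.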
Step 4: Execute Transfer
Missing objects are transferred in parallel using configurable concurrency. Push operations upload from local cache to remote; fetch operations download from remote to local cache. Progress is reported via a callback-driven progress bar. Transfer failures are tracked and reported.
Key considerations:
- The `--jobs` flag controls transfer parallelism
- Transfer uses the fsspec filesystem abstraction for storage-agnostic operations
- Failed transfers raise `UploadError` or `DownloadError` with counts of failed items
- Run-cache entries can optionally be transferred alongside data objects
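The parallel transfer with failure counting can be approximated with a thread pool. This is a sketch under stated assumptions: `TransferError` is a hypothetical stand-in for the upload and download errors named above, `copy_one` is a caller-supplied function that moves one object, and only `OSError` is counted as a transfer failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class TransferError(Exception):
    """Hypothetical error carrying the number of failed objects."""

    def __init__(self, failed: int):
        super().__init__(f"{failed} object(s) failed to transfer")
        self.failed = failed


def transfer(hashes, copy_one, jobs: int = 4, on_progress=None) -> int:
    """Run copy_one(hash) for every hash using `jobs` worker threads.

    Calls on_progress(finished, total) after each object, then raises
    TransferError if any copy raised OSError. Returns the success count.
    """
    done = failed = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(copy_one, h) for h in sorted(hashes)]
        for future in as_completed(futures):
            try:
                future.result()
                done += 1
            except OSError:
                failed += 1
            if on_progress is not None:
                on_progress(done + failed, len(futures))
    if failed:
        raise TransferError(failed)
    return done
```

Failures are collected rather than raised immediately, so one bad object does not stop the remaining transfers; the error is reported once at the end with the total count.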
Step 5: Checkout to Workspace (Pull Only)
When performing a pull operation, after fetching data to the local cache, DVC checks out the files to the workspace. This restores the actual data files from the cache using the configured link type (reflink, hardlink, symlink, or copy). The checkout operation reconciles the workspace state with the `.dvc` and `dvc.lock` file specifications.
Key considerations:
- Checkout is skipped for fetch-only operations
- The `--force` flag overwrites modified workspace files
- The `--allow-missing` flag tolerates missing cache entries without error
- Checkout reports statistics on added, modified, and deleted files
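The link-type fallback used during checkout can be illustrated with the standard library. This sketch is not DVC's checkout code: the function `link_from_cache`, the order of link types, and the `demo` helper are assumptions. Reflinks are left out because the standard library has no portable call for them, and any existing workspace file is treated as a conflict unless `force` is set, which is a simplification.

```python
import os
import shutil
import tempfile

# Link strategies tried in the order requested by the caller.
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_from_cache(cache_path, workspace_path,
                    link_types=("hardlink", "symlink", "copy"), force=False):
    """Materialize a cached object in the workspace; return the link type used."""
    cache_path = os.path.abspath(cache_path)  # keeps symlinks valid
    if os.path.lexists(workspace_path):
        if not force:
            raise FileExistsError(workspace_path)
        os.remove(workspace_path)
    os.makedirs(os.path.dirname(workspace_path) or ".", exist_ok=True)
    for link_type in link_types:
        try:
            LINKERS[link_type](cache_path, workspace_path)
            return link_type
        except OSError:
            continue  # unsupported on this filesystem: try the next type
    raise OSError(f"no usable link type among {link_types}")


def demo():
    """Check out one cached object into a temporary workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache", "ab", "cdef")
        os.makedirs(os.path.dirname(cached))
        with open(cached, "wb") as f:
            f.write(b"payload")
        target = os.path.join(tmp, "workspace", "data.bin")
        used = link_from_cache(cached, target)
        with open(target, "rb") as f:
            return used, f.read()
```

Trying cheaper link types first avoids duplicating large files on disk, with a full copy as the last resort when the filesystem supports nothing else.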
Step 6: Update Data Index
After successful transfer, DVC updates its persistent data index to reflect the new synchronization state. For version-aware remotes (cloud versioned storage), push operations also update output metadata with version IDs from the remote.
Key considerations:
- The data index is dropped and rebuilt after transfers to ensure consistency
- Version-aware remote metadata is written back to stage files
- The index serves as a cache to avoid redundant remote inventory queries
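The caching and invalidation behaviour described in this step can be sketched as a small wrapper around an expensive remote listing. The class `RemoteIndexCache`, its `queries` counter, and the function `push_with_index` are hypothetical names for illustration; DVC's actual data index is more elaborate.

```python
class RemoteIndexCache:
    """Caches the remote inventory until it is explicitly dropped."""

    def __init__(self, list_remote):
        self._list_remote = list_remote  # callable returning remote hashes
        self._known = None
        self.queries = 0                 # how many real listings were made

    def hashes(self) -> set:
        """Return the cached inventory, listing the remote only if needed."""
        if self._known is None:
            self._known = set(self._list_remote())
            self.queries += 1
        return self._known

    def drop(self) -> None:
        """Invalidate the cache so the next lookup re-lists the remote."""
        self._known = None


def push_with_index(local: dict, remote: dict, index: RemoteIndexCache) -> set:
    """Upload objects the index says are absent, then invalidate the index."""
    to_upload = set(local) - index.hashes()
    for h in to_upload:
        remote[h] = local[h]
    if to_upload:
        index.drop()  # remote changed: cached inventory is stale
    return to_upload
```

Dropping the index after a transfer trades one extra listing for the guarantee that later comparisons never run against a stale inventory.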