Principle:Iterative Dvc Cache Transfer
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, Storage_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Cache transfer is the process of moving or linking versioned data between a user's working directory and a content-addressable cache store, using configurable link strategies to balance storage efficiency, performance, and data safety.
Description
Data version control systems maintain two distinct storage locations for tracked files: the workspace (the user's working directory where files are read and modified) and the cache (a content-addressable store where immutable versions are preserved). The cache transfer principle governs how data moves between these two locations.
When a file is added to tracking, its contents must be transferred into the cache so that the current version is preserved. At the same time, the workspace copy must be updated to reference the cached version. When a user checks out a different version (e.g., by switching Git branches), the workspace copy must be replaced with the appropriate cached version. These bidirectional transfers must be efficient in both time and space, especially for large data files.
The key insight is that a naive implementation -- making full copies in both directions -- doubles storage consumption for every tracked file. Instead, the system supports multiple link types that allow the workspace and cache to share the same physical data blocks on disk. The choice of link type represents a three-way tradeoff between space efficiency, modification safety, and filesystem compatibility.
Usage
Cache transfer operations are invoked whenever:
- A file is being added to DVC tracking (workspace-to-cache transfer, then relinking).
- A version checkout is performed (cache-to-workspace transfer via linking).
- A pipeline stage completes execution and its outputs are committed to the cache.
- The user runs dvc checkout to synchronize workspace files with their recorded versions.
- Cache link types are reconfigured and existing files need relinking.
Theoretical Basis
Link type hierarchy. The system supports four link strategies, ordered by preference:
Link Type | Space Savings | Modification Safety         | OS Support
----------+---------------+-----------------------------+------------------------
Reflink   | Maximum       | Safe (copy-on-write)        | macOS APFS, XFS, Btrfs
Hardlink  | Maximum       | Unsafe (shared data)        | All POSIX, NTFS
Symlink   | Maximum       | Unsafe (writes follow link) | POSIX, Windows (admin)
Copy      | None          | Safe (independent)          | Universal
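One way to read "ordered by preference" is as a fallback chain: try the most space-efficient link type first and move to the next when the filesystem or OS rejects it. The sketch below illustrates that idea only. The names `LINKERS` and `link_with_fallback` are hypothetical, the default order is an assumption, and reflink is left out because the Python standard library has no portable call for it.

```python
import os
import shutil
import tempfile

# Hypothetical table mapping a link-type name to a stdlib call.
# Each callable takes (source, destination).
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_with_fallback(cache_path, workspace_path,
                       preference=("hardlink", "symlink", "copy")):
    """Try each link type in order; return the name of the first that works."""
    for name in preference:
        try:
            LINKERS[name](cache_path, workspace_path)
            return name
        except OSError:
            # Not supported here: remove any partial target, try the next type.
            if os.path.lexists(workspace_path):
                os.remove(workspace_path)
    raise OSError(f"no link type in {preference} worked for {workspace_path}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache_entry = os.path.join(tmp, "cache_entry")
        with open(cache_entry, "w") as f:
            f.write("version 1")
        used = link_with_fallback(cache_entry, os.path.join(tmp, "data.txt"))
        print("linked using:", used)
```

Passing `preference=("copy",)` forces an independent copy, which trades space for safety as the table describes.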
Reflinks (copy-on-write). A reflink creates a new directory entry pointing to the same physical data blocks. The filesystem transparently copies blocks only when one of the copies is modified. This provides the ideal combination of zero initial space overhead and full modification safety:
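import fcntl

FICLONE = 0x40049409  # Linux ioctl request number; other platforms use different calls

def reflink(source, target):
    # Filesystem-level operation: ioctl(FICLONE), Linux only.
    # Initial state: source and target share all data blocks (no extra space used).
    # On write to either file, the filesystem copies only the modified blocks,
    # so the other file is unaffected (modification-safe).
    with open(source, "rb") as src, open(target, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())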
Hardlinks. A hardlink creates a second directory entry pointing to the same inode. Both paths reference identical data with zero space overhead. However, modifications through either path affect the cached version, potentially corrupting the version history. Systems using hardlinks must therefore protect cache files with read-only permissions:
import os

def hardlink(cache_path, workspace_path):
    os.link(cache_path, workspace_path)
    # Both paths share one inode, so this also makes the workspace path read-only.
    os.chmod(cache_path, 0o444)
    # Danger: if the workspace file is modified in place,
    # the cache copy is modified as well.
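The shared-inode behaviour described above can be observed directly. The following is a small demonstration under assumed file names (`cache_entry`, `data.txt`). Both helper functions are hypothetical and exist only for illustration. The first shows an in-place write leaking into the cache. The second shows that replacing the workspace path with a freshly written file leaves the cache entry untouched.

```python
import os
import tempfile


def demo_hardlink_sharing(tmp_dir):
    """Return (same_inode, cache_text_after_inplace_write)."""
    cache_path = os.path.join(tmp_dir, "cache_entry")
    workspace_path = os.path.join(tmp_dir, "data.txt")
    with open(cache_path, "w") as f:
        f.write("version 1")

    os.link(cache_path, workspace_path)
    same_inode = os.stat(cache_path).st_ino == os.stat(workspace_path).st_ino

    # In-place write through the workspace path: the cache sees the change.
    with open(workspace_path, "w") as f:
        f.write("edited")
    with open(cache_path) as f:
        return same_inode, f.read()


def replace_without_touching_cache(workspace_path, new_text):
    """Write to a temporary file, then swap it in; the old inode is not modified."""
    tmp_path = workspace_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(new_text)
    os.replace(tmp_path, workspace_path)  # workspace path now names a new inode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo_hardlink_sharing(d))
```

This is why read-only permissions matter for hardlinked files: they block the in-place write path, which is the one that corrupts the cache.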
Transfer protocol. The complete add-to-cache workflow follows this sequence:
function transfer_to_cache_and_relink(file, cache):
    1. hash = compute_hash(file)
    2. cache_path = cache.oid_to_path(hash)
    3. transfer(file -> cache_path)           // Stage to cache
    4. remove(file)                           // Remove workspace original
    5. link(cache_path -> file)               // Relink from cache
    6. update_state_db(file, hash, mtime)     // Record for future change detection
The relink steps (4-5) are critical: once the file is safely in the cache, the workspace copy is replaced with a link to the cached version. This ensures the workspace file and cache entry share storage while the cache remains the authoritative copy.
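The six numbered steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation. SHA-256 is chosen only for the example, since the text does not name a hash. The two-character subdirectory layout in `oid_to_path` is assumed. A plain dict stands in for the state database. Relinking tries a hardlink and falls back to a copy.

```python
import hashlib
import os
import shutil
import tempfile


def compute_hash(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()  # assumption: hash algorithm chosen for illustration
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def oid_to_path(cache_dir, oid):
    # Assumed layout: first two hex digits form a subdirectory.
    return os.path.join(cache_dir, oid[:2], oid[2:])


def transfer_to_cache_and_relink(file_path, cache_dir, state):
    oid = compute_hash(file_path)                           # step 1
    cache_path = oid_to_path(cache_dir, oid)                # step 2
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    if not os.path.exists(cache_path):                      # same content is stored once
        shutil.copyfile(file_path, cache_path)              # step 3
    os.remove(file_path)                                    # step 4
    try:
        os.link(cache_path, file_path)                      # step 5
    except OSError:
        shutil.copyfile(cache_path, file_path)              # fallback: independent copy
    state[file_path] = (oid, os.stat(file_path).st_mtime)   # step 6
    return oid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        data = os.path.join(d, "data.bin")
        with open(data, "wb") as f:
            f.write(b"hello")
        state = {}
        print(transfer_to_cache_and_relink(data, os.path.join(d, "cache"), state))
```

Note that the workspace original is removed only after the cache copy exists, so an interruption between steps 3 and 4 leaves the data present in at least one location.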