Principle:Iterative Dvc Cache Transfer

From Leeroopedia


Knowledge Sources
Domains Data_Versioning, Storage_Management
Last Updated 2026-02-10 00:00 GMT

Overview

Cache transfer is the process of moving or linking versioned data between a user's working directory and a content-addressable cache store, using configurable link strategies to balance storage efficiency, performance, and data safety.

Description

Data version control systems maintain two distinct storage locations for tracked files: the workspace (the user's working directory where files are read and modified) and the cache (a content-addressable store where immutable versions are preserved). The cache transfer principle governs how data moves between these two locations.
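As a rough illustration of what "content-addressable" means here, the sketch below derives a cache location from a file's contents rather than its name. The names compute_hash and oid_to_path are borrowed from the transfer protocol later on this page; the choice of MD5 and the two-character directory prefix are assumptions made for the example, not a description of the actual implementation.

```python
import hashlib
import os
import tempfile


def compute_hash(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.md5()  # assumption: MD5 is used here only for illustration
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def oid_to_path(cache_dir, oid):
    """Map a content hash to a cache location (hypothetical two-level layout)."""
    return os.path.join(cache_dir, oid[:2], oid[2:])


def demo():
    """Write the same bytes under two names and hash both files."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.csv")
        b = os.path.join(tmp, "b.csv")
        for p in (a, b):
            with open(p, "wb") as f:
                f.write(b"hello")
        h_a, h_b = compute_hash(a), compute_hash(b)
        return h_a, h_b, oid_to_path("cache", h_a)
```

Because the path depends only on the contents, two workspace files with identical bytes resolve to a single cache entry, which is what makes linking from the cache worthwhile.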

When a file is added to tracking, its contents must be transferred into the cache so that the current version is preserved. The workspace copy must then be replaced with a reference to the cached version. When a user checks out a different version (e.g., by switching Git branches), the workspace copy must be replaced with the appropriate cached version. These bidirectional transfers must be efficient in both time and space, especially for large data files.

The key insight is that a naive implementation -- making full copies in both directions -- doubles storage consumption for every tracked file. Instead, the system supports multiple link types that allow the workspace and cache to share the same physical data blocks on disk. The choice of link type represents a three-way tradeoff between space efficiency, modification safety, and filesystem compatibility.

Usage

Cache transfer operations are invoked whenever:

  • A file is added to DVC tracking (workspace-to-cache transfer, then relinking).
  • A version checkout is performed (cache-to-workspace transfer via linking).
  • A pipeline stage completes execution and its outputs are committed to the cache.
  • The user runs dvc checkout to synchronize workspace files with their recorded versions.
  • Cache link types are reconfigured and existing files need relinking.

Theoretical Basis

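Link type hierarchy. The system supports four link strategies, listed here in order of preference. Note that DVC's default cache.type setting tries only reflink and then copy; hardlink and symlink must be enabled explicitly, because both let an in-place edit in the workspace reach the cached data.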

Link Type | Space Savings | Modification Safety  | OS Support
----------+---------------+----------------------+------------------------
Reflink   | Maximum       | Safe (copy-on-write) | macOS APFS, XFS, Btrfs
Hardlink  | Maximum       | Unsafe (shared data) | All POSIX, NTFS
Symlink   | Maximum       | Safe (read-only ref) | POSIX, Windows (admin)
Copy      | None          | Safe (independent)   | Universal
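One way to act on such a preference order is to try each strategy in turn and fall back when the filesystem refuses it. The following is a minimal sketch under stated assumptions: link_with_fallback and the LINKERS table are hypothetical names, and the reflink entry is a placeholder that always fails, since the Python standard library has no portable reflink call.

```python
import os
import shutil
import tempfile


def _reflink(src, dst):
    # Assumption: placeholder only. A real implementation would issue a
    # filesystem clone call; here the strategy always reports "unsupported".
    raise OSError("reflink not available in this sketch")


# Hypothetical strategy table keyed by link type name
LINKERS = {
    "reflink": _reflink,
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_with_fallback(src, dst, order=("reflink", "hardlink", "symlink", "copy")):
    """Try each link type in order; return the name of the first that succeeds."""
    for name in order:
        try:
            LINKERS[name](src, dst)
            return name
        except OSError:
            # Strategy unsupported here: remove any partial target and move on
            if os.path.lexists(dst):
                os.remove(dst)
    raise OSError(f"no link type in {order} worked for {dst}")
```

Calling link_with_fallback(cache_path, workspace_path, order=("reflink", "copy")) in this sketch always ends at "copy", because the placeholder reflink fails; with a real reflink call the same loop would stop at the first strategy the filesystem accepts.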

Reflinks (copy-on-write). A reflink creates a new file, with its own inode and metadata, whose extents point to the same physical data blocks as the source. The filesystem transparently copies blocks only when one of the two files is modified. This provides the ideal combination of zero initial space overhead and full modification safety:

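import fcntl

FICLONE = 0x40049409  # Linux ioctl request number; macOS uses clonefile(2) instead

def reflink(source, target):
    # Linux-only sketch: raises OSError if the filesystem cannot clone.
    # Initial state: source and target share all data blocks (no extra space).
    # On a write to either file, the filesystem copies only the modified blocks,
    # so the other file is never affected.
    with open(source, "rb") as src, open(target, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())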

Hardlinks. A hardlink creates a second directory entry pointing to the same inode. Both paths reference identical data with zero space overhead. However, modifications through either path affect the cached version, potentially corrupting the version history. Systems using hardlinks must therefore protect cache files with read-only permissions:

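import os

def hardlink(cache_path, workspace_path):
    # Create a second directory entry for the cache file's inode
    os.link(cache_path, workspace_path)
    # Permissions live on the inode, so this makes both paths read-only
    os.chmod(cache_path, 0o444)
    # Danger: if the protection is lifted and the workspace file is
    # modified in place, the cache copy is modified too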
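The hazard can be reproduced in a few lines. The sketch below is illustrative only (hardlink_hazard_demo and the file names are made up for the example): it hardlinks a workspace file to a cache entry without read-only protection, edits the workspace path in place, and then contrasts that with a write-then-rename edit, which gives the workspace path a fresh inode and leaves the cache untouched.

```python
import os
import tempfile


def hardlink_hazard_demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = os.path.join(tmp, "cache_entry")
        workspace_path = os.path.join(tmp, "data.txt")
        with open(cache_path, "w") as f:
            f.write("version 1")

        os.link(cache_path, workspace_path)
        same_inode = os.stat(cache_path).st_ino == os.stat(workspace_path).st_ino

        # In-place edit through the workspace path rewrites the shared inode
        with open(workspace_path, "r+") as f:
            f.write("VERSION")
        with open(cache_path) as f:
            cache_after_inplace = f.read()

        # Safer edit: write a new file and rename it over the workspace path.
        # The rename points the workspace name at a new inode.
        tmp_path = workspace_path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write("version 2")
        os.replace(tmp_path, workspace_path)
        with open(cache_path) as f:
            cache_after_replace = f.read()

        return same_inode, cache_after_inplace, cache_after_replace
```

The in-place write changes what the cache path reads back, while the rename-based edit does not. Marking the cache entry read-only, as described above, turns the first kind of edit into a permission error for ordinary users instead of silent corruption.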

Transfer protocol. The complete add-to-cache workflow follows this sequence:

function transfer_to_cache_and_relink(file, cache):
    1. hash = compute_hash(file)
    2. cache_path = cache.oid_to_path(hash)
    3. transfer(file -> cache_path)          // Stage to cache
    4. remove(file)                          // Remove workspace original
    5. link(cache_path -> file)              // Relink from cache
    6. update_state_db(file, hash, mtime)    // Record for future change detection

The relink steps (4 and 5) are critical: once the file is safely in the cache, the workspace copy is replaced with a link to the cached version. This ensures that the workspace file and the cache entry share storage while the cache remains the authoritative copy.
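The six steps can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: the MD5 hash, the two-character cache directory prefix, the hardlink-then-copy fallback, and the use of a plain dict as the state database are all assumptions made for the example.

```python
import hashlib
import os
import shutil
import tempfile


def transfer_to_cache_and_relink(path, cache_dir, state_db):
    # 1-2. Hash the contents and derive the cache location
    with open(path, "rb") as f:
        oid = hashlib.md5(f.read()).hexdigest()  # reads the whole file: sketch only
    cache_path = os.path.join(cache_dir, oid[:2], oid[2:])

    # 3. Stage to cache; skip the copy if this content is already stored
    if not os.path.exists(cache_path):
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)
        staging = cache_path + ".tmp"
        shutil.copyfile(path, staging)
        os.replace(staging, cache_path)  # publish the cache entry in one rename

    # 4-5. Remove the workspace original and relink it from the cache
    os.remove(path)
    try:
        os.link(cache_path, path)
    except OSError:
        shutil.copyfile(cache_path, path)  # fallback when hardlinks are unsupported

    # 6. Record hash and mtime for later change detection
    state_db[path] = (oid, os.stat(path).st_mtime_ns)
    return oid
```

Removing the workspace file only after the cache entry exists means an interruption between steps 3 and 5 can lose the workspace link but never the data, which can always be relinked from the cache.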

Related Pages

Implemented By
