

Principle:Iterative Dvc Workspace Checkout

From Leeroopedia


Knowledge Sources
Domains: Data_Versioning, Workspace_Management
Last Updated: 2026-02-10 00:00 GMT

Overview

Workspace checkout synchronizes a working directory with the data state recorded in version-control metafiles: it diffs the workspace against that expected state, then applies the resulting additions, modifications, and deletions using objects from the content-addressed cache.

Description

After data has been fetched into a local cache, the working directory may still contain stale, missing, or extra files relative to what the current DVC metafiles (.dvc files and dvc.lock files) declare. Workspace checkout bridges this gap by comparing the current workspace state with the expected state defined by the data index, then applying the minimal set of changes needed to bring the workspace into alignment.

The process operates in four phases. First, a workspace data index is built by scanning the actual files on disk, computing hashes where needed, and recording file metadata (size, modification time, inode). Second, the expected data index is derived from the repository's DVC metafiles, which record the hash and metadata for every tracked output. Third, a diff is computed between these two indexes, classifying every entry as ADD (needs to be created), DELETE (needs to be removed), or MODIFY (needs to be replaced). Fourth, the diff is applied by creating, removing, or relinking files via the cache.

File creation uses a linking strategy that avoids unnecessary data duplication. Depending on the system configuration and filesystem capabilities, files may be hard-linked, symlinked, reflinked (copy-on-write), or copied from the cache. This strategy is transparent to the diff/apply logic, which delegates the actual file creation to the cache manager.
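The fallback between link types can be pictured with a minimal Python sketch. The helper name `link_from_cache` and the `LINKERS` table are hypothetical, not DVC's cache manager API; only `os.link`, `os.symlink`, and `shutil.copyfile` are real standard-library calls. Reflinks are left out because the standard library has no portable call for them.

```python
import os
import shutil
import tempfile

# Hypothetical table of link strategies (not DVC's API); reflink is omitted
# because the standard library offers no portable call for it.
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_from_cache(cache_path, dest, link_types=("hardlink", "symlink", "copy")):
    """Try each link type in order; return the name of the first that works."""
    for name in link_types:
        try:
            LINKERS[name](cache_path, dest)
            return name
        except OSError:
            # This strategy is unsupported here; remove any partial result
            # and fall through to the next one.
            if os.path.lexists(dest):
                os.remove(dest)
    raise OSError(f"none of {link_types} could create {dest}")


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache_object")
    with open(cache_path, "w") as f:
        f.write("payload")
    dest = os.path.join(tmp, "data.txt")
    # Which strategy wins depends on the filesystem.
    print(link_from_cache(cache_path, dest))
```

Because the caller only sees the returned strategy name, the diff/apply logic never needs to know which mechanism produced the file.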

The checkout also maintains a link-tracking database (the "state" system) that records which workspace files are links to cache objects. During a full-workspace checkout, this record makes it cheap to find links left over from previous checkouts that no output references any longer, and those stale links are removed.
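As a rough illustration of stale-link detection, the recorded links and the currently expected outputs can be modeled as two sets of paths. This is a simplified sketch: the function name `find_stale_links` and the set-based model are assumptions, and the real state database stores more than bare paths.

```python
def find_stale_links(recorded_links, expected_outputs):
    """Return paths linked by an earlier checkout that no output references now."""
    return sorted(set(recorded_links) - set(expected_outputs))


# Links registered by previous checkouts (hypothetical paths).
recorded = {"data/train.csv", "data/old_features.parquet", "models/model.pkl"}
# Outputs the current metafiles declare.
expected = {"data/train.csv", "models/model.pkl"}

print(find_stale_links(recorded, expected))  # ['data/old_features.parquet']
```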

Usage

Use workspace checkout when:

  • Switching between Git branches or tags that reference different versions of tracked data.
  • Restoring workspace files after a fetch or pull that populated the cache but did not update the workspace.
  • Forcing a clean workspace state with the force flag to discard local modifications to tracked files.
  • Relinking workspace files after changing the cache link type configuration.
  • Running automated pipelines that need the workspace to exactly match the recorded data state before execution.
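For scripted use, the cases above map onto the `dvc checkout` command and its `--force` and `--relink` flags. The wrapper below is a small sketch with hypothetical function names; it only assembles the argument list and assumes `dvc` is on the PATH when `run_checkout` is called.

```python
import subprocess


def checkout_command(targets=(), force=False, relink=False):
    """Build the argument list for a `dvc checkout` invocation."""
    cmd = ["dvc", "checkout"]
    if force:
        cmd.append("--force")   # discard local modifications to tracked files
    if relink:
        cmd.append("--relink")  # recreate links after a link-type change
    cmd.extend(targets)
    return cmd


def run_checkout(**kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(checkout_command(**kwargs), check=True)


print(checkout_command(targets=["data/train.csv"], force=True))
# ['dvc', 'checkout', '--force', 'data/train.csv']
```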

Theoretical Basis

The checkout algorithm follows a diff-then-apply pattern, with a safety check before the changes are applied and a statistics step afterwards:

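CHECKOUT(repo, targets, force, relink):
    # view: the repo's index narrowed to the requested targets;
    # root_dir and filesystem are taken from repo

    # Phase 1: Build workspace index (what exists on disk now)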
    old_index = build_data_index(
        view,
        root_dir,
        filesystem,
        compute_hash=True
    )

    # Phase 2: Get expected index (what should exist per metafiles)
    new_index = view.data["repo"]

    # Phase 3: Compute diff
    diff = compare(
        old_index,
        new_index,
        relink=relink,    # treat all entries as modified if relinking
        delete=True        # include deletions
    )

    # Phase 4: Safety check (unless forced)
    if not force:
        for entry in diff.files_delete:
            if not exists_in_cache(entry):
                raise "Cannot delete uncached file without --force"

    # Phase 5: Apply changes
    apply(diff, root_dir, filesystem,
          update_meta=False,
          onerror=log_and_track_failures)

    # Phase 6: Collect statistics
    stats = count(diff.changes, by_type=[ADD, DELETE, MODIFY])
    return {added: [...], modified: [...], deleted: [...], stats: stats}
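The safety check before the apply step can be sketched in Python as follows. The names `Entry`, `CheckoutError`, and `check_can_delete` are illustrative, and modeling the cache as a set of hashes is an assumption made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    hash: str


class CheckoutError(Exception):
    """Raised when a deletion would destroy data that is not in the cache."""


def check_can_delete(entries_to_delete, cached_hashes, force=False):
    """Refuse to delete files whose content is missing from the cache."""
    if force:
        return  # the caller explicitly accepted losing uncached data
    unsafe = [e.path for e in entries_to_delete if e.hash not in cached_hashes]
    if unsafe:
        raise CheckoutError(
            "Cannot delete uncached files without --force: " + ", ".join(unsafe)
        )


cached = {"h1"}
check_can_delete([Entry("data/a.csv", "h1")], cached)  # cached, so it passes
try:
    check_can_delete([Entry("notes.csv", "h9")], cached)
except CheckoutError as exc:
    print(exc)  # Cannot delete uncached files without --force: notes.csv
```

Running the check before any change is applied means a refusal leaves the workspace untouched.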

The compare function performs an entry-by-entry comparison between the old and new indexes. Entries whose hashes match produce no change unless relinking is requested. For each key (a file path expressed as a tuple of path components):

for each key in UNION(old_index.keys, new_index.keys):
    old_entry = old_index.get(key)
    new_entry = new_index.get(key)

    if old_entry is None:
        yield Change(ADD, new=new_entry)
    elif new_entry is None:
        yield Change(DELETE, old=old_entry)
    elif old_entry.hash != new_entry.hash or relink:
        yield Change(MODIFY, old=old_entry, new=new_entry)
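The comparison above translates almost directly into runnable Python. In this sketch each index is assumed to be a plain dict mapping a key tuple to a hash string, which is much simpler than a real data index; the `Change` tuple and the constants are illustrative names.

```python
from typing import NamedTuple, Optional

ADD, DELETE, MODIFY = "add", "delete", "modify"


class Change(NamedTuple):
    typ: str
    key: tuple
    old: Optional[str]
    new: Optional[str]


def compare(old_index, new_index, relink=False):
    """Yield a Change for every key that differs between the two indexes."""
    for key in sorted(old_index.keys() | new_index.keys()):
        old, new = old_index.get(key), new_index.get(key)
        if old is None:
            yield Change(ADD, key, None, new)
        elif new is None:
            yield Change(DELETE, key, old, None)
        elif old != new or relink:
            yield Change(MODIFY, key, old, new)
        # Equal hashes without relink: nothing to do for this key.


old = {("data", "a.csv"): "h1", ("data", "b.csv"): "h2", ("data", "c.csv"): "h3"}
new = {("data", "a.csv"): "h1", ("data", "b.csv"): "h2x", ("data", "d.csv"): "h4"}

print([(c.typ, c.key[-1]) for c in compare(old, new)])
# [('modify', 'b.csv'), ('delete', 'c.csv'), ('add', 'd.csv')]
```

With `relink=True` the unchanged `a.csv` is also reported as a MODIFY, which is how a relink forces every existing file to be recreated from the cache.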

The apply function then processes each change type:

  • ADD: Create the file by linking from cache (hardlink, symlink, reflink, or copy).
  • DELETE: Remove the file from the workspace.
  • MODIFY: Remove the old file and create the new one from cache.
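A minimal sketch of the apply step, under several assumptions: changes are plain `(type, relative_path, new_hash)` tuples, the cache is a directory holding one file per hash, and files are copied rather than linked. The function name `apply_changes` and the `onerror` signature are hypothetical. Failures are collected instead of aborting the loop.

```python
import os
import shutil
import tempfile


def apply_changes(changes, workspace, cache_dir, onerror=None):
    """Apply add/delete/modify changes; return the paths that failed."""
    failed = []
    for typ, relpath, new_hash in changes:
        dest = os.path.join(workspace, relpath)
        try:
            if typ == "delete":
                if os.path.lexists(dest):
                    os.remove(dest)
            else:
                # ADD and MODIFY: materialize from cache into a temporary
                # name, then swap it in so the new copy replaces any old file.
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                tmp = dest + ".tmp"
                shutil.copyfile(os.path.join(cache_dir, new_hash), tmp)
                os.replace(tmp, dest)
        except OSError as exc:
            failed.append(relpath)
            if onerror is not None:
                onerror(relpath, exc)
    return failed


with tempfile.TemporaryDirectory() as root:
    cache_dir = os.path.join(root, "cache")
    workspace = os.path.join(root, "ws")
    os.makedirs(cache_dir)
    os.makedirs(workspace)
    with open(os.path.join(cache_dir, "h1"), "w") as f:
        f.write("one")
    changes = [
        ("add", os.path.join("data", "a.txt"), "h1"),
        ("add", os.path.join("data", "missing.txt"), "not-in-cache"),
    ]
    failed = apply_changes(changes, workspace, cache_dir)
    print(len(failed))  # 1
```

Copying to a temporary name first means a MODIFY whose cache object is missing leaves the old file in place rather than deleting it and then failing.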

Key design properties:

  • Minimal changes: Only files that differ between workspace and expected state are touched.
  • Safety by default: Files that exist in the workspace but not in cache cannot be deleted without --force, preventing accidental data loss.
  • Failure tolerance: Individual file failures are tracked and reported without aborting the entire operation.
  • Link tracking: Post-checkout, links are registered in the state database for future cleanup.
