
Principle:Iterative Dvc Data Index Update

From Leeroopedia


Knowledge Sources
Domains Data_Synchronization, Index_Management
Last Updated 2026-02-10 00:00 GMT

Overview

Data index update is the process of merging remote storage metadata -- such as version IDs and checksums -- back into local data index entries after push operations, ensuring that subsequent pull and status operations reference the correct remote objects.

Description

When data is pushed to a version-aware remote (such as an S3 bucket with versioning enabled), the remote assigns new metadata to each uploaded object: a unique version ID, an ETag (checksum), and other provider-specific attributes. If this metadata is not captured and recorded back into the local DVC metafiles, subsequent operations cannot correctly identify the remote objects. The data index update principle addresses this by defining a systematic process for propagating remote metadata back into the local index after a successful push.

The update operates at the output level. For each DVC output that was pushed, the system rebuilds the remote data index to capture the current metadata (including version IDs assigned by the cloud provider), then merges this metadata into the output's recorded state. For directory outputs, this involves iterating over every file entry within the directory tree, matching each local file to its remote counterpart, and preserving existing version IDs for files that were not modified (to minimize merge conflicts in the .dvc or dvc.lock files).

This principle is critical for worktree remotes -- remotes where files are stored in their original tree structure with cloud-native versioning, rather than in a content-addressed flat structure. In worktree mode, the version ID is the primary mechanism for identifying the correct version of a file on the remote, making metadata propagation essential for correctness.

After updating all affected outputs, the modified metadata is dumped (persisted) back to the DVC stage files (dvc.yaml/dvc.lock or .dvc files). This ensures that the next commit captures the remote metadata, enabling other team members to pull the exact versions that were pushed.
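To make the idea of "propagating remote metadata back" concrete, here is a minimal sketch using plain dictionaries in place of DVC's index entries. All names and field layouts here are illustrative assumptions, not the actual DVC API:

```python
# Illustrative sketch only: plain dicts stand in for DVC's index entries.
def merge_remote_meta(local_entry, remote_entry, remote_name):
    """Copy remote-assigned metadata (version ID, ETag) into the local
    entry so later pull/status operations can locate the right object."""
    merged = dict(local_entry)
    # Only adopt the remote version ID if the local entry has none
    # (i.e. the file was actually re-uploaded by this push).
    if merged.get("version_id") is None:
        merged["version_id"] = remote_entry.get("version_id")
        merged["etag"] = remote_entry.get("etag")
    # Tag the entry with the remote it was pushed to.
    merged["remote"] = remote_name
    return merged

local = {"path": "data/train.csv", "md5": "d41d8cd9", "version_id": None}
remote = {"version_id": "3sL4kqdWnY", "etag": "d41d8cd9"}

print(merge_remote_meta(local, remote, "myremote")["version_id"])
# prints: 3sL4kqdWnY
```

Persisting the merged entry to the stage file is what allows a teammate's later pull to request exactly the pushed object version.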

Usage

Use data index update when:

  • Pushing data to version-aware cloud storage where version IDs must be recorded for future pulls.
  • Working with worktree remotes that store data in a tree structure with cloud-native versioning.
  • Maintaining bidirectional sync correctness where the push metadata must be available for subsequent pull operations.
  • Building custom push workflows that interact with version-aware storage backends.
  • Ensuring that DVC metafiles contain the latest remote version information after a push.

Theoretical Basis

The update follows a push-then-reconcile pattern:

UPDATE_META(index, targets):
    stages_to_dump = set()

    for (remote_name, filtered_view) in worktree_view_by_remotes(index):
        remote = get_remote(remote_name)

        # Skip non-version-aware remotes (they use content-addressed hashes)
        if not remote.fs.version_aware:
            continue

        # Rebuild the remote index to capture post-push metadata
        new_remote_index = rebuild(
            filtered_view.data["repo"],
            remote.path,
            remote.fs
        )

        # Merge the new remote metadata back into each output
        for out in filtered_view.outs:
            merge_push_meta(out, new_remote_index, remote_name)
            stages_to_dump.add(out.stage)

    # Persist updated metadata to DVC stage files
    for stage in stages_to_dump:
        stage.dump(with_files=True, update_pipeline=False)
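The outer loop can be exercised as runnable Python with stub classes standing in for DVC's real objects (all class and function names below are illustrative, not DVC internals):

```python
from dataclasses import dataclass

@dataclass(eq=False)  # eq=False keeps identity-based hashing for set membership
class Stage:
    name: str
    dumped: bool = False

    def dump(self):
        # Stands in for persisting dvc.yaml/dvc.lock or .dvc files.
        self.dumped = True

@dataclass
class Remote:
    name: str
    version_aware: bool

class Output:
    def __init__(self, key, stage):
        self.key = key
        self.stage = stage
        self.remote = None

def update_meta(view_by_remote, remotes):
    """Push-then-reconcile sketch: tag outputs pushed to version-aware
    remotes and dump each affected stage exactly once."""
    stages_to_dump = set()
    for remote_name, outs in view_by_remote.items():
        if not remotes[remote_name].version_aware:
            continue  # content-addressed remotes need no metadata merge
        for out in outs:
            out.remote = remote_name  # stands in for merge_push_meta()
            stages_to_dump.add(out.stage)
    for stage in stages_to_dump:
        stage.dump()
    return stages_to_dump

# One versioned S3-style remote and one plain content-addressed remote.
stage = Stage("prepare")
outs = [Output("data", stage), Output("model.pkl", stage)]
remotes = {"s3ver": Remote("s3ver", True), "plain": Remote("plain", False)}
dumped = update_meta({"s3ver": outs[:1], "plain": outs[1:]}, remotes)
print(stage.dumped, outs[0].remote, outs[1].remote)
# prints: True s3ver None
```

Note that the stage is dumped once even when several of its outputs were updated, and the output pushed to the non-version-aware remote is skipped entirely, matching the selective-application property described below.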

The merge_push_meta function handles the per-output reconciliation:

MERGE_PUSH_META(out, index, remote_name):
    entry = index.get(out.key)
    if entry is None:
        return  # output not found in remote index

    if out.is_directory:
        old_tree = out.get_obj()

        # For each file entry in the directory
        for subkey, file_entry in index.iterate(out.key):
            if file_entry.is_directory:
                continue

            # Look up the existing version in the old tree
            # (subkey is the file's path relative to the output root)
            old_meta, hash_info = old_tree.get(subkey)
            file_entry.hash_info = hash_info

            # Preserve existing version IDs for unchanged files
            if old_meta is not None and old_meta.version_id is not None:
                file_entry.meta = old_meta
            file_entry.meta.remote = remote_name

        # Rebuild the tree hash from the updated index
        tree_meta, new_tree = build_tree(index, out.key)
        out.obj = new_tree
        out.hash_info = new_tree.hash_info
        out.meta = tree_meta
    else:
        # For single files, directly adopt the remote metadata
        if entry.hash_info:
            out.hash_info = entry.hash_info
        if out.meta is None or out.meta.version_id is None:
            out.meta = entry.meta

    out.meta.remote = remote_name
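The version-preservation behavior for directory outputs can be sketched concretely. This is a simplified model using plain dicts, not DVC's classes; the explicit hash comparison is an assumption added here to make the "unchanged file" condition visible:

```python
def merge_dir_entries(old_entries, new_entries, remote_name):
    """Merge remote metadata into directory file entries, keeping the
    existing version ID for any file whose content hash is unchanged."""
    merged = {}
    for path, new in new_entries.items():
        old = old_entries.get(path)
        if (old is not None and old.get("version_id") is not None
                and old["md5"] == new["md5"]):
            entry = dict(old)   # unchanged file: keep old version ID
        else:
            entry = dict(new)   # new/modified file: adopt remote metadata
        entry["remote"] = remote_name
        merged[path] = entry
    return merged

old = {"a.txt": {"md5": "aa11", "version_id": "v1"}}
new = {"a.txt": {"md5": "aa11", "version_id": "v9"},   # re-listed, unchanged
       "b.txt": {"md5": "bb22", "version_id": "v2"}}   # newly pushed
result = merge_dir_entries(old, new, "myremote")
print(result["a.txt"]["version_id"], result["b.txt"]["version_id"])
# prints: v1 v2
```

Keeping `v1` for the unchanged file is what keeps the resulting .dvc or dvc.lock diff limited to files that actually changed.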

Key design properties:

  • Version preservation: Unchanged files retain their existing version IDs, reducing noise in metafile diffs.
  • Remote tagging: Each output's metadata is tagged with the remote name, enabling multi-remote workflows.
  • Atomic persistence: All metadata changes are dumped to stage files together, ensuring consistency.
  • Selective application: Only version-aware remotes trigger metadata updates; traditional content-addressed remotes skip this step entirely.

Related Pages

Implemented By

Page Connections
