Principle: Iterative DVC Data Index Update
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, Index_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Data index update is the process of merging remote storage metadata -- such as version IDs and checksums -- back into local data index entries after push operations, ensuring that subsequent pull and status operations reference the correct remote objects.
Description
When data is pushed to a version-aware remote (such as an S3 bucket with versioning enabled), the remote assigns new metadata to each uploaded object: a unique version ID, an ETag (checksum), and other provider-specific attributes. If this metadata is not captured and recorded back into the local DVC metafiles, subsequent operations cannot correctly identify the remote objects. The data index update principle addresses this by defining a systematic process for propagating remote metadata back into the local index after a successful push.
The update operates at the output level. For each DVC output that was pushed, the system rebuilds the remote data index to capture the current metadata (including version IDs assigned by the cloud provider), then merges this metadata into the output's recorded state. For directory outputs, this involves iterating over every file entry within the directory tree, matching each local file to its remote counterpart, and preserving existing version IDs for files that were not modified (to minimize merge conflicts in the .dvc or dvc.lock files).
This principle is critical for worktree remotes -- remotes where files are stored in their original tree structure with cloud-native versioning, rather than in a content-addressed flat structure. In worktree mode, the version ID is the primary mechanism for identifying the correct version of a file on the remote, making metadata propagation essential for correctness.
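The contrast between the two layouts can be sketched as follows (a simplified illustration; the content-addressed path scheme mirrors DVC's cache layout, but treat the exact prefixes as assumptions):

```python
import hashlib

def cache_path(data: bytes) -> str:
    """Content-addressed layout: the path is derived from the content hash,
    so the path alone uniquely identifies the object."""
    md5 = hashlib.md5(data).hexdigest()
    return f"files/md5/{md5[:2]}/{md5[2:]}"

def worktree_path(relpath: str) -> str:
    """Worktree layout: the original tree structure is mirrored on the remote.
    The path stays the same across revisions, so only the cloud-assigned
    version ID distinguishes one version of a file from another."""
    return relpath

print(cache_path(b"hello"))                 # files/md5/5d/41402abc4b2a76b9719d911017c592
print(worktree_path("data/raw/file.txt"))   # data/raw/file.txt
```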
After updating all affected outputs, the modified metadata is dumped (persisted) back to the DVC stage files (dvc.yaml/dvc.lock or .dvc files). This ensures that the next commit captures the remote metadata, enabling other team members to pull the exact versions that were pushed.
Usage
Use data index update when:
- Pushing data to version-aware cloud storage where version IDs must be recorded for future pulls.
- Working with worktree remotes that store data in tree structure with cloud-native versioning.
- Maintaining bidirectional sync correctness where the push metadata must be available for subsequent pull operations.
- Building custom push workflows that interact with version-aware storage backends.
- Ensuring that DVC metafiles contain the latest remote version information after a push.
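A typical workflow looks like the following sketch. It assumes your DVC release exposes the cloud-versioning remote options (`version_aware` and the experimental `worktree` flag); verify the exact flag names against the documentation for your version.

```shell
# Configure an S3 remote that relies on bucket versioning
dvc remote add -d storage s3://my-bucket/data
dvc remote modify storage version_aware true   # record cloud version IDs
# or, for a worktree remote that mirrors the tree structure:
# dvc remote modify storage worktree true

# Push; on success the new version IDs are written back into the
# .dvc / dvc.lock metafiles, which should then be committed
dvc push
git add . && git commit -m "Push data; record remote version IDs"
```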
Theoretical Basis
The update follows a push-then-reconcile pattern:
UPDATE_META(index, targets):
    stages_to_dump = set()
    for (remote_name, filtered_view) in worktree_view_by_remotes(index):
        remote = get_remote(remote_name)
        # Skip non-version-aware remotes (they use content-addressed hashes)
        if not remote.fs.version_aware:
            continue
        # Rebuild the remote index to capture post-push metadata
        new_remote_index = rebuild(
            filtered_view.data["repo"],
            remote.path,
            remote.fs
        )
        # Merge the new remote metadata back into each output
        for out in filtered_view.outs:
            merge_push_meta(out, new_remote_index, remote.name)
            stages_to_dump.add(out.stage)
    # Persist updated metadata to DVC stage files
    for stage in stages_to_dump:
        stage.dump(with_files=True, update_pipeline=False)
The merge_push_meta function handles the per-output reconciliation:
MERGE_PUSH_META(out, index, remote_name):
    entry = index.get(out.key)
    if entry is None:
        return  # output not found in remote index
    if out.is_directory:
        old_tree = out.get_obj()
        # For each file in the directory
        for subkey, sub_entry in index.iterate(out.key):
            if sub_entry.is_directory:
                continue
            # Look up the existing version in the old tree
            relpath = subkey relative to out.key
            old_meta, hash_info = old_tree.get(relpath)
            sub_entry.hash_info = hash_info
            # Preserve existing version IDs for unchanged files
            if old_meta is not None and old_meta.version_id is not None:
                sub_entry.meta = old_meta
            sub_entry.meta.remote = remote_name
        # Rebuild the tree hash from the updated index
        tree_meta, new_tree = build_tree(index, out.key)
        out.obj = new_tree
        out.hash_info = new_tree.hash_info
        out.meta = tree_meta
    else:
        # For single files, directly adopt the remote metadata
        if entry.hash_info:
            out.hash_info = entry.hash_info
        if out.meta is None or out.meta.version_id is None:
            out.meta = entry.meta
        out.meta.remote = remote_name
Key design properties:
- Version preservation: Unchanged files retain their existing version IDs, reducing noise in metafile diffs.
- Remote tagging: Each output's metadata is tagged with the remote name, enabling multi-remote workflows.
- Atomic persistence: All metadata changes are dumped to stage files together, ensuring consistency.
- Selective application: Only version-aware remotes trigger metadata updates; traditional content-addressed remotes skip this step entirely.