Principle:Iterative Dvc Local Remote State Comparison
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, State_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Local-remote state comparison is the technique of comparing a local content-addressed cache with a remote storage backend, hash by hash, to categorize every tracked object as ok (present on both sides), missing, new, or deleted.
Description
Before transferring data between a local cache and a remote object store, the system must determine the current state of every tracked object on both sides. This avoids redundant transfers, identifies missing data that cannot be recovered, and provides users with an accurate picture of their synchronization status. The comparison operates entirely on hash identifiers (content-addressed names), not file paths or timestamps, making it robust against renames, moves, and clock drift.
The comparison produces four disjoint sets that together partition the universe of tracked objects:
- ok -- present in both local cache and remote storage; no transfer needed.
- missing -- absent from both local and remote; the data is lost and cannot be recovered through synchronization alone.
- new -- present in local cache but absent from remote; candidates for upload (push).
- deleted -- present in remote storage but absent from local cache; candidates for download (fetch/pull).
This four-way classification is the foundation for all bidirectional sync decisions. A push operation transfers the "new" set to the remote. A fetch/pull operation transfers the "deleted" set (from the local perspective, these are objects that exist remotely but not locally). The "missing" set triggers warnings because the data is unrecoverable from either location. The "ok" set is simply skipped.
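The mapping from the four sets to sync actions can be sketched as follows. This is a minimal illustration, not the tool's actual API: `CompareStatusResult` here is a stand-in dataclass, and `plan_transfers` and the hash values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompareStatusResult:
    # Stand-in for the four-way result described above (hypothetical shape).
    ok: frozenset = frozenset()
    missing: frozenset = frozenset()
    new: frozenset = frozenset()
    deleted: frozenset = frozenset()


def plan_transfers(status):
    """Translate a status result into the actions each set implies."""
    return {
        "push": sorted(status.new),       # local only: upload
        "fetch": sorted(status.deleted),  # remote only: download
        "warn": sorted(status.missing),   # in neither place: report
        # status.ok needs no action and is skipped
    }


status = CompareStatusResult(
    ok=frozenset({"a1"}),
    missing=frozenset({"d4"}),
    new=frozenset({"b2"}),
    deleted=frozenset({"c3"}),
)
plan = plan_transfers(status)
```

With the sample data, only `b2` is queued for push, only `c3` for fetch, and `d4` is reported as unrecoverable.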
The comparison also handles legacy hash migration transparently. Objects using the older md5-dos2unix hash algorithm are separated from those using the current default algorithm, and each group is compared against the appropriate object database on the remote side.
Usage
Use local-remote state comparison when:
- Running a pre-transfer status check to display what would be pushed or pulled without actually transferring.
- Validating that all tracked data is present in at least one location (local or remote).
- Diagnosing data loss by identifying objects in the "missing" set.
- Optimizing transfer operations by computing the minimal set of objects that need to move.
- Building monitoring dashboards that track synchronization completeness across repositories.
Theoretical Basis
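The comparison reduces to set operations (intersection, difference, and union) on content hashes. The symmetric difference of the local and remote sets is exactly the union of "new" and "deleted"; the remaining two sets come from the intersection and from the complement of the union: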
    COMPARE_STATUS(local_cache, remote_odb, tracked_hashes):
        local_set  = { h for h in tracked_hashes if exists(local_cache, h) }
        remote_set = { h for h in tracked_hashes if exists(remote_odb, h) }
        ok      = local_set INTERSECT remote_set
        new     = local_set MINUS remote_set      # push candidates
        deleted = remote_set MINUS local_set      # fetch candidates
        missing = tracked_hashes MINUS (local_set UNION remote_set)
        return CompareStatusResult(ok, missing, new, deleted)
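The pseudocode above translates almost directly into Python set operations. The sketch below assumes both stores can be represented as in-memory sets of hash strings; the function signature and the sample hashes are illustrative, not taken from any real library.

```python
from typing import Iterable, NamedTuple, Set


class CompareStatusResult(NamedTuple):
    ok: Set[str]
    missing: Set[str]
    new: Set[str]
    deleted: Set[str]


def compare_status(local_hashes: Iterable[str],
                   remote_hashes: Iterable[str],
                   tracked_hashes: Iterable[str]) -> CompareStatusResult:
    """Classify every tracked hash into exactly one of four sets."""
    tracked = set(tracked_hashes)
    # Restrict both sides to tracked hashes; untracked objects are ignored.
    local_set = tracked & set(local_hashes)
    remote_set = tracked & set(remote_hashes)
    return CompareStatusResult(
        ok=local_set & remote_set,
        missing=tracked - (local_set | remote_set),
        new=local_set - remote_set,        # push candidates
        deleted=remote_set - local_set,    # fetch candidates
    )


result = compare_status(
    local_hashes={"a1", "b2", "zz"},   # "zz" is present locally but untracked
    remote_hashes={"a1", "c3"},
    tracked_hashes=["a1", "b2", "c3", "d4"],
)
```

Note that the symmetric difference `local_set ^ remote_set` would yield only `new | deleted`; the separate intersection and complement steps are what produce "ok" and "missing".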
For efficiency, the existence checks against the remote are batched. The remote object database maintains an optional index (a cached listing of known hashes) that avoids individual HEAD requests for every object. When available, the index replaces O(n) network round-trips with local set-membership lookups that need no network access.
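The effect of the index can be illustrated with a toy remote that counts simulated network calls. Everything here is an assumption made for the example: `CountingRemote`, its `exists` and `list_all` methods, and the idea that the index is populated by a single listing call are hypothetical and do not describe how a real backend builds or refreshes its index.

```python
class CountingRemote:
    """Hypothetical in-memory remote that counts simulated network calls."""

    def __init__(self, hashes):
        self._hashes = set(hashes)
        self.calls = 0

    def exists(self, h):
        # Stands in for one per-object request (for example a HEAD).
        self.calls += 1
        return h in self._hashes

    def list_all(self):
        # Stands in for one bulk listing used to populate the index.
        self.calls += 1
        return set(self._hashes)


def remote_present(remote, tracked, index=None):
    """Return the tracked hashes that exist on the remote."""
    if index is None:
        # No index: one simulated round-trip per tracked hash.
        return {h for h in tracked if remote.exists(h)}
    # Index available: pure local set intersection, no remote calls.
    return set(tracked) & index


tracked = ["a1", "b2", "c3", "d4"]
remote = CountingRemote({"a1", "c3"})

slow = remote_present(remote, tracked)
calls_without_index = remote.calls

remote.calls = 0
index = remote.list_all()
fast = remote_present(remote, tracked, index)
calls_with_index = remote.calls
```

Both paths return the same set; the difference is that the indexed path costs one listing call in this toy model, where the unindexed path costs one call per tracked hash.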
The algorithm handles hash algorithm partitioning as follows:
    PARTITIONED_STATUS(objs, remote):
        legacy_objs, default_objs = split_by_hash_algorithm(objs)
        result = empty CompareStatusResult
        if legacy_objs:
            result += compare_status(legacy_cache, remote.legacy_odb, legacy_objs)
        if default_objs:
            result += compare_status(default_cache, remote.odb, default_objs)
        return result
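A runnable sketch of the partitioned comparison follows. It assumes objects are represented as `(algorithm, hash)` pairs and that each partition has its own pair of in-memory local and remote sets; the `stores` mapping, the `"default"` algorithm label, and the sample hashes are placeholders introduced for illustration.

```python
LEGACY = "md5-dos2unix"
CATEGORIES = ("ok", "missing", "new", "deleted")


def compare_status(local, remote, tracked):
    """Four-way classification of tracked hashes against two sets."""
    tracked = set(tracked)
    in_local, in_remote = tracked & local, tracked & remote
    return {
        "ok": in_local & in_remote,
        "missing": tracked - (in_local | in_remote),
        "new": in_local - in_remote,
        "deleted": in_remote - in_local,
    }


def partitioned_status(objs, stores):
    """Compare each hash-algorithm group against its own stores, then merge.

    objs:   iterable of (algorithm, hash) pairs
    stores: {"legacy": (local_set, remote_set), "default": (local_set, remote_set)}
    """
    objs = list(objs)  # allow generators; the pairs are scanned twice
    groups = {
        "legacy": {h for algo, h in objs if algo == LEGACY},
        "default": {h for algo, h in objs if algo != LEGACY},
    }
    merged = {name: set() for name in CATEGORIES}
    for partition, hashes in groups.items():
        if not hashes:
            continue  # skip empty partitions, as in the pseudocode
        local, remote = stores[partition]
        for name, found in compare_status(local, remote, hashes).items():
            merged[name] |= found
    return merged


objs = [(LEGACY, "L1"), (LEGACY, "L2"), ("default", "D1"), ("default", "D2")]
stores = {
    "legacy": ({"L1"}, {"L1", "L2"}),
    "default": ({"D1"}, set()),
}
status = partitioned_status(objs, stores)
```

Because each group is checked only against its own stores, a legacy hash is never looked up in the default object database, and the merged result is a simple union per category.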
Key properties:
- Commutativity: The classification is symmetric -- swapping local and remote swaps "new" and "deleted" and leaves "ok" and "missing" unchanged.
- Idempotency: The comparison is read-only, so running it repeatedly, with no intervening transfers or other changes to either store, always produces the same result.
- Completeness: Every tracked hash appears in exactly one of the four result sets.
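These properties can be checked mechanically on a small example. The snippet below is a sketch that models both stores as plain Python sets; `classify` and the sample hashes are hypothetical names used only for this check.

```python
def classify(local, remote, tracked):
    """Four-way classification over in-memory sets of hashes."""
    in_local, in_remote = tracked & local, tracked & remote
    return {
        "ok": in_local & in_remote,
        "missing": tracked - (in_local | in_remote),
        "new": in_local - in_remote,
        "deleted": in_remote - in_local,
    }


tracked = {"a1", "b2", "c3", "d4", "e5"}
local = {"a1", "b2", "e5"}
remote = {"a1", "c3", "e5"}

first = classify(local, remote, tracked)
second = classify(local, remote, tracked)   # repeated run, no changes between
swapped = classify(remote, local, tracked)  # roles of local and remote exchanged

# Completeness: the four sets cover tracked, and their sizes sum to
# len(tracked), so no hash is counted in two sets.
covers_all = set().union(*first.values()) == tracked
no_overlap = sum(len(s) for s in first.values()) == len(tracked)
```

Comparing `first` with `second` checks idempotency, and comparing `first` with `swapped` checks that only "new" and "deleted" trade places.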