Principle:Iterative Dvc Local Remote State Comparison

Knowledge Sources	DVC Documentation
Domains	Data_Synchronization, State_Management
Last Updated	2026-02-10 00:00 GMT

Overview

Local-remote state comparison is the technique of comparing a local content-addressed cache against a remote storage backend to classify every tracked object as ok (present on both sides), missing, new, or deleted.

Description

Before transferring data between a local cache and a remote object store, the system must determine the current state of every tracked object on both sides. This avoids redundant transfers, identifies missing data that cannot be recovered, and provides users with an accurate picture of their synchronization status. The comparison operates entirely on hash identifiers (content-addressed names), not file paths or timestamps, making it robust against renames, moves, and clock drift.

The comparison produces four disjoint sets that together partition the universe of tracked objects:

ok -- present in both local cache and remote storage; no transfer needed.
missing -- absent from both local and remote; the data is lost and cannot be recovered through synchronization alone.
new -- present in local cache but absent from remote; candidates for upload (push).
deleted -- present in remote storage but absent from local cache; candidates for download (fetch/pull).

This four-way classification is the foundation for all bidirectional sync decisions. A push operation transfers the "new" set to the remote. A fetch/pull operation transfers the "deleted" set (from the local perspective, these are objects that exist remotely but not locally). The "missing" set triggers warnings because the data is unrecoverable from either location. The "ok" set is simply skipped.
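As an illustration of how the four sets drive a transfer, the following minimal sketch selects the objects a push or pull would move and logs a warning for the "missing" set. The function name `plan_transfer`, the dictionary layout, and the sample hashes are hypothetical and are not taken from DVC.

```python
import logging

logger = logging.getLogger("sync_sketch")  # hypothetical logger name


def plan_transfer(status, direction):
    """Return the hashes a push or pull would move (illustrative only)."""
    if direction not in ("push", "pull"):
        raise ValueError(f"unknown direction: {direction!r}")
    if status["missing"]:
        # Objects absent on both sides cannot be restored by syncing.
        logger.warning(
            "%d object(s) missing from both local cache and remote",
            len(status["missing"]),
        )
    # Push uploads "new"; pull downloads "deleted"; "ok" is skipped.
    source = "new" if direction == "push" else "deleted"
    return sorted(status[source])


# Hypothetical classification result.
status = {
    "ok": {"bb22"},
    "missing": {"dd44"},
    "new": {"aa11"},
    "deleted": {"cc33"},
}
to_push = plan_transfer(status, "push")
to_pull = plan_transfer(status, "pull")
```

In this sketch the "ok" set never appears in either plan, which matches the skip behavior described above.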

The comparison also handles legacy hash migration transparently. Objects using the older md5-dos2unix hash algorithm are separated from those using the current default algorithm, and each group is compared against the appropriate object database on the remote side.

Usage

Use local-remote state comparison when:

Running a pre-transfer status check to display what would be pushed or pulled without actually transferring.
Validating that all tracked data is present in at least one location (local or remote).
Diagnosing data loss by identifying objects in the "missing" set.
Optimizing transfer operations by computing the minimal set of objects that need to move.
Building monitoring dashboards that track synchronization completeness across repositories.

Theoretical Basis

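The comparison partitions the tracked hashes using set operations on the hashes found on each side. The "new" and "deleted" sets together form the symmetric difference of the local and remote sets, while "ok" is their intersection and "missing" is what neither side holds: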

COMPARE_STATUS(local_cache, remote_odb, tracked_hashes):
    local_set  = { h for h in tracked_hashes if exists(local_cache, h) }
    remote_set = { h for h in tracked_hashes if exists(remote_odb, h) }

    ok       = local_set  INTERSECT  remote_set
    new      = local_set  MINUS      remote_set     # push candidates
    deleted  = remote_set MINUS      local_set      # fetch candidates
    missing  = tracked_hashes MINUS (local_set UNION remote_set)

    return CompareStatusResult(ok, missing, new, deleted)
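The pseudocode above can be sketched in Python with built-in sets. This is a minimal illustration, not DVC's implementation: it assumes the hashes present on each side are already available as in-memory collections, and the sample hash strings are made up.

```python
from typing import NamedTuple


class CompareStatusResult(NamedTuple):
    ok: frozenset
    missing: frozenset
    new: frozenset
    deleted: frozenset


def compare_status(local_hashes, remote_hashes, tracked_hashes):
    """Classify each tracked hash by where it is present."""
    tracked = frozenset(tracked_hashes)
    # Restrict both sides to tracked hashes; untracked objects are ignored.
    local_set = tracked & frozenset(local_hashes)
    remote_set = tracked & frozenset(remote_hashes)
    return CompareStatusResult(
        ok=local_set & remote_set,
        missing=tracked - (local_set | remote_set),
        new=local_set - remote_set,        # push candidates
        deleted=remote_set - local_set,    # fetch candidates
    )


result = compare_status(
    local_hashes={"aa11", "bb22"},
    remote_hashes={"bb22", "cc33", "ee55"},  # "ee55" is not tracked
    tracked_hashes={"aa11", "bb22", "cc33", "dd44"},
)
```

Here "bb22" lands in ok, "aa11" in new, "cc33" in deleted, and "dd44" in missing; the untracked "ee55" does not appear in any set.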

For efficiency, the existence checks against the remote are batched. The remote object database maintains an optional index (a cached listing of known hashes) that avoids individual HEAD requests for every object. When the index is available, it replaces one network round-trip per object with local set-membership lookups.
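The effect of the index can be sketched as follows. The helper `remote_existing`, the `head` callback, and the call counter are hypothetical stand-ins for a real remote client, and the sketch assumes the cached index is up to date.

```python
from typing import Callable, Iterable, Optional, Set


def remote_existing(
    tracked: Iterable[str],
    index: Optional[Set[str]],
    head: Callable[[str], bool],
) -> Set[str]:
    """Return the tracked hashes that exist on the remote."""
    tracked_set = set(tracked)
    if index is not None:
        # Index available: a local set intersection, no network calls.
        return tracked_set & index
    # No index: fall back to one existence check per object.
    return {h for h in tracked_set if head(h)}


remote_objects = {"bb22", "cc33"}
calls = []


def fake_head(h: str) -> bool:
    # Stand-in for a per-object network request; records each call.
    calls.append(h)
    return h in remote_objects


tracked = ["aa11", "bb22", "cc33"]
without_index = remote_existing(tracked, None, fake_head)
calls_without_index = len(calls)
with_index = remote_existing(tracked, set(remote_objects), fake_head)
calls_with_index = len(calls) - calls_without_index
```

Both paths return the same set, but only the path without an index invokes the per-object check.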

The algorithm handles hash algorithm partitioning as follows:

PARTITIONED_STATUS(objs, remote):
    legacy_objs, default_objs = split_by_hash_algorithm(objs)
    result = empty CompareStatusResult

    if legacy_objs:
        result += compare_status(legacy_cache, remote.legacy_odb, legacy_objs)
    if default_objs:
        result += compare_status(default_cache, remote.odb, default_objs)

    return result
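A runnable sketch of the partitioning step is shown below. It assumes each object is represented as an (algorithm name, hash) pair and that each side exposes one hash set per group; the `merge` method, the dictionary keys "legacy" and "default", and the sample hashes are illustrative choices, not DVC's API.

```python
from typing import NamedTuple

LEGACY = "md5-dos2unix"  # legacy algorithm name, as given in the text


class CompareStatusResult(NamedTuple):
    ok: frozenset = frozenset()
    missing: frozenset = frozenset()
    new: frozenset = frozenset()
    deleted: frozenset = frozenset()

    def merge(self, other):
        # Field-wise union; tuple "+" would concatenate instead.
        return CompareStatusResult(*(a | b for a, b in zip(self, other)))


def compare_status(local, remote, tracked):
    tracked = frozenset(tracked)
    local_set, remote_set = tracked & local, tracked & remote
    return CompareStatusResult(
        ok=local_set & remote_set,
        missing=tracked - (local_set | remote_set),
        new=local_set - remote_set,
        deleted=remote_set - local_set,
    )


def partitioned_status(objs, local, remote):
    legacy = {h for name, h in objs if name == LEGACY}
    default = {h for name, h in objs if name != LEGACY}
    result = CompareStatusResult()
    if legacy:
        result = result.merge(
            compare_status(local["legacy"], remote["legacy"], legacy))
    if default:
        result = result.merge(
            compare_status(local["default"], remote["default"], default))
    return result


objs = [(LEGACY, "l1"), (LEGACY, "l2"), ("md5", "d1"), ("md5", "d2")]
local = {"legacy": {"l1"}, "default": {"d1", "d2"}}
remote = {"legacy": {"l1", "l2"}, "default": {"d2"}}
status = partitioned_status(objs, local, remote)
```

Each group is compared only against its own pair of stores, so a legacy object is never looked up in the default object database.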

Key properties:

Symmetry: Swapping local and remote swaps "new" and "deleted" and leaves "ok" and "missing" unchanged.
Idempotency: The comparison is read-only, so running it repeatedly produces the same result as long as neither the local cache nor the remote changes in between.
Completeness: Every tracked hash appears in exactly one of the four result sets.
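
The three properties can be checked on a small example. The sketch below uses a simplified in-memory `compare_status` with made-up hashes; it demonstrates the properties for one input and is not a proof.

```python
def compare_status(local, remote, tracked):
    tracked = set(tracked)
    local_set = tracked & set(local)
    remote_set = tracked & set(remote)
    return {
        "ok": local_set & remote_set,
        "missing": tracked - (local_set | remote_set),
        "new": local_set - remote_set,
        "deleted": remote_set - local_set,
    }


tracked = {"aa11", "bb22", "cc33", "dd44"}
local = {"aa11", "bb22"}
remote = {"bb22", "cc33"}

first = compare_status(local, remote, tracked)
second = compare_status(local, remote, tracked)
swapped = compare_status(remote, local, tracked)

# Swapping the sides exchanges "new" and "deleted" only.
swap_holds = (
    swapped["new"] == first["deleted"]
    and swapped["deleted"] == first["new"]
    and swapped["ok"] == first["ok"]
    and swapped["missing"] == first["missing"]
)

# Repeating the comparison on unchanged inputs gives the same result.
repeat_holds = first == second

# The four sets cover the tracked hashes with no overlap.
complete = (
    set().union(*first.values()) == tracked
    and sum(len(s) for s in first.values()) == len(tracked)
)
```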

Related Pages

Implemented By

Implementation:Iterative_Dvc_DataCloud_Status
