Implementation: DVC DataCloud Status
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, State_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete tool, provided by the DVC library, for comparing local cache state against remote storage and classifying tracked objects as ok, missing, new, or deleted.
Description
The DataCloud.status method is a wrapper around the lower-level dvc_data.hashfile.status.compare_status function. It partitions incoming hash objects by their hash algorithm (legacy md5-dos2unix vs. the current default), resolves the appropriate local cache and remote object database for each partition, and delegates the actual comparison to the dvc-data library. The results from both partitions are merged into a single CompareStatusResult named tuple.
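At its core, the comparison reduces to set-membership checks against the local and remote object stores. The sketch below is an illustrative, dependency-free reduction of that logic, not the actual dvc-data implementation (which also handles hash partitioning, index state, and directory objects); the `CompareStatusResult` class here is a stand-in that mirrors the four-set named tuple described above.

```python
from typing import NamedTuple, Set

class CompareStatusResult(NamedTuple):
    # Illustrative stand-in mirroring the four-set result described above;
    # not the dvc-data class itself.
    ok: Set[str]
    missing: Set[str]
    new: Set[str]
    deleted: Set[str]

def compare(objs: Set[str], local: Set[str], remote: Set[str]) -> CompareStatusResult:
    """Classify each requested hash by where it currently exists."""
    return CompareStatusResult(
        ok=objs & local & remote,        # present in both stores
        missing=objs - local - remote,   # absent from both
        new=(objs & local) - remote,     # local only: push candidates
        deleted=(objs & remote) - local, # remote only: fetch candidates
    )
```

For example, `compare({"a", "b", "c", "d"}, local={"a", "b"}, remote={"a", "c"})` classifies "a" as ok, "b" as new, "c" as deleted, and "d" as missing. The four sets are disjoint and together cover every requested object.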
The companion DataCloud.transfer method (L157-166) provides a thin wrapper around dvc_data.hashfile.transfer.transfer that is used by push and pull after the status comparison determines what needs to move. The status method itself is a read-only diagnostic that does not modify any state.
This implementation lives in dvc/data_cloud.py and is part of the DataCloud class, which centralizes all remote interaction logic. The method is invoked by the dvc status --cloud command to display the synchronization state to users.
Usage
Import and use DataCloud.status when you need to determine the synchronization state between local and remote without performing any transfers. This is useful for dry-run checks, monitoring scripts, and pre-transfer validation. The method is also used internally to generate warning messages about missing cache files during push and pull operations.
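For monitoring scripts and dry-run checks, the returned sets can be summarized without any further DVC calls. The helpers below are a hypothetical sketch (the names `in_sync` and `summarize` are not part of the DVC API); they work with any object exposing the four result set attributes, so a lightweight stand-in class is used here to keep the example self-contained.

```python
from typing import NamedTuple, Set

class StatusSets(NamedTuple):
    # Stand-in for CompareStatusResult: any object with these four
    # set attributes works. Hypothetical helper, not part of DVC.
    ok: Set[str]
    missing: Set[str]
    new: Set[str]
    deleted: Set[str]

def in_sync(status: StatusSets) -> bool:
    """True when nothing needs pushing or pulling and no data is lost."""
    return not (status.new or status.deleted or status.missing)

def summarize(status: StatusSets) -> str:
    """One-line summary suitable for a monitoring script's log output."""
    return (
        f"ok={len(status.ok)} new={len(status.new)} "
        f"deleted={len(status.deleted)} missing={len(status.missing)}"
    )
```

A monitoring job might call `repo.cloud.status(...)`, pass the result to `in_sync`, and alert when it returns False; because status is read-only, such checks are safe to run on a schedule.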
Code Reference
Source Location
- Repository: DVC
- File: dvc/data_cloud.py
- Lines: L290-328 (status), L330-350 (_status)
- Supporting method: L157-166 (transfer)
Signature
def status(
self,
objs: Iterable["HashInfo"],
jobs: Optional[int] = None,
remote: Optional[str] = None,
odb: Optional["HashFileDB"] = None,
) -> "CompareStatusResult":
Import
from dvc.data_cloud import DataCloud
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| objs | Iterable[HashInfo] | Yes | Collection of hash identifiers for tracked data objects to check status for. |
| jobs | Optional[int] | No | Number of parallel jobs for remote existence checks. If None, the default from the remote/repo configuration is used. |
| remote | Optional[str] | No | Name of the remote to compare against. If None, the default remote is used. |
| odb | Optional[HashFileDB] | No | Optional object database to compare against directly, bypassing remote resolution. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | CompareStatusResult | A named tuple with four set attributes: ok (present in both local and remote), missing (absent from both), new (present locally only, push candidates), deleted (present remotely only, fetch candidates). Each set contains HashInfo objects. |
Usage Examples
Basic Usage
from dvc.repo import Repo
repo = Repo()
# Collect all tracked hash objects from the index
used = repo.index.used_objs()
all_hashes = set()
for odb, objs in used.items():
all_hashes.update(objs)
# Compare local vs. remote state
result = repo.cloud.status(all_hashes, remote="myremote")
print(f"Synchronized: {len(result.ok)}")
print(f"Push needed: {len(result.new)}")
print(f"Pull needed: {len(result.deleted)}")
print(f"Missing: {len(result.missing)}")
Checking Status for a Specific Remote
from dvc.repo import Repo
repo = Repo()
# Get the hash objects used by specific targets
used = repo.index.used_objs(targets=["data/processed/"])
all_hashes = set()
for odb, objs in used.items():
all_hashes.update(objs)
# Check against a named remote with parallel jobs
result = repo.cloud.status(all_hashes, remote="s3remote", jobs=8)
if result.missing:
for hi in result.missing:
print(f"WARNING: Missing from both local and remote: {hi}")