Implementation: DVC DataCloud Status
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, State_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete tool, provided by the DVC library, for comparing local cache state against remote storage and classifying tracked objects as ok, missing, new, or deleted.
Description
The DataCloud.status method is a wrapper around the lower-level dvc_data.hashfile.status.compare_status function. It partitions incoming hash objects by their hash algorithm (legacy md5-dos2unix vs. the current default), resolves the appropriate local cache and remote object database for each partition, and delegates the actual comparison to the dvc-data library. The results from both partitions are merged into a single CompareStatusResult named tuple.
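At its core, the comparison reduces to set-membership checks against the local and remote object stores. The sketch below is an illustrative, dependency-free reduction of that logic, not the actual dvc-data implementation (which also handles hash partitioning, index state, and directory objects); the `CompareStatusResult` class here is a stand-in that mirrors the four-set named tuple described above.

```python
from typing import NamedTuple, Set

class CompareStatusResult(NamedTuple):
    # Illustrative stand-in mirroring the four-set result described above;
    # not the dvc-data class itself.
    ok: Set[str]
    missing: Set[str]
    new: Set[str]
    deleted: Set[str]

def compare(objs: Set[str], local: Set[str], remote: Set[str]) -> CompareStatusResult:
    """Classify each requested hash by where it currently exists."""
    return CompareStatusResult(
        ok=objs & local & remote,        # present in both stores
        missing=objs - local - remote,   # absent from both
        new=(objs & local) - remote,     # local only: push candidates
        deleted=(objs & remote) - local, # remote only: fetch candidates
    )
```

For example, `compare({"a", "b", "c", "d"}, local={"a", "b"}, remote={"a", "c"})` classifies "a" as ok, "b" as new, "c" as deleted, and "d" as missing. The four sets are disjoint and together cover every requested object.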
The companion DataCloud.transfer method (L157-166) provides a thin wrapper around dvc_data.hashfile.transfer.transfer that is used by push and pull after the status comparison determines what needs to move. The status method itself is a read-only diagnostic that does not modify any state.
This implementation lives in dvc/data_cloud.py and is part of the DataCloud class, which centralizes all remote interaction logic. The method is invoked by the dvc status --cloud command to display the synchronization state to users.
Usage
Import and use DataCloud.status when you need to determine the synchronization state between local and remote without performing any transfers. This is useful for dry-run checks, monitoring scripts, and pre-transfer validation. The method is also used internally to generate warning messages about missing cache files during push and pull operations.
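For monitoring scripts and dry-run checks, the returned sets can be summarized without any further DVC calls. The helpers below are a hypothetical sketch (the names `in_sync` and `summarize` are not part of the DVC API); they work with any object exposing the four result set attributes, so a lightweight stand-in class is used here to keep the example self-contained.

```python
from typing import NamedTuple, Set

class StatusSets(NamedTuple):
    # Stand-in for CompareStatusResult: any object with these four
    # set attributes works. Hypothetical helper, not part of DVC.
    ok: Set[str]
    missing: Set[str]
    new: Set[str]
    deleted: Set[str]

def in_sync(status: StatusSets) -> bool:
    """True when nothing needs pushing or pulling and no data is lost."""
    return not (status.new or status.deleted or status.missing)

def summarize(status: StatusSets) -> str:
    """One-line summary suitable for a monitoring script's log output."""
    return (
        f"ok={len(status.ok)} new={len(status.new)} "
        f"deleted={len(status.deleted)} missing={len(status.missing)}"
    )
```

A monitoring job might call `repo.cloud.status(...)`, pass the result to `in_sync`, and alert when it returns False; because status is read-only, such checks are safe to run on a schedule.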
Code Reference
Source Location
- Repository: DVC
- File: dvc/data_cloud.py
- Lines: L290-328 (status), L330-350 (_status)
- Supporting method: L157-166 (transfer)
Signature
def status(
self,
objs: Iterable["HashInfo"],
jobs: Optional[int] = None,
remote: Optional[str] = None,
odb: Optional["HashFileDB"] = None,
) -> "CompareStatusResult":
Import
from dvc.data_cloud import DataCloud
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| objs | Iterable[HashInfo] | Yes | Collection of hash identifiers for tracked data objects to check status for. |
| jobs | Optional[int] | No | Number of parallel jobs for remote existence checks. If None, the default from the remote/repo configuration is used. |
| remote | Optional[str] | No | Name of the remote to compare against. If None, the default remote is used. |
| odb | Optional[HashFileDB] | No | Optional object database to compare against directly, bypassing remote resolution. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | CompareStatusResult | A named tuple with four set attributes: ok (present in both local and remote), missing (absent from both), new (present locally only, push candidates), deleted (present remotely only, fetch candidates). Each set contains HashInfo objects. |
Usage Examples
Basic Usage
from dvc.repo import Repo
repo = Repo()
# Collect all tracked hash objects from the index
used = repo.index.used_objs()
all_hashes = set()
for odb, objs in used.items():
all_hashes.update(objs)
# Compare local vs. remote state
result = repo.cloud.status(all_hashes, remote="myremote")
print(f"Synchronized: {len(result.ok)}")
print(f"Push needed: {len(result.new)}")
print(f"Pull needed: {len(result.deleted)}")
print(f"Missing: {len(result.missing)}")
Checking Status for a Specific Remote
from dvc.repo import Repo
repo = Repo()
# Get the hash objects used by specific targets
used = repo.index.used_objs(targets=["data/processed/"])
all_hashes = set()
for odb, objs in used.items():
all_hashes.update(objs)
# Check against a named remote with parallel jobs
result = repo.cloud.status(all_hashes, remote="s3remote", jobs=8)
if result.missing:
for hi in result.missing:
print(f"WARNING: Missing from both local and remote: {hi}")