
Implementation:Iterative Dvc DataCloud Status

From Leeroopedia


Knowledge Sources

  • Domains: Data_Synchronization, State_Management
  • Last Updated: 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the DVC library, for comparing local cache state against remote storage and classifying each tracked object as ok, missing, new, or deleted.

Description

The DataCloud.status method is a wrapper around the lower-level dvc_data.hashfile.status.compare_status function. It partitions incoming hash objects by their hash algorithm (legacy md5-dos2unix vs. the current default), resolves the appropriate local cache and remote object database for each partition, and delegates the actual comparison to the dvc-data library. The results from both partitions are merged into a single CompareStatusResult named tuple.
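The partition-and-merge flow described above can be sketched in a few lines. The `CompareStatusResult` field names below follow the description on this page, but the partition and merge helpers are illustrative stand-ins rather than DVC internals, and hash objects are modeled as plain `(hash_name, digest)` tuples:

```python
from collections import namedtuple

# Field names as described above; the helpers are a hedged sketch, not DVC's code.
CompareStatusResult = namedtuple("CompareStatusResult", ["ok", "missing", "new", "deleted"])

def partition_by_hash_name(objs, legacy_name="md5-dos2unix"):
    """Split hash objects into (legacy, current) groups by hash algorithm."""
    legacy, current = set(), set()
    for obj in objs:
        (legacy if obj[0] == legacy_name else current).add(obj)
    return legacy, current

def merge_results(a, b):
    """Union the four status sets from two partitions into one result."""
    return CompareStatusResult(*(x | y for x, y in zip(a, b)))
```

Each partition is compared against its matching object database, and the per-partition results are unioned field by field into the single tuple the caller sees.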

The companion DataCloud.transfer method (L157-166) provides a thin wrapper around dvc_data.hashfile.transfer.transfer that is used by push and pull after the status comparison determines what needs to move. The status method itself is a read-only diagnostic that does not modify any state.

This implementation lives in dvc/data_cloud.py and is part of the DataCloud class, which centralizes all remote interaction logic. The method is invoked by the dvc status --cloud command to display the synchronization state to users.

Usage

Import and use DataCloud.status when you need to determine the synchronization state between local and remote without performing any transfers. This is useful for dry-run checks, monitoring scripts, and pre-transfer validation. The method is also used internally to generate warning messages about missing cache files during push and pull operations.
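For a monitoring script or pre-transfer gate, the four result sets map directly onto a pass/fail decision. A minimal sketch, assuming `result` is the value returned by `DataCloud.status`; the `check_sync` helper and its messages are hypothetical, not part of DVC:

```python
def check_sync(result):
    """Return (message, exit_code): 0 if fully synchronized, 1 otherwise.

    `result` is expected to expose the four sets described on this page:
    ok, missing, new, deleted.
    """
    if result.missing:
        # Absent from both local cache and remote: unrecoverable via transfer.
        return f"{len(result.missing)} object(s) missing from both sides", 1
    pending = len(result.new) + len(result.deleted)
    if pending:
        # Out of sync, but a push/pull would reconcile the two sides.
        return f"{pending} object(s) pending transfer", 1
    return f"{len(result.ok)} object(s) synchronized", 0
```

Because status performs no transfers, a gate like this is safe to run in CI before deciding whether a push or pull is needed.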

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/data_cloud.py
  • Lines: L290-328 (status), L330-350 (_status)
  • Supporting method: L157-166 (transfer)

Signature

def status(
    self,
    objs: Iterable["HashInfo"],
    jobs: Optional[int] = None,
    remote: Optional[str] = None,
    odb: Optional["HashFileDB"] = None,
) -> "CompareStatusResult":

Import

from dvc.data_cloud import DataCloud

I/O Contract

Inputs

  • objs (Iterable["HashInfo"], required): Collection of hash identifiers for the tracked data objects whose status will be checked.
  • jobs (Optional[int]): Number of parallel jobs for remote existence checks; defaults to the configured value.
  • remote (Optional[str]): Name of the remote to compare against; if None, the default remote is used.
  • odb (Optional["HashFileDB"]): Object database to compare against directly, bypassing remote resolution.

Outputs

  • return (CompareStatusResult): A named tuple with four set attributes, each containing HashInfo objects: ok (present both locally and remotely), missing (absent from both), new (present locally only; push candidates), and deleted (present remotely only; fetch candidates).
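The four-way classification can be reproduced with plain set algebra over presence in the local cache versus the remote. The `classify` function below is illustrative, not DVC's implementation:

```python
def classify(tracked, in_local, in_remote):
    """Bucket tracked hashes by where they are present.

    tracked: all hashes being checked; in_local / in_remote: the subsets
    found in the local cache / remote storage, respectively.
    """
    ok = tracked & in_local & in_remote          # present on both sides
    new = (tracked & in_local) - in_remote       # push candidates
    deleted = (tracked & in_remote) - in_local   # fetch candidates
    missing = tracked - in_local - in_remote     # absent everywhere
    return {"ok": ok, "new": new, "deleted": deleted, "missing": missing}
```

Note that the buckets are disjoint and their union is exactly the tracked set, which is why the result can drive both a push plan and a pull plan from one comparison.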

Usage Examples

Basic Usage

from dvc.repo import Repo

repo = Repo()

# Collect all tracked hash objects from the index
used = repo.index.used_objs()
all_hashes = set()
for odb, objs in used.items():
    all_hashes.update(objs)

# Compare local vs. remote state
result = repo.cloud.status(all_hashes, remote="myremote")

print(f"Synchronized: {len(result.ok)}")
print(f"Push needed:  {len(result.new)}")
print(f"Pull needed:  {len(result.deleted)}")
print(f"Missing:      {len(result.missing)}")

Checking Status for a Specific Remote

from dvc.repo import Repo

repo = Repo()

# Get the hash objects used by specific targets
used = repo.index.used_objs(targets=["data/processed/"])
all_hashes = set()
for odb, objs in used.items():
    all_hashes.update(objs)

# Check against a named remote with parallel jobs
result = repo.cloud.status(all_hashes, remote="s3remote", jobs=8)

if result.missing:
    for hi in result.missing:
        print(f"WARNING: Missing from both local and remote: {hi}")

Related Pages

Implements Principle

Requires Environment
