Implementation:Iterative Dvc Repo Data
Domains
Data_Management, Version_Control
Overview
Concrete tool for computing data status and diffs across DVC repository revisions. The module dvc/repo/data.py provides the primary interface for determining what data has changed between the index, the working tree, and the HEAD commit. It exposes two core operations: status() for comprehensive repository state and _diff() for computing differences between two BaseDataIndex objects.
Description
The module manages data index operations for a DVC repository. The status() function returns a comprehensive Status TypedDict containing committed changes (HEAD vs index), uncommitted changes (index vs working tree), untracked files, unchanged entries, git metadata, and lists of entries not present in cache or remote storage.
The _diff() function computes differences between two BaseDataIndex objects, categorizing each entry as added, deleted, modified, renamed, unchanged, or unknown. It also optionally checks whether old entries exist in the cache, populating the not_in_cache field.
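The categorization described above can be illustrated with a minimal sketch. It is not the module's implementation: plain `{path: hash}` dicts stand in for BaseDataIndex objects, the function name `diff_indices` is hypothetical, and the `unknown` and `not_in_cache` categories are left out. Rename detection here simply pairs a deleted path with an added path that has the same hash, which is an assumption about how renames might be recognized.

```python
def diff_indices(old, new, with_renames=False):
    """Categorize paths by comparing two {path: content_hash} mappings.

    `old` and `new` are plain dicts used as stand-ins for data indices.
    """
    result = {"added": [], "deleted": [], "modified": [], "renamed": [], "unchanged": []}

    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result["added"].append(path)
        elif path not in new:
            result["deleted"].append(path)
        elif old[path] != new[path]:
            result["modified"].append(path)
        else:
            result["unchanged"].append(path)

    if with_renames:
        # Assumption: a rename is a deleted path whose hash reappears
        # under a newly added path.
        added_by_hash = {}
        for path in result["added"]:
            added_by_hash.setdefault(new[path], []).append(path)
        for old_path in list(result["deleted"]):
            candidates = added_by_hash.get(old[old_path])
            if candidates:
                new_path = candidates.pop(0)
                result["deleted"].remove(old_path)
                result["added"].remove(new_path)
                result["renamed"].append({"old": old_path, "new": new_path})

    return result
```

With with_renames left at False, a moved file shows up as one deleted entry plus one added entry; enabling it folds the pair into a single renamed entry.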
Key TypedDicts:
| TypedDict | Fields | Purpose |
|---|---|---|
| DiffResult | modified, added, deleted, renamed, unchanged, unknown, not_in_cache | Represents the diff between two data indices |
| Status | not_in_cache, not_in_remote, committed, uncommitted, untracked, unchanged, git | Complete repository data status |
| GitInfo | staged, unstaged, untracked, is_empty, is_dirty | Git SCM state information |
| Rename | old, new | Represents a single rename entry within a diff |
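For orientation, the four TypedDicts could be declared roughly as follows. The field names come from the table; the field types for Status follow the Input/Output section, while the types for DiffResult, GitInfo, and Rename, and the use of `total=False`, are assumptions made for this sketch rather than the module's actual definitions.

```python
from typing import Any, TypedDict


class Rename(TypedDict):
    old: str  # assumed: path before the rename
    new: str  # assumed: path after the rename


class DiffResult(TypedDict, total=False):
    # Assumed: every category is a list of paths, except renamed.
    modified: list[str]
    added: list[str]
    deleted: list[str]
    renamed: list[Rename]
    unchanged: list[str]
    unknown: list[str]
    not_in_cache: list[str]


class GitInfo(TypedDict, total=False):
    # The exact shapes of these three fields are not given on this page.
    staged: Any
    unstaged: Any
    untracked: Any
    is_empty: bool
    is_dirty: bool


class Status(TypedDict):
    not_in_cache: list[str]
    not_in_remote: list[str]
    committed: DiffResult
    uncommitted: DiffResult
    untracked: list[str]
    unchanged: list[str]
    git: GitInfo
```

Declaring DiffResult and GitInfo with `total=False` matches the way the usage example reads them with `.get()`, since individual keys may be absent.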
Internal helpers:
- _diff_index_to_wtree() -- builds a workspace data index, then diffs it against the repo index to produce uncommitted changes.
- _diff_head_to_index() -- switches to HEAD (or a specified revision), then diffs the head index against the current index for committed changes.
- _get_entries_not_in_remote() -- checks remote storage for missing entries using bulk existence queries.
- _git_info() -- queries the SCM for staged, unstaged, and untracked files plus empty/dirty flags.
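The idea of a bulk existence query bounded by batch_size can be sketched generically. Nothing below is taken from the module: `find_missing`, `batched`, and the `exists_many` callable (which receives a list of hashes and returns the subset that is present) are hypothetical names used only to show the batching pattern.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def find_missing(hashes, exists_many, batch_size=None):
    """Return the hashes that `exists_many` does not report as present.

    `exists_many` is a hypothetical callable: it takes a list of hashes
    and returns a set containing those that exist in storage.
    """
    hashes = list(hashes)
    if not hashes:
        return []
    # With no batch size given, query everything in a single call.
    size = batch_size or len(hashes)
    missing = []
    for chunk in batched(hashes, size):
        present = exists_many(chunk)
        missing.extend(h for h in chunk if h not in present)
    return missing
```

Grouping lookups this way trades a larger number of small requests for fewer, larger ones, which is usually the motivation for exposing a batch size at all.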
Signature
def status(
    repo: "Repo",
    targets: Optional[Iterable[Union[os.PathLike[str], str]]] = None,
    *,
    granular: bool = False,
    untracked_files: str = "no",
    remote: Optional[str] = None,
    not_in_remote: bool = False,
    remote_refresh: bool = False,
    config: Optional[dict] = None,
    batch_size: Optional[int] = None,
    head: str = "HEAD",
    with_renames: bool = True,
) -> Status:
    ...

def _diff(
    old: "BaseDataIndex",
    new: "BaseDataIndex",
    *,
    filter_keys: Optional[Iterable["DataIndexKey"]] = None,
    granular: bool = False,
    not_in_cache: bool = False,
    batch_size: Optional[int] = None,
    callback: "Callback" = DEFAULT_CALLBACK,
    with_renames: bool = False,
) -> DiffResult:
    ...
Import
from dvc.repo.data import status
Input/Output
| Function | Input | Output |
|---|---|---|
| status() | repo: Repo -- the DVC repository instance; targets -- optional iterable of paths to filter; granular -- if True, report file-level diffs inside directories; untracked_files -- controls untracked file listing ("no", "all"); remote -- remote name for not_in_remote checks; not_in_remote -- enable remote missing check; remote_refresh -- refresh remote index; config -- additional config overrides; batch_size -- batch size for existence checks; head -- revision to compare against (default "HEAD"); with_renames -- detect renames | Status TypedDict with keys: not_in_cache (list[str]), not_in_remote (list[str]), committed (DiffResult), uncommitted (DiffResult), untracked (list[str]), unchanged (list[str]), git (GitInfo) |
| _diff() | old, new -- two BaseDataIndex objects; filter_keys -- optional key filter; granular -- file-level diff; not_in_cache -- check cache existence; batch_size, callback, with_renames | DiffResult TypedDict with categorized path lists |
Example
from dvc.repo import Repo
from dvc.repo.data import status

# Open the repository and compute its data status
with Repo() as repo:
    result = status(
        repo,
        targets=["data/"],
        granular=True,
        untracked_files="all",
        not_in_remote=True,
        remote="myremote",
    )

# Inspect uncommitted changes
for path in result["uncommitted"].get("modified", []):
    print(f"Modified (uncommitted): {path}")

# Inspect committed changes
for path in result["committed"].get("added", []):
    print(f"Added (committed): {path}")

# Check what is missing from remote
for path in result["not_in_remote"]:
    print(f"Not in remote: {path}")

# Git status info
if result["git"].get("is_dirty"):
    print("Working tree has uncommitted Git changes")
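As a follow-up to the example, a result shaped like the Status TypedDict can be condensed into per-category counts. The helper name `summarize_status` is hypothetical and not part of DVC; the sketch only assumes the key layout listed in the Input/Output section and works on any plain dict of that shape.

```python
def summarize_status(result):
    """Count non-empty categories in a Status-shaped dict."""
    summary = {}

    # committed and uncommitted each hold a DiffResult-like mapping
    for section in ("committed", "uncommitted"):
        for change, entries in result.get(section, {}).items():
            if entries:
                summary[f"{section}.{change}"] = len(entries)

    # these top-level keys are flat lists of paths
    for key in ("untracked", "not_in_cache", "not_in_remote"):
        if result.get(key):
            summary[key] = len(result[key])

    return summary
```

Passing the `result` returned by status() should work the same way as a hand-built dict, since the helper only reads keys and list lengths.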