Implementation:Iterative Dvc Repo Data

From Leeroopedia


Domains

Data_Management, Version_Control

Overview

Concrete tool for computing data status and diffs across DVC repository revisions. The module dvc/repo/data.py provides the primary interface for determining what data has changed between the HEAD commit, the index, and the working tree. Its two core operations are status(), which reports comprehensive repository state, and _diff(), a module-private helper that computes differences between two BaseDataIndex objects.

Description

The module manages data index operations for a DVC repository. The status() function returns a comprehensive Status TypedDict containing committed changes (HEAD vs index), uncommitted changes (index vs working tree), untracked files, unchanged entries, git metadata, and lists of entries not present in cache or remote storage.

The _diff() function computes differences between two BaseDataIndex objects, categorizing each entry as added, deleted, modified, renamed, unchanged, or unknown. It also optionally checks whether old entries exist in the cache, populating the not_in_cache field.
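The categorization described above can be illustrated with a small standalone sketch. It is not DVC's implementation: it assumes each index is a plain {path: content_hash} mapping, the name toy_diff is hypothetical, and renames are detected by naively pairing a deleted path with an added path that carries the same hash.

```python
def toy_diff(old: dict[str, str], new: dict[str, str], with_renames: bool = False) -> dict[str, list]:
    """Categorize paths by comparing two {path: content_hash} mappings (illustrative only)."""
    result: dict[str, list] = {
        "added": [], "deleted": [], "modified": [], "renamed": [], "unchanged": [],
    }
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result["added"].append(path)
        elif path not in new:
            result["deleted"].append(path)
        elif old[path] != new[path]:
            result["modified"].append(path)
        else:
            result["unchanged"].append(path)

    if with_renames:
        # Assumption: a rename is a deleted path whose hash reappears under an added path.
        added_by_hash: dict[str, list[str]] = {}
        for path in result["added"]:
            added_by_hash.setdefault(new[path], []).append(path)
        for old_path in list(result["deleted"]):
            candidates = added_by_hash.get(old[old_path])
            if candidates:
                new_path = candidates.pop(0)
                result["deleted"].remove(old_path)
                result["added"].remove(new_path)
                result["renamed"].append({"old": old_path, "new": new_path})
    return result


old_index = {"a.csv": "h1", "b.csv": "h2", "c.csv": "h3"}
new_index = {"a.csv": "h1", "b.csv": "h9", "d.csv": "h3"}
plain = toy_diff(old_index, new_index)
renamed = toy_diff(old_index, new_index, with_renames=True)
```

Without rename detection, c.csv and d.csv are reported as a separate deletion and addition; with it, they collapse into a single entry in the renamed list.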

Key TypedDicts:

TypedDict | Fields | Purpose
DiffResult | modified, added, deleted, renamed, unchanged, unknown, not_in_cache | Represents the diff between two data indices
Status | not_in_cache, not_in_remote, committed, uncommitted, untracked, unchanged, git | Complete repository data status
GitInfo | staged, unstaged, untracked, is_empty, is_dirty | Git SCM state information
Rename | old, new | Represents a single rename entry within a diff
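To make the shapes in the table concrete, here is a sketch that mirrors them as Python TypedDicts. The field names come from the table; the field types are partly assumptions. In particular, the types of GitInfo's staged and unstaged fields are left as Any because the table does not specify them, and total=False is assumed for DiffResult and GitInfo, since keys may be absent in practice. Consult dvc/repo/data.py for the authoritative definitions.

```python
from typing import Any, TypedDict


class Rename(TypedDict):
    old: str
    new: str


class DiffResult(TypedDict, total=False):  # total=False is an assumption
    modified: list[str]
    added: list[str]
    deleted: list[str]
    renamed: list[Rename]
    unchanged: list[str]
    unknown: list[str]
    not_in_cache: list[str]


class GitInfo(TypedDict, total=False):  # total=False is an assumption
    staged: Any       # shape not specified in the table
    unstaged: Any     # shape not specified in the table
    untracked: list[str]
    is_empty: bool
    is_dirty: bool


class Status(TypedDict):
    not_in_cache: list[str]
    not_in_remote: list[str]
    committed: DiffResult
    uncommitted: DiffResult
    untracked: list[str]
    unchanged: list[str]
    git: GitInfo


# A hand-built value with made-up paths, only to show how the pieces nest.
example: Status = {
    "not_in_cache": [],
    "not_in_remote": ["data/raw.csv"],
    "committed": {"renamed": [{"old": "data/a.csv", "new": "data/b.csv"}]},
    "uncommitted": {"modified": ["data/raw.csv"]},
    "untracked": [],
    "unchanged": ["models/model.pkl"],
    "git": {"is_dirty": True, "is_empty": False},
}
```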

Internal helpers:

  • _diff_index_to_wtree() -- builds a workspace data index, then diffs it against the repo index to produce uncommitted changes.
  • _diff_head_to_index() -- switches to HEAD (or a specified revision), then diffs the head index against the current index for committed changes.
  • _get_entries_not_in_remote() -- checks remote storage for missing entries using bulk existence queries.
  • _git_info() -- queries the SCM for staged, unstaged, and untracked files plus empty/dirty flags.
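The bulk existence queries mentioned for _get_entries_not_in_remote() can be pictured as the batching pattern below. This is a generic sketch, not DVC's code: find_missing, batched, and the bulk_exists callable (assumed to return a {path: bool} mapping for each batch) are hypothetical names introduced for illustration.

```python
from collections.abc import Callable, Iterable, Iterator
from itertools import islice


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def find_missing(
    paths: Iterable[str],
    bulk_exists: Callable[[list[str]], dict[str, bool]],  # hypothetical remote query
    batch_size: int = 2,
) -> list[str]:
    """Return the paths that the bulk query reports as absent, one query per batch."""
    missing: list[str] = []
    for chunk in batched(paths, batch_size):
        found = bulk_exists(chunk)
        missing.extend(p for p in chunk if not found.get(p, False))
    return missing


# Stand-in for remote storage: only these two paths "exist".
stored = {"a", "c"}
calls: list[list[str]] = []


def fake_bulk_exists(chunk: list[str]) -> dict[str, bool]:
    calls.append(list(chunk))
    return {p: p in stored for p in chunk}


missing = find_missing(["a", "b", "c", "d", "e"], fake_bulk_exists, batch_size=2)
```

Grouping lookups this way trades one round trip per entry for one per batch, which is presumably the purpose of a batch_size setting for existence checks.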

Signature

def status(
    repo: "Repo",
    targets: Optional[Iterable[Union[os.PathLike[str], str]]] = None,
    *,
    granular: bool = False,
    untracked_files: str = "no",
    remote: Optional[str] = None,
    not_in_remote: bool = False,
    remote_refresh: bool = False,
    config: Optional[dict] = None,
    batch_size: Optional[int] = None,
    head: str = "HEAD",
    with_renames: bool = True,
) -> Status:
    ...


def _diff(
    old: "BaseDataIndex",
    new: "BaseDataIndex",
    *,
    filter_keys: Optional[Iterable["DataIndexKey"]] = None,
    granular: bool = False,
    not_in_cache: bool = False,
    batch_size: Optional[int] = None,
    callback: "Callback" = DEFAULT_CALLBACK,
    with_renames: bool = False,
) -> DiffResult:
    ...

Import

from dvc.repo.data import status

Input/Output

Function | Input | Output
status() | repo: Repo -- the DVC repository instance; targets -- optional iterable of paths to filter; granular -- if True, report file-level diffs inside directories; untracked_files -- controls untracked file listing ("no", "all"); remote -- remote name for not_in_remote checks; not_in_remote -- enable remote missing check; remote_refresh -- refresh remote index; config -- additional config overrides; batch_size -- batch size for existence checks; head -- revision to compare against (default "HEAD"); with_renames -- detect renames | Status TypedDict with keys: not_in_cache (list[str]), not_in_remote (list[str]), committed (DiffResult), uncommitted (DiffResult), untracked (list[str]), unchanged (list[str]), git (GitInfo)
_diff() | old, new -- two BaseDataIndex objects; filter_keys -- optional key filter; granular -- file-level diff; not_in_cache -- check cache existence; batch_size, callback, with_renames | DiffResult TypedDict with categorized path lists

Example

from dvc.repo import Repo
from dvc.repo.data import status

with Repo() as repo:
    result = status(
        repo,
        targets=["data/"],
        granular=True,
        untracked_files="all",
        not_in_remote=True,
        remote="myremote",
    )

    # Inspect uncommitted changes
    for path in result["uncommitted"].get("modified", []):
        print(f"Modified (uncommitted): {path}")

    # Inspect committed changes
    for path in result["committed"].get("added", []):
        print(f"Added (committed): {path}")

    # Check what is missing from remote
    for path in result["not_in_remote"]:
        print(f"Not in remote: {path}")

    # Git status info
    if result["git"].get("is_dirty"):
        print("Working tree has uncommitted Git changes")
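When the full path lists are not needed, a result shaped like Status can be reduced to per-category counts. The helper below is a hypothetical convenience, not part of DVC; it operates on a plain dict so that it runs without a repository, and the sample input uses made-up paths.

```python
def summarize(status: dict) -> dict[str, int]:
    """Count entries per category in a Status-shaped dict (illustrative helper)."""
    counts: dict[str, int] = {}
    # committed/uncommitted hold DiffResult-like dicts of category -> list.
    for section in ("committed", "uncommitted"):
        for kind, entries in status.get(section, {}).items():
            counts[f"{section}.{kind}"] = len(entries)
    # The remaining keys are flat lists of paths.
    for key in ("untracked", "not_in_cache", "not_in_remote"):
        counts[key] = len(status.get(key, []))
    return counts


sample = {
    "committed": {"added": ["data/new.csv"]},
    "uncommitted": {"modified": ["data/raw.csv", "data/labels.csv"]},
    "untracked": [],
    "not_in_cache": [],
    "not_in_remote": ["data/raw.csv"],
}
summary = summarize(sample)
```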
