Implementation:Iterative Dvc Repo Diff

From Leeroopedia
Revision as of 15:19, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Iterative_Dvc_Repo_Diff.md)


Knowledge Sources
Domains Data_Management, Version_Control
Last Updated 2026-02-10 10:00 GMT

Overview

The Repo_Diff implementation compares the state of DVC-tracked files between two revisions or between the workspace and a commit. It resides in dvc/repo/diff.py (161 lines) and is the core logic behind the dvc diff command.

from dvc.repo.diff import diff

Function Signature

from typing import Optional

from dvc.repo import locked


@locked
def diff(
    self,
    a_rev: str = "HEAD",
    b_rev: Optional[str] = None,
    targets: Optional[list[str]] = None,
    recursive: bool = False,
):
    ...  # body omitted; see dvc/repo/diff.py

Parameters

Parameter   Type                  Default   Description
self        Repo                  N/A       The DVC repository instance
a_rev       str                   "HEAD"    The base revision for comparison
b_rev       Optional[str]         None      The target revision; if None, the current workspace is used
targets     Optional[list[str]]   None      Specific paths to restrict the diff to
recursive   bool                  False     Whether to recursively search targets for DVC-tracked files

Return Value

The function returns a dictionary with the following keys, each mapping to a list of change entries:

Key            Description
added          Files that exist in b_rev but not in a_rev
deleted        Files that exist in a_rev but not in b_rev
modified       Files present in both revisions with different hash values
renamed        Files that were renamed between revisions (same hash, different path)
not in cache   Files whose data is missing from the local cache (only when comparing against workspace)

If there are no differences, an empty dictionary is returned.
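
As a rough illustration of the return shape described above, the sketch below counts the entries under each key and treats an empty dictionary as "no differences". The `summarize` helper and the sample data are hypothetical, and the per-entry fields (`path`, `hash`) are an assumption about the entry layout, which this page does not spell out.

```python
# Keys documented in the Return Value table.
CATEGORIES = ("added", "deleted", "modified", "renamed", "not in cache")


def summarize(result: dict) -> dict:
    """Count change entries per category; absent keys count as zero."""
    # An empty dict (no differences) therefore yields all zeros.
    return {key: len(result.get(key, [])) for key in CATEGORIES}


# Hypothetical sample; the entry fields are assumed, not captured DVC output.
sample = {
    "added": [{"path": "data/new.csv", "hash": "abc123"}],
    "deleted": [],
    "modified": [{"path": "model.pkl", "hash": {"old": "111", "new": "222"}}],
    "renamed": [],
    "not in cache": [],
}

print(summarize(sample))
print(summarize({}))
```

Reading keys with `result.get(key, [])` keeps the caller safe in the empty-dictionary case, where none of the keys are present.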

Internal Mechanics

Helper Functions

The module defines two helper functions used to extract information from index entries:

  • _path(entry) -- Returns the file path from an index entry, appending a trailing separator for directories.
  • _hash(entry) -- Returns the hash value from an index entry, or None if unavailable.
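
The behavior of the two helpers can be sketched roughly as follows. This is not the module's code: `FakeEntry` is a hypothetical stand-in for an index entry, and its field names (`key`, `isdir`, `hash_value`) are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEntry:
    """Hypothetical stand-in for an index entry (not the dvc_data class)."""

    key: tuple  # path components, e.g. ("data", "raw")
    isdir: bool = False
    hash_value: Optional[str] = None


def path_of(entry: FakeEntry) -> str:
    """Join the key into a path; directories get a trailing separator."""
    if entry.isdir:
        # Joining with an empty final component appends os.sep.
        return os.path.join(*entry.key, "")
    return os.path.join(*entry.key)


def hash_of(entry: Optional[FakeEntry]) -> Optional[str]:
    """Return the entry's hash value, or None when it is unavailable."""
    if entry is not None and entry.hash_value:
        return entry.hash_value
    return None
```

The trailing separator lets a consumer tell a directory entry from a file entry by looking at the path string alone.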

Core Diff Logic (_diff)

The _diff function delegates to dvc_data.index.diff.diff (aliased as idiff) and categorizes each change by its type:

from dvc_data.index.diff import ADD, DELETE, MODIFY, RENAME
from dvc_data.index.diff import diff as idiff

Key behaviors:

  • Rename detection is enabled via with_renames=True.
  • Unknown entries are included to avoid false positives from missing directory entries.
  • Unchanged entries are included when with_missing=True to check if data exists in cache.
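
The categorization step can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the actual `_diff` body. `FakeChange` is a hypothetical record, the change-type constants are local stand-ins for the ones imported from dvc_data.index.diff, the entry layout is assumed, and the "not in cache" check is left out.

```python
from typing import NamedTuple, Optional

# Local stand-ins for the change-type constants; the real ones are imported
# from dvc_data.index.diff and their values are not assumed here.
ADD, DELETE, MODIFY, RENAME = "add", "delete", "modify", "rename"


class FakeChange(NamedTuple):
    """Hypothetical flattened view of one change between two indexes."""

    typ: str
    old_path: Optional[str] = None
    new_path: Optional[str] = None
    old_hash: Optional[str] = None
    new_hash: Optional[str] = None


def categorize(changes) -> dict:
    """Bucket changes by type; return {} when nothing changed."""
    ret = {"added": [], "deleted": [], "modified": [], "renamed": []}
    for c in changes:
        if c.typ == ADD:
            ret["added"].append({"path": c.new_path, "hash": c.new_hash})
        elif c.typ == DELETE:
            ret["deleted"].append({"path": c.old_path, "hash": c.old_hash})
        elif c.typ == MODIFY:
            ret["modified"].append(
                {"path": c.old_path, "hash": {"old": c.old_hash, "new": c.new_hash}}
            )
        elif c.typ == RENAME:
            ret["renamed"].append(
                {"path": {"old": c.old_path, "new": c.new_path}, "hash": c.old_hash}
            )
        # Any other change type (e.g. unchanged) is ignored in this sketch.
    # Mirror the documented behavior: no differences means an empty dict.
    return ret if any(ret.values()) else {}
```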

Revision Branching

The diff function uses self.brancher(revs=[a_rev, b_rev]) to iterate over the specified revisions. For the workspace revision, it calls build_data_index with compute_hash=True to build a fresh data index. For committed revisions, it reads directly from view.data["repo"].
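
The per-revision dispatch described above might look roughly like this. The sketch replaces the brancher and the index builders with hypothetical callables (`load_committed`, `build_workspace`), and the `"workspace"` sentinel name is an assumption. It only shows which source each revision's index comes from.

```python
from typing import Callable, Optional

WORKSPACE = "workspace"  # assumed sentinel name for the working tree


def collect_indexes(
    a_rev: str,
    b_rev: Optional[str],
    load_committed: Callable[[str], dict],
    build_workspace: Callable[[], dict],
) -> dict:
    """Map each requested revision to its data index."""
    # With no b_rev, the comparison target is the current workspace.
    target = b_rev if b_rev else WORKSPACE
    indexes = {}
    for rev in (a_rev, target):
        if rev in indexes:
            continue  # the same revision name was requested twice
        if rev == WORKSPACE:
            # Stands in for building a fresh index with hashes computed.
            indexes[rev] = build_workspace()
        else:
            # Stands in for reading the stored index of a committed revision.
            indexes[rev] = load_committed(rev)
    return indexes
```

Hashing is only needed on the workspace path, since committed revisions already carry recorded hashes.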

Missing Target Handling

If specific targets are provided and a target is missing from both revisions, a FileNotFoundError is raised. Targets missing from only one revision are handled gracefully as additions or deletions.
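
The rule above amounts to a set intersection: only targets missing on both sides are an error. A minimal sketch, assuming the missing targets for each revision have already been collected into sets (the function name and arguments are hypothetical):

```python
import errno
import os


def check_missing_targets(old_missing: set, new_missing: set) -> None:
    """Raise FileNotFoundError for a target absent from both revisions."""
    # Targets missing on only one side are left to show up as
    # additions or deletions in the diff itself.
    missing_in_both = sorted(old_missing & new_missing)
    if missing_in_both:
        raise FileNotFoundError(
            errno.ENOENT, os.strerror(errno.ENOENT), missing_in_both[0]
        )
```

Sorting the intersection is a choice made here so that the reported target is deterministic. The page does not say which target is reported when several are missing.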

Usage Example

from dvc.repo import Repo

with Repo() as repo:
    # Compare workspace against HEAD
    result = repo.diff()

    # Compare two specific tags
    result = repo.diff(a_rev="v1.0", b_rev="v2.0")

    # Diff specific targets recursively
    result = repo.diff(targets=["data/"], recursive=True)

Dependencies

Module                            Purpose
dvc_data.index.diff               Provides the low-level index diffing with ADD, DELETE, MODIFY, RENAME change types
dvc.repo.locked                   Decorator ensuring the repository lock is held during execution
dvc.ui                            Provides status display during workspace index building and diff calculation
dvc.repo.index.build_data_index   Builds a data index for the workspace with computed hashes

See Also

  • Repo_Gc -- Garbage collection uses similar index traversal logic
  • Repo_Du -- Disk usage also operates on repository data
