Implementation:Iterative Dvc Repo Du

From Leeroopedia


Knowledge Sources
Domains Data_Management, Utilities
Last Updated 2026-02-10 10:00 GMT

Overview

The Repo_Du implementation calculates disk usage for paths within a DVC repository. It resides in dvc/repo/du.py (42 lines) and provides functionality analogous to the Unix du command but operating on DVC-tracked data.

from dvc.repo.du import du

Function Signature

from typing import Any, Optional, Union


def du(
    url: str,
    path: Optional[str] = None,
    rev: Optional[str] = None,
    summarize: bool = False,
    config: Union[dict[str, Any], str, None] = None,
    remote: Optional[str] = None,
    remote_config: Optional[dict] = None,
):
    ...

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | URL or path to the DVC repository |
| path | Optional[str] | None | Specific path within the repository to measure; defaults to the repository root |
| rev | Optional[str] | None | Git revision (branch, tag, or commit) to inspect |
| summarize | bool | False | If True, returns only the total size for the path rather than per-entry sizes |
| config | Union[dict[str, Any], str, None] | None | Configuration dictionary or path to a config file |
| remote | Optional[str] | None | Name of the DVC remote to use |
| remote_config | Optional[dict] | None | Additional remote-specific configuration |

Return Value

Returns a list of tuples where each tuple contains:

| Index | Type | Description |
|---|---|---|
| 0 | str | The path of the entry |
| 1 | int | The total disk usage in bytes for that entry |

When summarize=True or the target path is a file, the list has a single element. When the target is a directory and summarize is False, the list contains one entry per child, followed by a final summary entry holding the total for the directory.
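
As a minimal sketch of consuming this return shape, the helper below separates the per-child entries from the trailing total. The name split_total and the sample data are hypothetical and not part of DVC; the sketch assumes, as described above, that the summary entry is always the last element.

```python
def split_total(entries):
    """Split a du-style result into (children, total).

    Hypothetical helper, not part of DVC. Assumes the summary entry is
    the last element, so a single-element list has no children.
    """
    *children, total = entries
    return children, total


# Made-up sample shaped like a directory result.
sample = [("data/a.csv", 100), ("data/raw", 50), ("data", 150)]
children, (total_path, total_size) = split_total(sample)
print(children)
print(total_path, total_size)
```

For a single-element result (a file, or summarize=True), the same helper returns an empty list of children and the only entry as the total.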

Internal Mechanics

Configuration Loading

If config is provided as a string (file path) rather than a dictionary, it is loaded using Config.load_file:

from dvc.config import Config

if config and not isinstance(config, dict):
    config_dict = Config.load_file(config)
else:
    config_dict = None
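
The branch above can be exercised in isolation with a stand-in loader. In this sketch, normalize_config and fake_loader are hypothetical names introduced for illustration; fake_loader takes the place of Config.load_file so that the example runs without DVC installed. Read literally, the excerpt yields None both when config is missing and when it is a dictionary; only a non-empty, non-dict value reaches the loader.

```python
def normalize_config(config, load_file):
    """Mirror the branch from the excerpt with an injected loader.

    Hypothetical helper; load_file stands in for Config.load_file.
    """
    if config and not isinstance(config, dict):
        return load_file(config)
    return None


def fake_loader(path):
    # Stand-in loader that only records which path it was asked to read.
    return {"loaded_from": path}


print(normalize_config("my.config", fake_loader))
print(normalize_config(None, fake_loader))
print(normalize_config({"core": {}}, fake_loader))
```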

Repository Access

The function opens the repository using Repo.open with subrepos=True and uninitialized=True, allowing it to operate on repositories that may contain sub-repositories or that lack full initialization:

from dvc.repo import Repo

with Repo.open(
    url,
    rev=rev,
    subrepos=True,
    uninitialized=True,
    config=config_dict,
    remote=remote,
    remote_config=remote_config,
) as repo:
    fs = repo.dvcfs  # DVC filesystem used for the size calculation

Size Calculation

The function uses repo.dvcfs (the DVC filesystem) to calculate sizes:

  • For files or when summarize=True: calls fs.du(path, total=True) once.
  • For directories: iterates over fs.ls(path), calculates the size for each child entry, then appends a summary total.
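
The two branches above can be sketched against an in-memory stand-in for the filesystem. FakeFS and du_sketch are hypothetical names introduced here, not DVC code; the stand-in only imitates the du, ls, and isdir calls listed for repo.dvcfs, and the use of isdir to tell files from directories is an assumption based on that list.

```python
from typing import Optional


class FakeFS:
    """Hypothetical in-memory stand-in for repo.dvcfs (not part of DVC)."""

    def __init__(self, files: dict[str, int]):
        # Maps "/"-separated file paths to sizes in bytes.
        self.files = files

    @staticmethod
    def _prefix(path: str) -> str:
        # "" means the root; otherwise match everything below "path/".
        return path.rstrip("/") + "/" if path else ""

    def isdir(self, path: str) -> bool:
        prefix = self._prefix(path)
        return path not in self.files and any(
            name.startswith(prefix) for name in self.files
        )

    def ls(self, path: str) -> list[str]:
        # Return the immediate children of path as full paths.
        prefix = self._prefix(path)
        children = set()
        for name in self.files:
            if name.startswith(prefix):
                first = name[len(prefix):].split("/", 1)[0]
                children.add(prefix + first)
        return sorted(children)

    def du(self, path: str, total: bool = True) -> int:
        # The stand-in always returns a total; the flag is kept for parity.
        if path in self.files:
            return self.files[path]
        prefix = self._prefix(path)
        return sum(size for name, size in self.files.items() if name.startswith(prefix))


def du_sketch(fs, path: Optional[str] = None, summarize: bool = False):
    path = path or ""  # no path means the repository root
    if summarize or not fs.isdir(path):
        # File target or summary requested: one call, one entry.
        return [(path, fs.du(path, total=True))]
    # Directory target: one entry per child, then the overall total.
    entries = [(child, fs.du(child, total=True)) for child in fs.ls(path)]
    entries.append((path, sum(size for _, size in entries)))
    return entries


fs = FakeFS({"data/a.csv": 100, "data/raw/b.bin": 50, "model.pkl": 7})
print(du_sketch(fs, "data"))
print(du_sketch(fs, "data", summarize=True))
```

In this sketch the final total is the sum of the child sizes, which keeps the summary entry consistent with the per-child entries that precede it.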

Usage Example

from dvc.repo.du import du

# Get disk usage for a remote repository path
entries = du("https://github.com/example/repo", path="data/")

# Summarized usage for a specific revision
entries = du("/path/to/repo", path="models/", rev="v1.0", summarize=True)

for path, size in entries:
    print(f"{path}: {size} bytes")
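
Since sizes are returned as raw byte counts, a small formatter can make the printed output easier to read. The helper below is a hypothetical addition, not part of DVC, and the entries are made-up values in the same (path, size) shape; it uses binary units (1 KiB = 1024 bytes) as an assumption about the desired display.

```python
def format_size(num_bytes: int) -> str:
    """Render a byte count with binary units (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            break
        size /= 1024
    else:
        # Anything that survived four divisions is reported in TiB.
        unit = "TiB"
    return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"


# Made-up entries shaped like the list returned by du.
sample_entries = [("models/small.bin", 512), ("models/large.bin", 5 * 1024**3)]
for entry_path, entry_size in sample_entries:
    print(f"{entry_path}: {format_size(entry_size)}")
```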

Dependencies

| Module | Purpose |
|---|---|
| dvc.config.Config | Loading configuration from file paths |
| dvc.repo.Repo | Opening and interacting with the DVC repository |
| repo.dvcfs | The DVC filesystem used for du, ls, and isdir operations |

See Also

  • Repo_Get -- Also uses the Repo.open pattern for remote repository access
  • Repo_Diff -- Another data inspection function that operates on repository state
