Implementation:Iterative Dvc Repo Du
| Knowledge Sources | |
|---|---|
| Domains | Data_Management, Utilities |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The Repo_Du implementation calculates disk usage for paths within a DVC repository. It resides in dvc/repo/du.py (42 lines) and works like the Unix du command, but operates on DVC-tracked data.
from dvc.repo.du import du
Function Signature
def du(
    url: str,
    path: Optional[str] = None,
    rev: Optional[str] = None,
    summarize: bool = False,
    config: Union[dict[str, Any], str, None] = None,
    remote: Optional[str] = None,
    remote_config: Optional[dict] = None,
):
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | URL or path to the DVC repository |
| path | Optional[str] | None | Specific path within the repository to measure; defaults to root |
| rev | Optional[str] | None | Git revision (branch, tag, or commit) to inspect |
| summarize | bool | False | If True, returns only the total size for the path rather than per-entry sizes |
| config | Union[dict, str, None] | None | Configuration dictionary or path to a config file |
| remote | Optional[str] | None | Name of the DVC remote to use |
| remote_config | Optional[dict] | None | Additional remote-specific configuration |
Return Value
Returns a list of tuples where each tuple contains:
| Index | Type | Description |
|---|---|---|
| 0 | str | The path of the entry |
| 1 | int | The total disk usage in bytes for that entry |
When summarize=True or the target path is a file, a single-element list is returned. When the target is a directory, the list contains one entry per child plus a final summary entry with the total.
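Because the total is always the last tuple in either case, a caller can separate it from the per-child entries with a single slice. The sketch below assumes only the return shape described above; `split_du_result` is a hypothetical helper and the sample sizes are made up.

```python
from typing import List, Tuple

DuEntry = Tuple[str, int]


def split_du_result(entries: List[DuEntry]) -> Tuple[List[DuEntry], DuEntry]:
    """Split a du()-style result into (children, total).

    Hypothetical helper: it relies only on the documented shape, where
    the final tuple carries the total for the requested path.
    """
    if not entries:
        raise ValueError("expected at least one (path, size) entry")
    # A single-element list has no children; its only entry is the total.
    return entries[:-1], entries[-1]


# Sample data shaped like a directory result (values are illustrative).
sample = [("data/train.csv", 1200), ("data/test.csv", 300), ("data", 1500)]
children, total = split_du_result(sample)
```

For a file target or a summarized call, `children` is simply an empty list.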
Internal Mechanics
Configuration Loading
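If config is provided as a string (a file path) rather than a dictionary, it is loaded using Config.load_file. In the excerpt below, any other value, including a dictionary or None, leaves config_dict set to None: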
from dvc.config import Config
if config and not isinstance(config, dict):
    config_dict = Config.load_file(config)
else:
    config_dict = None
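To make the three possible outcomes of this branch concrete, the following sketch reproduces the same condition with the loader passed in as an argument, so it runs without DVC installed. `resolve_config` and `fake_loader` are hypothetical names introduced for illustration; in the excerpt the loader is Config.load_file.

```python
from typing import Callable, Optional, Union


def resolve_config(
    config: Union[dict, str, None],
    load_file: Callable[[str], dict],
) -> Optional[dict]:
    """Mirror the branch shown above, with an injectable loader (assumption)."""
    if config and not isinstance(config, dict):
        # Truthy non-dict value: treated as a path and handed to the loader.
        return load_file(config)
    # Dictionaries, None, and empty strings all fall through to None.
    return None


def fake_loader(path: str) -> dict:
    # Stand-in for Config.load_file; it just records the path it was given.
    return {"loaded_from": path}


from_path = resolve_config("custom.cfg", fake_loader)
from_dict = resolve_config({"core": {"remote": "origin"}}, fake_loader)
from_none = resolve_config(None, fake_loader)
```

Here `from_path` holds the loader's result, while `from_dict` and `from_none` are both None.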
Repository Access
The function opens the repository using Repo.open with subrepos=True and uninitialized=True, allowing it to operate on repositories that may contain sub-repositories or that lack full initialization:
with Repo.open(
    url,
    rev=rev,
    subrepos=True,
    uninitialized=True,
    config=config_dict,
    remote=remote,
    remote_config=remote_config,
) as repo:
Size Calculation
The function uses the repo.dvcfs (DVC filesystem) to calculate sizes:
- For files or when summarize=True: calls fs.du(path, total=True) once.
- For directories: iterates over fs.ls(path), calculates the size for each child entry, then appends a summary total.
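The two branches above can be sketched against a small in-memory filesystem. This is an illustration under stated assumptions, not the contents of dvc/repo/du.py: `FakeFS` is a hypothetical stand-in for repo.dvcfs that exposes only isdir, ls, and du, `disk_usage` is a hypothetical name, and the summary total is assumed to be the sum of the child sizes.

```python
from typing import Dict, List, Tuple


class FakeFS:
    """Tiny in-memory stand-in for repo.dvcfs (hypothetical, illustration only)."""

    def __init__(self, files: Dict[str, int]) -> None:
        self.files = files  # maps file path -> size in bytes

    @staticmethod
    def _prefix(path: str) -> str:
        # The root ("") matches every file; other directories need a trailing slash.
        return (path.rstrip("/") + "/") if path else ""

    def isdir(self, path: str) -> bool:
        prefix = self._prefix(path)
        return path not in self.files and any(f.startswith(prefix) for f in self.files)

    def ls(self, path: str) -> List[str]:
        prefix = self._prefix(path)
        # Keep only the first path component below the directory.
        children = {
            prefix + f[len(prefix):].split("/", 1)[0]
            for f in self.files
            if f.startswith(prefix)
        }
        return sorted(children)

    def du(self, path: str, total: bool = True) -> int:
        # `total` is accepted only to match the call shape used in the text.
        if path in self.files:
            return self.files[path]
        prefix = self._prefix(path)
        return sum(size for f, size in self.files.items() if f.startswith(prefix))


def disk_usage(fs: FakeFS, path: str = "", summarize: bool = False) -> List[Tuple[str, int]]:
    # Files and summarized calls: one du() call, one entry.
    if summarize or not fs.isdir(path):
        return [(path, fs.du(path, total=True))]
    # Directories: one entry per child, then a final total (assumed to be the sum).
    entries = [(child, fs.du(child, total=True)) for child in fs.ls(path)]
    entries.append((path, sum(size for _, size in entries)))
    return entries


fs = FakeFS({"data/a.csv": 100, "data/raw/b.csv": 250, "model.pkl": 40})
per_child = disk_usage(fs, "data")
summary = disk_usage(fs, "data", summarize=True)
single_file = disk_usage(fs, "model.pkl")
```

With this sample data, `per_child` lists data/a.csv and data/raw followed by the total for data, while `summary` and `single_file` each contain a single tuple.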
Usage Example
from dvc.repo.du import du
# Get disk usage for a remote repository path
entries = du("https://github.com/example/repo", path="data/")
# Summarized usage for a specific revision
entries = du("/path/to/repo", path="models/", rev="v1.0", summarize=True)
for path, size in entries:
    print(f"{path}: {size} bytes")
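Since sizes are returned as raw byte counts, callers may want to render them in larger units, much as du -h does on Unix. The helper below is a hypothetical convenience that is not part of DVC; it assumes binary units (1 KiB = 1024 bytes) and uses made-up sample entries in place of a real du() call.

```python
def format_size(num_bytes: int) -> str:
    """Render a byte count using binary units (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that survives four divisions is reported in TiB.
    return f"{size:.1f} TiB"


# Illustrative entries shaped like a du() result; the sizes are invented.
entries = [("models/a.bin", 2048), ("models", 5 * 1024 * 1024)]
lines = [f"{path}: {format_size(size)}" for path, size in entries]
```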
Dependencies
| Module | Purpose |
|---|---|
| dvc.config.Config | Loading configuration from file paths |
| dvc.repo.Repo | Opening and interacting with the DVC repository |
| repo.dvcfs | The DVC filesystem used for du, ls, and isdir operations |