Implementation:Iterative Dvc Output Save
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, Cryptographic_Hashing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for computing content hashes and collecting metadata for DVC-tracked outputs, provided by the DVC library.
Description
The Output.save method in DVC's dvc/output.py module is responsible for computing the content hash of a tracked file or directory and populating the output object's metadata attributes. It is the primary entry point for the content hashing phase of the data tracking workflow.
When save() is called, it first validates that the output exists on disk and is either a regular file or a directory. It then adds the output path to .gitignore (via self.ignore()) so that Git does not track the actual data. Next, it delegates to the internal _build method, which calls the dvc_data library's build function to traverse the file or directory and hash every file it contains (MD5 is the typical algorithm). The result is a HashFile object for a single file, or a Tree object for a directory. A Tree is a Merkle-tree-like manifest of all contained files and their hashes.
The _build method forwards its positional arguments to dvc_data's build function: a hash file database (odb), a filesystem path, a filesystem object, and a hash algorithm name. It streams file contents through the hash function with progress reporting and returns a tuple of (HashFileDB, Meta, HashFile). The first element is the database used during the build, Meta contains size and file count information, and HashFile carries the computed hash_info. After _build completes, save() assigns self.meta, self.obj, and self.hash_info on the output object, making the computed hash available for subsequent serialization to metafiles and transfer to cache.
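The flow described above (validate, build, assign) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not DVC's implementation: SketchOutput, SketchMeta, sketch_build, and demo_save are hypothetical names, the .gitignore step is omitted, and the directory digest (MD5 over a sorted JSON manifest) only imitates the idea of a Tree manifest.

```python
import hashlib
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMeta:
    """Hypothetical stand-in for Meta: total size and file count."""
    size: int
    nfiles: Optional[int] = None


def _md5_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sketch_build(path: str):
    """Return (hash value, meta, manifest); manifest is None for a file."""
    if os.path.isfile(path):
        return _md5_file(path), SketchMeta(size=os.path.getsize(path)), None
    manifest, total = [], 0
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, path).replace(os.sep, "/")
            manifest.append({"relpath": rel, "md5": _md5_file(full)})
            total += os.path.getsize(full)
    manifest.sort(key=lambda entry: entry["relpath"])
    # Assumption: the directory hash is the MD5 of the serialized manifest.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    meta = SketchMeta(size=total, nfiles=len(manifest))
    return hashlib.md5(payload).hexdigest(), meta, manifest


class SketchOutput:
    """Hypothetical, simplified analogue of an output object."""

    def __init__(self, path: str):
        self.path = path
        self.hash_value = None
        self.meta = None
        self.manifest = None

    def save(self) -> None:
        # Step 1: validate that the path exists and is a file or directory.
        if not os.path.exists(self.path):
            raise FileNotFoundError(self.path)
        if not (os.path.isfile(self.path) or os.path.isdir(self.path)):
            raise ValueError(f"{self.path} is not a file or directory")
        # Step 2: build the hash, then assign the results on the object.
        self.hash_value, self.meta, self.manifest = sketch_build(self.path)


def demo_save():
    """Save one empty file and one small directory inside a temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "empty.bin")
        open(empty, "wb").close()
        data = os.path.join(tmp, "data")
        os.makedirs(os.path.join(data, "sub"))
        with open(os.path.join(data, "a.txt"), "wb") as fobj:
            fobj.write(b"abc")
        with open(os.path.join(data, "sub", "b.txt"), "wb") as fobj:
            fobj.write(b"de")
        file_out, dir_out = SketchOutput(empty), SketchOutput(data)
        file_out.save()
        dir_out.save()
    return file_out, dir_out
```

Hashing in chunks keeps memory use constant regardless of file size, and sorting the manifest makes the directory digest independent of filesystem traversal order.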
Usage
Use Output.save when you need to compute or refresh the content hash of a DVC output. This is called internally by the dvc add workflow (through Stage.save) and by dvc commit when updating hashes for pipeline outputs. It is also useful when building custom tooling that needs to detect changes to tracked files by comparing current hashes against previously recorded values.
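The change-detection use case mentioned above amounts to comparing a freshly computed hash with a recorded one. The following is a rough sketch using only hashlib, assuming MD5 and a single file; md5_of, has_changed, and demo_change_detection are hypothetical helpers, not part of DVC.

```python
import hashlib
import os
import tempfile


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file by streaming it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, recorded_md5: str) -> bool:
    """Return True when the file's current hash differs from the recorded one."""
    return md5_of(path) != recorded_md5


def demo_change_detection():
    """Record a hash, check it, modify the file, and check again."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        with open(target, "wb") as fobj:
            fobj.write(b"v1")
        recorded = md5_of(target)  # the previously recorded value
        before_edit = has_changed(target, recorded)
        with open(target, "wb") as fobj:
            fobj.write(b"v2")
        after_edit = has_changed(target, recorded)
    return before_edit, after_edit
```

In custom tooling built on DVC itself, the recorded value would presumably come from the metafile and the fresh value from out.hash_info.value after calling out.save().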
Code Reference
Source Location
- Repository: DVC
- File: dvc/output.py
- Lines: L682-727 (save), L538-550 (_build)
Signature
class Output:
    def save(self) -> None:
        """Compute content hash and populate hash_info, meta, and obj."""
        ...

    def _build(
        self,
        *args,
        no_progress_bar: bool = False,
        **kwargs,
    ) -> tuple["HashFileDB", "Meta", "HashFile"]:
        """Build a HashFile object by computing hashes of file contents.

        Delegates to dvc_data.build() with progress tracking.
        """
        ...
Import
from dvc.output import Output
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| self | Output | Yes | The Output instance representing a tracked file or directory. Must have fs (filesystem), fs_path (absolute path), hash_name (hash algorithm, typically "md5"), cache (HashFileDB for content-addressable storage), and use_cache (bool) attributes populated. |
| no_progress_bar | bool | No | When True, suppresses the progress bar during hash computation. Defaults to False. Only applies to _build. |
Outputs
| Name | Type | Description |
|---|---|---|
| self.hash_info | HashInfo | Populated with the computed content hash. Contains value (the hex digest string, e.g., "d41d8cd98f00b204e9800998ecf8427e"), name (the hash algorithm), and isdir (True for directories). |
| self.meta | Meta | Populated with file metadata: size (total bytes), nfiles (number of files, for directories), and isexec (whether the file is executable). |
| self.obj | HashFile | The built hash file object. For files, a single HashFile; for directories, a Tree object containing the manifest of all files and their individual hashes. |
| (_build return) | tuple[HashFileDB, Meta, HashFile] | A tuple of the staging database used during the build, the computed metadata, and the hash file object. Used internally by save() and other methods. |
Usage Examples
Basic Usage
from dvc.repo import Repo
# Open a DVC repository and get a tracked output
repo = Repo()
stages = list(repo.index.stages)
# Access the first output of the first stage
stage = stages[0]
out = stage.outs[0]
# Save (compute hash) for the output
out.save()
# Inspect the computed hash
print(f"Hash: {out.hash_info.value}")
print(f"Algorithm: {out.hash_info.name}")
print(f"Is directory: {out.hash_info.isdir}")
print(f"Size: {out.meta.size} bytes")
# For a directory output, inspect the Merkle tree
if out.hash_info.isdir:
    print(f"Number of files: {out.meta.nfiles}")
    # Each Tree entry unpacks to (path components, metadata, hash info)
    for entry in out.obj:
        ikey, _, oid = entry
        print(f"  {'/'.join(ikey)}: {oid.value}")