Implementation:Iterative Dvc Output Save

From Leeroopedia


Knowledge Sources
Domains Data_Versioning, Cryptographic_Hashing
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool for computing content hashes and collecting metadata for DVC-tracked outputs, provided by the DVC library.

Description

The Output.save method in DVC's dvc/output.py module is responsible for computing the content hash of a tracked file or directory and populating the output object's metadata attributes. It is the primary entry point for the content hashing phase of the data tracking workflow.

When save() is called, it first validates that the output exists on disk and is either a regular file or a directory. It then adds the output path to .gitignore (via self.ignore()) so that Git does not track the data itself. Next, it delegates to the internal _build method, which calls the dvc_data library's build function to traverse the file or directory and hash every file with the output's configured algorithm (hash_name, typically MD5). The result is a HashFile object for a single file, or a Tree object for a directory: a Merkle-tree-like manifest of all contained files and their hashes.
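The validate-then-hash flow described above can be illustrated with the standard library alone. This is a minimal sketch, not DVC's implementation: the helper names (file_md5, build_manifest, hash_path) and the JSON manifest layout are assumptions made for illustration, and DVC's own Tree serialization may differ.

```python
import hashlib
import json
import os


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> list[dict[str, str]]:
    """List every file under root with its hash, sorted by relative path.

    Hypothetical manifest layout, loosely modelled on the idea of a Tree.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"relpath": rel, "md5": file_md5(full)})
    return sorted(entries, key=lambda entry: entry["relpath"])


def hash_path(path: str) -> str:
    """Validate the path, then hash a file directly or a directory via its manifest."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if os.path.isfile(path):
        return file_md5(path)
    if os.path.isdir(path):
        manifest = json.dumps(build_manifest(path), sort_keys=True).encode()
        return hashlib.md5(manifest).hexdigest()
    raise ValueError(f"{path!r} is neither a regular file nor a directory")
```

Because the directory digest is derived from the sorted list of per-file hashes, changing, adding, or removing any file changes the directory hash, which is the property the Merkle-tree comparison refers to.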

The _build method forwards its positional arguments to build: a hash file database (odb), a filesystem path, a filesystem object, and a hash algorithm name. It streams file contents through the hash function with progress reporting (unless no_progress_bar is set) and returns a tuple of (HashFileDB, Meta, HashFile), where Meta contains size and file count information and HashFile carries the computed hash_info. After _build completes, save() assigns self.meta, self.obj, and self.hash_info on the output object, making the computed hash available for subsequent serialization to metafiles and transfer to the cache.
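Streaming content through a hash function while reporting progress can be sketched as follows. This is an illustration under stated assumptions rather than the code in dvc_data: SketchMeta and stream_hash are hypothetical names, and the callback here simply receives the number of bytes consumed per chunk.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, Optional


@dataclass
class SketchMeta:
    """Hypothetical stand-in for a metadata record; tracks only total size."""

    size: int


def stream_hash(
    fobj: BinaryIO,
    hash_name: str = "md5",
    chunk_size: int = 64 * 1024,
    callback: Optional[Callable[[int], None]] = None,
) -> tuple[SketchMeta, str]:
    """Hash a binary stream chunk by chunk, reporting progress per chunk."""
    digest = hashlib.new(hash_name)
    size = 0
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
        if callback is not None:
            callback(len(chunk))  # e.g. advance a progress bar
    return SketchMeta(size=size), digest.hexdigest()


if __name__ == "__main__":
    progress = []
    meta, value = stream_hash(io.BytesIO(b"abc"), chunk_size=2, callback=progress.append)
    print(meta.size, progress, value)
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, and collecting the size during the same pass avoids a second read of the data.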

Usage

Use Output.save when you need to compute or refresh the content hash of a DVC output. It is called internally by the dvc add workflow (through Stage.save) and by dvc commit when updating hashes for pipeline outputs. It is also useful when building custom tooling that detects changes to tracked files by comparing current hashes against previously recorded values.
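The change-detection idea mentioned above reduces to recomputing a hash and comparing it with a stored value. The sketch below uses only hashlib so that it runs without a DVC repository; current_md5 and has_changed are hypothetical helpers, and in real tooling the recorded value would presumably come from a metafile and the current one from out.hash_info.value after out.save().

```python
import hashlib
import os
import tempfile


def current_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hex MD5 digest of the file as it exists right now."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, recorded_md5: str) -> bool:
    """Return True when the file's current hash differs from the recorded one."""
    return current_md5(path) != recorded_md5


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "data.csv")
        with open(data, "w") as fobj:
            fobj.write("a,b\n")
        recorded = current_md5(data)      # value stored at tracking time
        print(has_changed(data, recorded))
        with open(data, "a") as fobj:
            fobj.write("1,2\n")           # modify the tracked file
        print(has_changed(data, recorded))
```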

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/output.py
  • Lines: L682-727 (save), L538-550 (_build)

Signature

class Output:
    def save(self) -> None:
        """Compute content hash and populate hash_info, meta, and obj."""
        ...

    def _build(
        self,
        *args,
        no_progress_bar: bool = False,
        **kwargs,
    ) -> tuple["HashFileDB", "Meta", "HashFile"]:
        """Build a HashFile object by computing hashes of file contents.

        Delegates to dvc_data.build() with progress tracking.
        """
        ...

Import

from dvc.output import Output

I/O Contract

Inputs

Name | Type | Required | Description
self | Output | Yes | The Output instance representing a tracked file or directory. Must have fs (filesystem), fs_path (absolute path), hash_name (hash algorithm, typically "md5"), cache (HashFileDB for content-addressable storage), and use_cache (bool) attributes populated.
no_progress_bar | bool | No | When True, suppresses the progress bar during hash computation. Defaults to False. Only applies to _build.

Outputs

Name | Type | Description
self.hash_info | HashInfo | Populated with the computed content hash. Contains value (the hex digest string, e.g., "d41d8cd98f00b204e9800998ecf8427e"), name (the hash algorithm), and isdir (True for directories).
self.meta | Meta | Populated with file metadata: size (total bytes), nfiles (number of files, for directories), and isexec (whether the file is executable).
self.obj | HashFile | The built hash file object. For files, a single HashFile; for directories, a Tree object containing the manifest of all files and their individual hashes.
(_build return) | tuple[HashFileDB, Meta, HashFile] | A tuple of the staging database used during the build, the computed metadata, and the hash file object. Used internally by save() and other methods.

Usage Examples

Basic Usage

from dvc.repo import Repo

# Open a DVC repository and get a tracked output
repo = Repo()
stages = list(repo.index.stages)

# Access the first output of the first stage
stage = stages[0]
out = stage.outs[0]

# Save (compute hash) for the output
out.save()

# Inspect the computed hash
print(f"Hash: {out.hash_info.value}")
print(f"Algorithm: {out.hash_info.name}")
print(f"Is directory: {out.hash_info.isdir}")
print(f"Size: {out.meta.size} bytes")

# For a directory output, obj is a Tree; iterating it yields
# (key, meta, hash_info) entries, where key is a tuple of path parts
if out.hash_info.isdir:
    print(f"Number of files: {out.meta.nfiles}")
    for key, _meta, entry_hash in out.obj:
        print(f"  {'/'.join(key)}: {entry_hash.value}")

Related Pages

Implements Principle
