Implementation:Iterative Dvc Output Add
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, Storage_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for hashing, caching, and relinking DVC-tracked output files between workspace and content-addressable cache, provided by the DVC library.
Description
The Output.add method in DVC's dvc/output.py module is the primary entry point for the cache transfer phase of the data tracking workflow. It orchestrates the full lifecycle of adding a data file to DVC: computing the content hash via _build, transferring the data into the content-addressable cache via otransfer, and then replacing the workspace copy with a cache-linked version via _checkout.
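The three phases described above (hash, transfer, relink) can be illustrated with a minimal standard-library sketch. This is not DVC's code: `toy_add`, `build`, `transfer`, `relink`, and `cache_path` are hypothetical names, SHA-256 stands in for whatever `hash_name` is configured, the two-character directory prefix is an assumed cache layout, and relinking is done with a plain copy.

```python
import hashlib
import shutil
from pathlib import Path


def build(path: Path) -> str:
    """Compute a content hash for the file (stand-in for the build step)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cache_path(cache_dir: Path, digest: str) -> Path:
    """Map a digest to a cache location (assumed layout: <prefix>/<rest>)."""
    return cache_dir / digest[:2] / digest[2:]


def transfer(path: Path, cache_dir: Path, digest: str) -> bool:
    """Copy the file into the cache; return False if it was already there."""
    dest = cache_path(cache_dir, digest)
    if dest.exists():
        return False  # deduplicated: identical content is stored only once
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, dest)
    return True


def relink(path: Path, cache_dir: Path, digest: str) -> None:
    """Delete the workspace file and recreate it from the cached object."""
    path.unlink()
    shutil.copyfile(cache_path(cache_dir, digest), path)


def toy_add(
    path: Path,
    cache_dir: Path,
    no_commit: bool = False,
    relink_file: bool = True,
) -> str:
    """Hash, then (unless no_commit) transfer to cache and optionally relink."""
    digest = build(path)
    if not no_commit:
        transfer(path, cache_dir, digest)
        if relink_file:
            relink(path, cache_dir, digest)
    return digest
```

Calling `toy_add(Path("data.csv"), Path(".toy_cache"))` would leave a copy of the file under `.toy_cache/` keyed by its digest, while `no_commit=True` only returns the digest.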
The method handles both file and directory outputs. For directories, if a specific sub-path is being added (e.g., adding a single file into an already-tracked directory), it applies the change to the existing Tree object via apply and updates the tree in cache via add_update_tree. For files being removed from a tracked directory, it calls unstage to update the Tree accordingly.
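As a rough illustration of the directory case, a tracked directory can be modeled as a mapping from relative path to content digest. The sketch below is a toy model only: `ToyTree`, `apply_entry`, and `tree_digest` are hypothetical names and do not reflect the actual `Tree` object, `apply`, `add_update_tree`, or `unstage` implementations.

```python
import hashlib
import json
from typing import Optional

# Toy stand-in for a directory listing: relative path -> content digest.
ToyTree = dict[str, str]


def apply_entry(tree: ToyTree, rel_path: str, digest: Optional[str]) -> ToyTree:
    """Return a new tree with rel_path added or updated.

    Passing digest=None removes the entry, mimicking a file that was
    deleted from the tracked directory.
    """
    updated = dict(tree)  # leave the original listing untouched
    if digest is None:
        updated.pop(rel_path, None)
    else:
        updated[rel_path] = digest
    return updated


def tree_digest(tree: ToyTree) -> str:
    """Hash the sorted listing so that equal trees get equal digests."""
    payload = json.dumps(sorted(tree.items())).encode()
    return hashlib.sha256(payload).hexdigest()
```

Because only the changed entry is touched, adding one file to a large directory does not require rehashing the entries that are already listed.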
After the hash is computed and the output object's hash_info, meta, and obj attributes are populated, the method proceeds to the commit phase (unless no_commit=True is specified). The commit transfers data from the staging area to the permanent cache using otransfer, which handles deduplication: if the hash already exists in cache, no data is physically copied. Finally, if relink=True, the workspace file is deleted and recreated as a link (hardlink, symlink, reflink, or copy depending on cache configuration) pointing to the cached version.
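The relink step can be pictured as trying link strategies in a preferred order and falling back when one fails. The following sketch assumes that behavior for illustration; `link_from_cache` and `LINKERS` are hypothetical names, and reflink is omitted because the Python standard library has no portable call for it.

```python
import os
import shutil
from pathlib import Path
from typing import Sequence

# Link strategies available in this toy; each takes (source, destination).
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_from_cache(
    cached: Path,
    workspace: Path,
    link_types: Sequence[str] = ("hardlink", "symlink", "copy"),
) -> str:
    """Replace workspace with a link to cached; return the strategy used."""
    src = cached.resolve()  # absolute path so a symlink stays valid
    if workspace.exists() or workspace.is_symlink():
        workspace.unlink()  # the workspace copy is deleted first
    for name in link_types:
        try:
            LINKERS[name](src, workspace)
            return name
        except OSError:
            continue  # strategy unsupported here; try the next one
    raise OSError(f"could not link {workspace} using any of {list(link_types)}")
```

For example, `link_from_cache(cached, workspace, link_types=("copy",))` always produces an independent copy, which is the safest choice when the workspace file will be edited in place.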
The companion methods commit (L752-800) and checkout (L939-986) handle the individual sub-operations. commit transfers a built hash object from staging to permanent cache and optionally relinks. checkout restores a workspace file from cache based on recorded hash information, used during dvc checkout operations.
Usage
Use Output.add when programmatically adding data files to DVC tracking. It is called internally by the dvc add command's _add helper function. Use Output.commit when a pipeline stage has completed and its outputs need to be persisted to cache. Use Output.checkout when restoring workspace files to match a specific recorded version.
Code Reference
Source Location
- Repository: DVC
- File: dvc/output.py
- Lines: L1362-1449 (add), L752-800 (commit), L939-986 (checkout)
Signature
class Output:
    def add(
        self,
        path: Optional[str] = None,
        no_commit: bool = False,
        relink: bool = True,
    ) -> Optional["HashFile"]:
        """Hash the output, transfer to cache, and relink workspace file.

        Args:
            path: Specific sub-path to add (for directory outputs).
                Defaults to self.fs_path.
            no_commit: If True, compute hash but skip cache transfer.
            relink: If True, replace workspace file with a cache link.

        Returns:
            The HashFile object for the newly built content, or None.
        """
        ...

    def commit(
        self,
        filter_info: Optional[str] = None,
        relink: bool = True,
    ) -> None:
        """Transfer built hash object from staging to permanent cache."""
        ...

    def checkout(
        self,
        force: bool = False,
        progress_callback: "Callback" = DEFAULT_CALLBACK,
        relink: bool = False,
        filter_info: Optional[str] = None,
        allow_missing: bool = False,
        **kwargs,
    ) -> Optional[tuple[bool, Optional[bool]]]:
        """Restore workspace file from cache based on recorded hash."""
        ...
Import
from dvc.output import Output
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| self | Output | Yes | The Output instance representing the tracked file or directory. Must have fs, fs_path, hash_name, cache (HashFileDB), and repo attributes populated. |
| path | Optional[str] | No | An optional specific filesystem path to add. When provided and different from self.fs_path, the content at this path is merged into the output's existing Tree (for directory outputs). Defaults to self.fs_path. |
| no_commit | bool | No | When True, the content hash is computed and the output metadata is updated, but data is not transferred to the permanent cache. Useful for staging changes without finalizing. Defaults to False. |
| relink | bool | No | When True, after cache transfer the workspace file is replaced with a link to the cached version (using the configured link type). Defaults to True. |
| force | bool | No | When True (in checkout), forces overwriting of workspace files even if they have been modified. Defaults to False. |
| filter_info | Optional[str] | No | In commit and checkout, limits the operation to a specific sub-path within a directory output for granular updates. |
| allow_missing | bool | No | In checkout, if True, returns None instead of raising CheckoutError when the cached object is not found. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| (add return) | Optional[HashFile] | The HashFile object for the newly built content. Returns None in edge cases such as when unstaging a removed file from a directory. Side effects: self.hash_info, self.meta, and self.obj are updated; data is transferred to cache (unless no_commit); workspace file is relinked (unless relink=False). |
| (commit return) | None | No return value. Side effect: data is transferred from staging database to permanent cache, and the workspace file is optionally relinked. |
| (checkout return) | Optional[tuple[bool, Optional[bool]]] | A tuple of (added, modified) booleans. added is True if the file did not exist before checkout. modified is True if the file was changed, False if unchanged, or None in edge cases. Returns None if caching is disabled or the object is not found with allow_missing=True. |
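To make the checkout return contract concrete, here is a toy model of restoring a file from a content-addressable cache and reporting `(added, modified)`. It is a simplified interpretation of the table above, not DVC's logic: `toy_checkout` and `file_digest` are hypothetical, SHA-256 and the cache layout are assumptions, a missing object raises `FileNotFoundError` rather than `CheckoutError`, and a modified file without `force` raises instead of prompting.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """Content hash used as the cache key in this toy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def toy_checkout(
    workspace: Path,
    cache_dir: Path,
    digest: str,
    force: bool = False,
    allow_missing: bool = False,
) -> Optional[tuple[bool, Optional[bool]]]:
    """Restore workspace from the cached object identified by digest."""
    cached = cache_dir / digest[:2] / digest[2:]  # assumed cache layout
    if not cached.exists():
        if allow_missing:
            return None  # tolerated: nothing to restore
        raise FileNotFoundError(f"no cached object for {digest}")

    if not workspace.exists():
        shutil.copyfile(cached, workspace)
        return True, None  # toy choice: "modified" is not meaningful here

    if file_digest(workspace) == digest:
        return False, False  # already matches the recorded version

    if not force:
        raise FileExistsError(f"{workspace} has local changes; pass force=True")
    shutil.copyfile(cached, workspace)  # overwrite the modified file
    return False, True
```

A caller would typically check for `None` first and only then unpack the tuple, as in `result = toy_checkout(...)` followed by `added, modified = result`.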
Usage Examples
Basic Usage
from dvc.repo import Repo

# Assumes the current directory is an initialized DVC repository containing data.csv
repo = Repo()

# Simulate the dvc add workflow for a single file
stage = repo.stage.create(
    single_stage=True,
    fname="data.csv.dvc",
    outs=["data.csv"],
)
out = stage.outs[0]

# Add the output: hash, transfer to cache, and relink
hash_file = out.add()
print(f"Cached hash: {out.hash_info.value}")
print(f"File size: {out.meta.size}")

# Later, commit after a pipeline run (update cache)
out.commit()

# Checkout a specific version from cache
result = out.checkout(force=True)
if result:
    added, modified = result
    print(f"Added: {added}, Modified: {modified}")