Workflow:Iterative Dvc Data Tracking

From Leeroopedia


Knowledge Sources
Domains Data_Versioning, MLOps
Last Updated 2026-02-10 10:30 GMT

Overview

End-to-end process for tracking data files and directories under DVC version control, storing content-addressed copies in the local cache while keeping lightweight pointer files (`.dvc` files) in Git.

Description

This workflow covers the standard procedure for adding data files or directories to DVC tracking. When a file is added, DVC computes its content hash, moves or copies the data into the local cache (`.dvc/cache`), and creates a `.dvc` metadata file that records the hash, size, and path. The `.dvc` file is small enough to commit to Git, while the actual data remains outside Git's purview. This enables Git-based versioning of arbitrarily large datasets and model files.

Goal: A `.dvc` pointer file committed to Git, with the original data cached locally.

Scope: From raw workspace files to cached, Git-trackable state.

Strategy: Content-addressable storage with hash-based deduplication and configurable cache link types (symlink, hardlink, copy, or reflink).

Usage

Execute this workflow when you have data files (datasets, model weights, large binaries) in your workspace that you want to version-control without storing them directly in Git. Typical triggers include:

  • Adding a new dataset to the project for the first time
  • Updating an existing tracked dataset with new content
  • Tracking model artifacts produced outside a DVC pipeline

Execution Steps

Step 1: Resolve Targets

Identify the files or directories to be tracked. DVC expands glob patterns if requested and resolves each target path to an absolute workspace location. For each target, DVC checks whether it already corresponds to a tracked output, in which case the existing `.dvc` file is updated, or whether a new stage must be created.

Key considerations:

  • Glob expansion is optional and must be explicitly enabled with the `--glob` option
  • If the target overlaps with a pipeline-tracked output, DVC rejects the operation and advises using `dvc commit` instead
  • Multiple targets can be added in a single invocation
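
The target-resolution behaviour described above can be sketched with the standard library alone. This is an illustration, not DVC's own code: `resolve_targets` and its `use_glob` parameter are hypothetical names chosen for the example, and the check against existing tracked outputs is left out.

```python
import glob
import os


def resolve_targets(targets, use_glob=False):
    """Expand optional glob patterns and return absolute paths, in order, without duplicates."""
    resolved = []
    for target in targets:
        # Treat the target as a pattern only when globbing was explicitly requested.
        matches = sorted(glob.glob(target, recursive=True)) if use_glob else [target]
        for match in matches:
            path = os.path.abspath(match)
            if path not in resolved:
                resolved.append(path)
    return resolved
```

With `use_glob=False` a target such as `data/*.csv` is kept as a literal path, which mirrors the opt-in nature of glob expansion.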

Step 2: Validate Dependency Graph

Before modifying any files, DVC validates that the new outputs do not conflict with the existing pipeline graph. This check ensures no overlapping output paths exist and no duplicate outputs are registered across stages.

Key considerations:

  • Overlapping parent-child output paths are rejected with specific guidance
  • The full repository index graph is consulted for validation
  • This step prevents data corruption from conflicting stage definitions
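
A minimal sketch of the overlap check, assuming outputs are represented as POSIX-style relative paths. The function name `find_conflicts` is hypothetical, and the real validation consults the full repository index rather than a flat list.

```python
from pathlib import PurePosixPath


def find_conflicts(existing_outputs, new_output):
    """Return existing outputs that equal, contain, or are contained by new_output."""
    new = PurePosixPath(new_output)
    conflicts = []
    for out in existing_outputs:
        old = PurePosixPath(out)
        # Duplicate output, new path nested under an old one, or old nested under new.
        if new == old or old in new.parents or new in old.parents:
            conflicts.append(out)
    return conflicts
```

Comparing path components rather than string prefixes avoids false positives such as `data/raw` versus `data/rawer`.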

Step 3: Compute Content Hash

For each target file or directory, DVC computes a content-addressable hash. Files are hashed individually; directories are hashed by computing a manifest of all contained files and their individual hashes. DVC 3.x hashes the raw file bytes with MD5 (recorded as `hash: md5` in the metafile), whereas data tracked by DVC 2.x used the legacy `md5-dos2unix` variant, which normalized line endings in text files before hashing.

Key considerations:

  • Directory hashing involves recursively hashing all contained files
  • The `.dvcignore` file controls which files within directories are excluded
  • Hash computation respects the configured hash algorithm
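
The file and directory hashing described in this step can be approximated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, `.dvcignore` filtering is omitted, and the exact manifest serialization DVC uses may differ from the JSON layout chosen here.

```python
import hashlib
import json
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file's raw bytes in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_manifest(root):
    """List every file under root with its hash and POSIX-style relative path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    # Sort so the manifest does not depend on filesystem traversal order.
    entries.sort(key=lambda entry: entry["relpath"])
    return entries


def dir_hash(root):
    """Hash the serialized manifest; the '.dir' suffix marks a directory hash."""
    payload = json.dumps(dir_manifest(root), sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"
```

Because the directory hash is derived from the manifest, changing any contained file changes the directory hash, while unchanged files keep their individual hashes and can be deduplicated in the cache.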

Step 4: Transfer to Cache

The original file content is transferred from the workspace to the local cache directory (`.dvc/cache`). The cache uses content-addressable storage, meaning files are stored under paths derived from their hash. The transfer mechanism depends on the configured link type.

Key considerations:

  • Link types include reflink (copy-on-write), hardlink, symlink, and copy
  • Reflink is preferred where supported as it is both fast and space-efficient
  • If linking fails, DVC falls back to copying and warns the user
  • The `--no-commit` option (`no_commit` in the Python API) can defer this step for batch operations
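
The following sketch illustrates hash-derived cache paths and the link-type fallback. It is not DVC's implementation: the function names are hypothetical, the two-character subdirectory layout is an assumption made for the example, and reflink is omitted because the Python standard library has no portable call for it.

```python
import os
import shutil
import warnings


def cache_path(cache_dir, md5_hex):
    """Derive the cache location from the hash (layout assumed for illustration)."""
    return os.path.join(cache_dir, md5_hex[:2], md5_hex[2:])


LINKERS = {
    "hardlink": lambda src, dst: os.link(src, dst),
    "symlink": lambda src, dst: os.symlink(os.path.abspath(src), dst),
    "copy": lambda src, dst: shutil.copyfile(src, dst),
}


def link_from_cache(cached, workspace_path, link_types):
    """Try each link type in order and return the name of the first that succeeds."""
    for name in link_types:
        try:
            LINKERS[name](cached, workspace_path)
            return name
        except OSError as exc:
            warnings.warn(f"{name} failed ({exc}); trying the next link type")
    raise OSError(f"could not link {cached} to {workspace_path}")


def transfer_to_cache(workspace_path, cache_dir, md5_hex,
                      link_types=("hardlink", "symlink", "copy")):
    """Move the file into the cache, then link it back into the workspace."""
    target = cache_path(cache_dir, md5_hex)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if os.path.exists(target):
        os.remove(workspace_path)  # identical content is already cached
    else:
        shutil.move(workspace_path, target)
    return target, link_from_cache(target, workspace_path, link_types)
```

Passing `link_types=("copy",)` forces a plain copy, which costs disk space but leaves the workspace file independent of the cache.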

Step 5: Create or Update DVC Metafile

DVC writes a `.dvc` file (or updates an existing one) that records the output hash, file size, and relative path. This metafile serves as the pointer from Git to the cached data. For new targets, a corresponding `.gitignore` entry is also created to prevent Git from tracking the raw data file.

Key considerations:

  • The `.dvc` file uses a YAML-based schema with version, output hash, size, and path fields
  • Existing `.dvc` files are updated in-place when re-adding modified data
  • The `.gitignore` entry is managed automatically via the SCM context
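
To make the metafile step concrete, the sketch below writes a simplified `.dvc` file and the matching `.gitignore` entry by hand. The function name `write_dvc_file` is hypothetical and the field set is reduced to hash, size, and path; real `.dvc` files are produced by DVC itself and may carry additional fields.

```python
import os


def write_dvc_file(data_path, md5_hex, size):
    """Write '<data_path>.dvc' and make sure the data file is listed in .gitignore."""
    directory, name = os.path.split(os.path.abspath(data_path))
    metafile = os.path.join(directory, name + ".dvc")
    lines = [
        "outs:",
        f"- md5: {md5_hex}",
        f"  size: {size}",
        f"  path: {name}",  # relative to the metafile's own directory
    ]
    # Overwriting in place covers the "re-add modified data" case.
    with open(metafile, "w", encoding="utf-8") as fobj:
        fobj.write("\n".join(lines) + "\n")

    gitignore = os.path.join(directory, ".gitignore")
    entry = "/" + name
    content = ""
    if os.path.exists(gitignore):
        with open(gitignore, encoding="utf-8") as fobj:
            content = fobj.read()
    if entry not in content.splitlines():
        with open(gitignore, "a", encoding="utf-8") as fobj:
            if content and not content.endswith("\n"):
                fobj.write("\n")
            fobj.write(entry + "\n")
    return metafile
```

Calling the function a second time with a new hash rewrites the metafile but leaves the existing `.gitignore` entry untouched.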

Step 6: Stage Git Changes

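When the `core.autostage` config option is enabled, DVC stages the newly created or modified `.dvc` files and any updated `.gitignore` files for the next Git commit; otherwise it prints the `git add` command for the user to run. Either way, the bookkeeping is handled by the SCM context manager, which batches all Git operations.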

Key considerations:

  • The user still needs to run `git commit` to finalize versioning
  • If the `--to-remote` flag is used, data is transferred directly to a remote instead of the local cache
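
The batching idea behind the SCM context can be sketched as a small context manager that collects paths and issues at most one `git add` on exit. The class `ScmBatch`, its `autostage` switch, and the injectable `runner` are hypothetical names for this illustration and do not reflect DVC's internal API.

```python
import subprocess


class ScmBatch:
    """Collect files to stage and handle them in a single Git call on exit."""

    def __init__(self, autostage=False, runner=None):
        self.autostage = autostage
        # The runner is injectable so the sketch can be exercised without Git.
        self.runner = runner or (lambda cmd: subprocess.run(cmd, check=True))
        self.pending = []
        self.hint = None

    def track(self, path):
        if path not in self.pending:
            self.pending.append(path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Stage nothing if the block failed or produced no files.
        if exc_type is not None or not self.pending:
            return False
        command = ["git", "add", "--", *self.pending]
        if self.autostage:
            self.runner(command)
        else:
            self.hint = " ".join(command)  # shown to the user instead of run
        return False
```

In either mode the commit itself is left to the user, matching the note above about running `git commit`.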
