Workflow:Iterative Dvc Data Tracking
| Knowledge Sources | |
|---|---|
| Domains | Data_Versioning, MLOps |
| Last Updated | 2026-02-10 10:30 GMT |
Overview
End-to-end process for tracking data files and directories under DVC version control, storing content-addressed copies in the local cache while keeping lightweight pointer files (`.dvc` files) in Git.
Description
This workflow covers the standard procedure for adding data files or directories to DVC tracking. When a file is added, DVC computes its content hash, moves or copies the data into the local cache (`.dvc/cache`), and creates a `.dvc` metadata file that records the hash, size, and path. The `.dvc` file is small enough to commit to Git, while the actual data remains outside Git's purview. This enables Git-based versioning of arbitrarily large datasets and model files.
Goal: A `.dvc` pointer file committed to Git, with the original data cached locally.
Scope: From raw workspace files to cached, Git-trackable state.
Strategy: Content-addressable storage with hash-based deduplication and configurable cache link types (symlink, hardlink, copy, or reflink).
Usage
Execute this workflow when you have data files (datasets, model weights, large binaries) in your workspace that you want to version-control without storing them directly in Git. Typical triggers include:
- Adding a new dataset to the project for the first time
- Updating an existing tracked dataset with new content
- Tracking model artifacts produced outside a DVC pipeline
Execution Steps
Step 1: Resolve Targets
Identify the files or directories to be tracked. DVC expands glob patterns if provided and resolves each target path to an absolute workspace location. For each target, DVC checks whether it already corresponds to an existing tracked output (updating an existing `.dvc` file) or whether a new stage must be created.
Key considerations:
- Glob expansion is opt-in and must be enabled explicitly with the `--glob` option
- If the target overlaps with a pipeline-tracked output, DVC rejects the operation and advises using `dvc commit` instead
- Multiple targets can be added in a single invocation
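The target-resolution behavior described above can be sketched in plain Python. This is an illustration only, not DVC's implementation: `resolve_targets` and its `use_glob` parameter are hypothetical names, and the check against existing tracked outputs is omitted.

```python
import glob
import os


def resolve_targets(targets, use_glob=False):
    """Return absolute paths for the targets, expanding globs only on request."""
    resolved = []
    for target in targets:
        if use_glob:
            # Glob expansion is opt-in; sort the matches for a deterministic order.
            matches = sorted(glob.glob(target, recursive=True))
        else:
            # Without glob mode the target string is taken literally.
            matches = [target]
        resolved.extend(os.path.abspath(match) for match in matches)
    return resolved
```

With `use_glob=False`, a pattern such as `data/*.csv` is treated as a literal path, which mirrors the opt-in nature of glob expansion.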
Step 2: Validate Dependency Graph
Before modifying any files, DVC validates that the new outputs do not conflict with the existing pipeline graph. This check ensures no overlapping output paths exist and no duplicate outputs are registered across stages.
Key considerations:
- Overlapping parent-child output paths are rejected with specific guidance
- The full repository index graph is consulted for validation
- This step prevents data corruption from conflicting stage definitions
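A minimal sketch of the overlap check, assuming outputs are given as POSIX-style relative paths. The function name `find_overlaps` is hypothetical, and DVC's real validation works on its repository index graph rather than on a flat list of strings.

```python
from pathlib import PurePosixPath


def find_overlaps(existing_outputs, new_output):
    """Return existing outputs that equal, contain, or are contained by new_output."""
    new = PurePosixPath(new_output)
    conflicts = []
    for out in existing_outputs:
        old = PurePosixPath(out)
        # A conflict is a duplicate path or a parent-child relationship
        # in either direction.
        if new == old or old in new.parents or new in old.parents:
            conflicts.append(out)
    return conflicts
```

Comparing path components rather than string prefixes avoids false positives such as `data/raw` versus `data/rawish`.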
Step 3: Compute Content Hash
For each target file or directory, DVC computes a content-addressable hash. Files are hashed individually; directories are hashed by computing a manifest of all contained files and their individual hashes. DVC 3.x hashes the raw file bytes with MD5; DVC 2.x used a variant (`md5-dos2unix`) that normalized line endings in text files before hashing, and that variant is still recognized for legacy compatibility.
Key considerations:
- Directory hashing involves recursively hashing all contained files
- The `.dvcignore` file controls which files within directories are excluded
- Hash computation respects the configured hash algorithm
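The file and directory hashing described above can be sketched as follows. This is a simplified illustration under stated assumptions: the manifest is modeled as a JSON list of `md5`/`relpath` entries sorted by path, and the directory hash carries a `.dir` suffix. The helper names and the `ignored` parameter (standing in for `.dvcignore` rules) are hypothetical, and the exact serialization DVC uses may differ.

```python
import hashlib
import json
import os


def hash_file(path, chunk_size=1024 * 1024):
    """MD5 of a file's raw bytes, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def hash_directory(root, ignored=()):
    """Hash a directory through a sorted manifest of its files."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            relpath = os.path.relpath(full, root).replace(os.sep, "/")
            if relpath in ignored:
                # Stand-in for .dvcignore: excluded files never reach the manifest.
                continue
            entries.append({"md5": hash_file(full), "relpath": relpath})
    # Sorting makes the manifest independent of filesystem traversal order.
    entries.sort(key=lambda entry: entry["relpath"])
    manifest = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(manifest).hexdigest() + ".dir", entries
```

Because the directory hash depends only on the manifest, changing, adding, or excluding any contained file produces a different directory hash, while identical content always yields the same one.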
Step 4: Transfer to Cache
The original file content is transferred from the workspace to the local cache directory (`.dvc/cache`). The cache uses content-addressable storage, meaning files are stored under paths derived from their hash. The transfer mechanism depends on the configured link type.
Key considerations:
- Link types include reflink (copy-on-write), hardlink, symlink, and copy
- Reflink is preferred where supported as it is both fast and space-efficient
- If linking fails, DVC falls back to copying and warns the user
- The `--no-commit` option (`no_commit` in the Python API) defers this step for batch operations; running `dvc commit` later stores the data in the cache
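The transfer and link-fallback logic can be sketched with the standard library. This is an illustration, not DVC's code: reflinks are left out because Python's standard library has no portable reflink call, the `files/md5/<first two hex characters>/<rest>` layout is assumed to match the DVC 3.x cache, and all function names are hypothetical.

```python
import os
import shutil


def cache_path(cache_dir, md5_hex):
    """Content-addressed location derived from the hash (assumed layout)."""
    return os.path.join(cache_dir, "files", "md5", md5_hex[:2], md5_hex[2:])


def link_from_cache(cached, workspace_path, link_types=("hardlink", "symlink", "copy")):
    """Try each link type in order and return the first one that succeeds."""
    linkers = {"hardlink": os.link, "symlink": os.symlink, "copy": shutil.copyfile}
    for link_type in link_types:
        try:
            linkers[link_type](cached, workspace_path)
            return link_type
        except OSError:
            # This link type is unsupported here; fall through to the next one.
            continue
    raise OSError(f"no usable link type among {link_types!r}")


def transfer_to_cache(workspace_path, cache_dir, md5_hex,
                      link_types=("hardlink", "symlink", "copy")):
    """Move a file into the cache, then link it back into the workspace."""
    workspace_path = os.path.abspath(workspace_path)
    dest = os.path.abspath(cache_path(cache_dir, md5_hex))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if os.path.exists(dest):
        # Identical content is already cached: keep one copy (deduplication).
        os.remove(workspace_path)
    else:
        shutil.move(workspace_path, dest)
    return dest, link_from_cache(dest, workspace_path, link_types)
```

Placing `copy` last in `link_types` gives the fallback behavior described above: cheaper link types are attempted first and copying is the last resort.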
Step 5: Create or Update DVC Metafile
DVC writes a `.dvc` file (or updates an existing one) that records the output hash, file size, and relative path. This metafile serves as the pointer from Git to the cached data. For new targets, a corresponding `.gitignore` entry is also created to prevent Git from tracking the raw data file.
Key considerations:
- The `.dvc` file uses a YAML-based schema with version, output hash, size, and path fields
- Existing `.dvc` files are updated in-place when re-adding modified data
- The `.gitignore` entry is managed automatically via the SCM context
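A minimal sketch of writing the pointer file and its `.gitignore` entry. The field layout (`md5`, `size`, `hash`, `path` under `outs`) is assumed to resemble what DVC 3.x writes for a single file; the YAML is emitted by hand to keep the example dependency-free, and `write_dvc_file` is a hypothetical helper, not a DVC function.

```python
import os


def write_dvc_file(data_path, md5_hex, size):
    """Write '<data_path>.dvc' and make sure Git ignores the data file itself."""
    directory, name = os.path.split(data_path)
    dvc_path = data_path + ".dvc"
    lines = [
        "outs:",
        f"- md5: {md5_hex}",
        f"  size: {size}",
        "  hash: md5",
        f"  path: {name}",  # stored relative to the .dvc file's directory
    ]
    with open(dvc_path, "w", encoding="utf-8") as fobj:
        fobj.write("\n".join(lines) + "\n")

    gitignore = os.path.join(directory, ".gitignore")
    entry = "/" + name  # leading slash anchors the rule to this directory
    existing = []
    if os.path.exists(gitignore):
        with open(gitignore, encoding="utf-8") as fobj:
            existing = fobj.read().splitlines()
    if entry not in existing:
        # Rewrite the whole file so a missing trailing newline cannot merge lines.
        with open(gitignore, "w", encoding="utf-8") as fobj:
            fobj.write("\n".join(existing + [entry]) + "\n")
    return dvc_path, gitignore
```

Calling the helper again with a new hash overwrites the `.dvc` file in place and leaves the `.gitignore` entry unduplicated, which matches the re-add behavior described above.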
Step 6: Stage Git Changes
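When the `core.autostage` configuration option is enabled, DVC automatically stages the newly created or modified `.dvc` files and any updated `.gitignore` files for the next Git commit; otherwise it prints the `git add` command for the user to run. Tracking of these files is handled by the SCM context manager, which batches all Git operations.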
Key considerations:
- The user still needs to run `git commit` to finalize versioning
- If the `--to-remote` flag is used, data is transferred directly to a remote instead of the local cache
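The whole workflow can be driven from a script by composing the command lines and running them in order. The sketch below only builds argument lists; `build_tracking_commands` and `run_commands` are hypothetical helpers, and the explicit `git add` step assumes automatic staging is not enabled.

```python
import posixpath
import subprocess


def build_tracking_commands(target, message, no_commit=False):
    """Return the argv lists that track a target with DVC and commit the pointer."""
    target = target.rstrip("/")
    dvc_cmd = ["dvc", "add"]
    if no_commit:
        # Defers the cache transfer; 'dvc commit' completes it later.
        dvc_cmd.append("--no-commit")
    dvc_cmd.append(target)
    # The .gitignore entry lives next to the tracked file or directory.
    gitignore = posixpath.join(posixpath.dirname(target), ".gitignore")
    return [
        dvc_cmd,
        ["git", "add", target + ".dvc", gitignore],
        ["git", "commit", "-m", message],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, `run_commands(build_tracking_commands("data/raw.csv", "Track raw data"))` would run `dvc add`, stage the pointer file and `.gitignore`, and create the Git commit, provided both tools are installed and the working directory is inside an initialized repository.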