Implementation:Iterative Dvc Repo Datasets
Domains
Dataset_Management, Data_Versioning
Overview
Concrete tool for managing versioned dataset definitions, locks, and types in DVC projects. The module dvc/repo/datasets.py provides the Datasets class, which implements the Mapping protocol for named datasets, along with three dataset type families: DVC (repo-based), Datachain (catalog-based), and URL.
Description
The Datasets class acts as a read/write mapping over all datasets defined in a DVC project. Each dataset has two components:
- Spec -- the dataset definition as written in dvc.yaml (name, url, type, and type-specific fields).
- Lock -- a frozen version snapshot stored in dvc.lock that pins the dataset to a specific revision, version, or file metadata.
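The mapping-over-spec-and-lock idea can be sketched with the standard library alone. This is a minimal illustration, not the DVC implementation: ToySpec, ToyLock, ToyDataset, and ToyDatasets are hypothetical names, the URLs are placeholders, and the sketch assumes a dataset may exist with a spec but no lock.

```python
from collections.abc import Iterator, Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    # Hypothetical stand-in for a dataset definition in dvc.yaml.
    name: str
    url: str
    type: str


@dataclass(frozen=True)
class ToyLock:
    # Hypothetical stand-in for a pinned snapshot in dvc.lock.
    name: str
    url: str
    type: str
    rev_lock: str


@dataclass(frozen=True)
class ToyDataset:
    spec: ToySpec
    lock: Optional[ToyLock] = None  # assumption: a dataset may be unlocked


class ToyDatasets(Mapping):
    """Name-keyed view; Mapping only requires the three methods below."""

    def __init__(self, specs, locks):
        # Pair each spec with the lock of the same name, if one exists.
        self._items = {
            spec.name: ToyDataset(spec=spec, lock=locks.get(spec.name))
            for spec in specs
        }

    def __getitem__(self, name: str) -> ToyDataset:
        return self._items[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


specs = [
    ToySpec(name="train", url="https://example.com/repo", type="dvc"),
    ToySpec(name="raw", url="s3://bucket/raw", type="url"),
]
locks = {
    "train": ToyLock(
        name="train", url="https://example.com/repo", type="dvc", rev_lock="abc123"
    )
}
datasets = ToyDatasets(specs, locks)
```

Because ToyDatasets subclasses Mapping, membership tests, .keys(), .items(), and .get() come for free once the three abstract methods are defined.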
Dataset types:
| Type | Spec Class | Lock Class | Dataset Class | Description |
|---|---|---|---|---|
| dvc | DVCDatasetSpec | DVCDatasetLock | DVCDataset | Repository-based dataset. Spec includes path and optional rev. Lock adds rev_lock (resolved commit hash). |
| dc | DatasetSpec | DatachainDatasetLock | DatachainDataset | Catalog-based dataset from Datachain. Lock adds version (int) and created_at (datetime). |
| url | DatasetSpec | URLDatasetLock | URLDataset | Direct URL dataset. Lock adds meta (Meta) and optional files (list[FileInfo]). |
All spec and lock classes are attrs classes declared with @frozen and keyword-only initialization, so instances are immutable once created. They inherit from a SerDe mixin that provides to_dict() and from_dict() serialization methods; to_dict() filters out fields that still hold their default value and writes datetime values in ISO format.
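The serialization behaviour described above can be approximated as follows. This sketch deliberately swaps attrs for the standard-library dataclasses module so that it runs without extra dependencies; ToySerDe, ToyDVCSpec, and ToyDatachainLock are hypothetical names, the URLs are placeholders, and the real mixin may differ in detail.

```python
from dataclasses import MISSING, dataclass, fields
from datetime import datetime
from typing import Any, Optional


class ToySerDe:
    """Hypothetical mixin: dict round-trip that skips default-valued fields."""

    def to_dict(self) -> dict[str, Any]:
        out: dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if f.default is not MISSING and value == f.default:
                continue  # omit fields that still hold their default
            if isinstance(value, datetime):
                value = value.isoformat()  # datetimes are stored as ISO strings
            out[f.name] = value
        return out

    @classmethod
    def from_dict(cls, d: dict[str, Any]):
        kwargs = {}
        for f in fields(cls):
            if f.name not in d:
                continue  # missing keys fall back to the field default
            value = d[f.name]
            if f.type is datetime and isinstance(value, str):
                value = datetime.fromisoformat(value)
            kwargs[f.name] = value
        return cls(**kwargs)


@dataclass(frozen=True)
class ToyDVCSpec(ToySerDe):
    name: str
    url: str
    path: str
    rev: Optional[str] = None


@dataclass(frozen=True)
class ToyDatachainLock(ToySerDe):
    name: str
    url: str
    version: int
    created_at: datetime


spec = ToyDVCSpec(name="train", url="https://example.com/repo", path="data/train")
lock = ToyDatachainLock(
    name="images",
    url="dc://images",  # placeholder URL, not taken from the source
    version=3,
    created_at=datetime(2024, 1, 2, 3, 4, 5),
)
spec_dict = spec.to_dict()
lock_dict = lock.to_dict()
restored = ToyDatachainLock.from_dict(lock_dict)
```

Dropping default-valued fields keeps the serialized entries short, and the ISO string keeps the timestamp readable in a YAML file while still round-tripping to a datetime.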
Invalidation logic: When the lock data does not match the current spec (detected by converting the lock back to a spec via to_spec()), the lock is invalidated and set to None, with the _invalidated flag set to True. This signals that dvc repro or dvc status should re-resolve the dataset.
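A minimal sketch of this check, assuming the comparison is a plain equality test between the current spec and the spec recovered from the lock. ToySpec, ToyLock, ToyDataset, and build_dataset are hypothetical names used only for illustration; only to_spec() and _invalidated come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    name: str
    url: str
    path: str
    rev: Optional[str] = None


@dataclass(frozen=True)
class ToyLock:
    name: str
    url: str
    path: str
    rev: Optional[str]
    rev_lock: str

    def to_spec(self) -> ToySpec:
        # Drop the lock-only field to recover the spec this lock was made from.
        return ToySpec(name=self.name, url=self.url, path=self.path, rev=self.rev)


@dataclass
class ToyDataset:
    spec: ToySpec
    lock: Optional[ToyLock] = None
    _invalidated: bool = False


def build_dataset(spec: ToySpec, lock: Optional[ToyLock]) -> ToyDataset:
    """Keep the lock only if it still describes the current spec."""
    if lock is not None and lock.to_spec() != spec:
        return ToyDataset(spec=spec, lock=None, _invalidated=True)
    return ToyDataset(spec=spec, lock=lock)


lock = ToyLock(
    name="train",
    url="https://example.com/repo",
    path="data/train",
    rev="main",
    rev_lock="abc123",
)
same_spec = ToySpec(
    name="train", url="https://example.com/repo", path="data/train", rev="main"
)
edited_spec = ToySpec(
    name="train", url="https://example.com/repo", path="data/train", rev="v2.0"
)

fresh = build_dataset(same_spec, lock)
stale = build_dataset(edited_spec, lock)
```

Here fresh keeps its lock, while stale loses it because the spec's rev was edited after the lock was written.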
Signature
class Datasets(Mapping[str, Dataset]):
    def __init__(self, repo: "Repo") -> None: ...
    def __getitem__(self, name: str) -> Dataset: ...
    def __iter__(self) -> Iterator[str]: ...
    def __len__(self) -> int: ...
    def add(
        self,
        name: str,
        url: str,
        type: str,
        manifest_path: StrPath = "dvc.yaml",
        **kwargs: Any,
    ) -> Dataset: ...
    def update(self, name, **kwargs) -> tuple[Dataset, Dataset]: ...
    def dump(self, dataset: Dataset, old: Optional[Dataset] = None) -> None: ...
Dataset update methods:
# DVCDataset.update resolves the revision lock via RepoDependency
class DVCDataset:
    def update(self, repo, rev: Optional[str] = None, **kwargs) -> "Self": ...

# DatachainDataset.update fetches version info from the catalog
class DatachainDataset:
    def update(self, repo, record=None, version=None, **kwargs) -> "Self": ...

# URLDataset.update saves dependency metadata and file info
class URLDataset:
    def update(self, repo, **kwargs) -> "Self": ...
Import
from dvc.repo.datasets import Datasets
Input/Output
| Method | Input | Output |
|---|---|---|
| add() | name: str, url: str, type: str (one of "dvc", "dc", "url"), manifest_path: StrPath (default "dvc.yaml"), plus type-specific kwargs | Dataset -- the newly created and locked dataset (one of DVCDataset, DatachainDataset, URLDataset) |
| update() | name: str, optional version: int or rev: str depending on type | tuple[Dataset, Dataset] -- (old dataset, new dataset) |
| dump() | dataset: Dataset, optional old: Dataset for change detection | None -- writes spec to dvc.yaml and lock to dvc.lock |
| __getitem__() | name: str | Dataset -- raises DatasetNotFoundError if missing |
Example
from dvc.repo import Repo
from dvc.repo.datasets import Datasets

with Repo() as repo:
    datasets = Datasets(repo)

    # Add a DVC-type dataset
    ds = datasets.add(
        name="training-data",
        url="https://github.com/org/data-repo",
        type="dvc",
        path="datasets/train",
        rev="main",
    )
    print(f"Locked at revision: {ds.lock.rev_lock}")

    # Update an existing dataset
    old, new = datasets.update("training-data", rev="v2.0")
    print(f"Updated from {old.lock.rev_lock} to {new.lock.rev_lock}")

    # Iterate all datasets
    for name in datasets:
        print(f"Dataset: {name}, type: {datasets[name].type}")