Implementation:Iterative Dvc Repo Datasets

Overview

Concrete tool for managing versioned dataset definitions, locks, and types in DVC projects. The module dvc/repo/datasets.py provides the Datasets class, which implements the Mapping protocol for named datasets, along with three dataset type families: DVC (repo-based), Datachain (catalog-based), and URL.

Description

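The Datasets class acts as a mapping over all datasets defined in a DVC project: lookup, iteration, and length follow the Mapping protocol, while add(), update(), and dump() create, re-resolve, and persist datasets. Each dataset has two components: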

Spec -- the dataset definition as written in dvc.yaml (name, url, type, and type-specific fields).
Lock -- a frozen version snapshot stored in dvc.lock that pins the dataset to a specific revision, version, or file metadata.
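
The spec/lock pairing and the mapping behaviour can be pictured with a minimal sketch. It uses standard-library dataclasses instead of DVC's own classes, and every name in it (`SpecSketch`, `LockSketch`, `DatasetSketch`, `DatasetsSketch`, `DatasetNotFoundSketch`) is a hypothetical stand-in, not the library's API.

```python
from collections.abc import Iterator, Mapping
from dataclasses import dataclass
from typing import Optional


class DatasetNotFoundSketch(KeyError):
    """Hypothetical stand-in for DVC's DatasetNotFoundError."""


@dataclass(frozen=True)
class SpecSketch:
    # What the user declares in dvc.yaml.
    name: str
    url: str
    type: str


@dataclass(frozen=True)
class LockSketch:
    # What gets pinned in dvc.lock (here: a resolved revision).
    name: str
    url: str
    type: str
    rev_lock: str


@dataclass(frozen=True)
class DatasetSketch:
    spec: SpecSketch
    lock: Optional[LockSketch] = None  # None until the dataset is resolved


class DatasetsSketch(Mapping[str, DatasetSketch]):
    """Mapping of dataset name -> dataset (illustrative only)."""

    def __init__(self, datasets: dict) -> None:
        self._datasets = dict(datasets)

    def __getitem__(self, name: str) -> DatasetSketch:
        try:
            return self._datasets[name]
        except KeyError as exc:
            # Translate the plain KeyError into a domain-specific error.
            raise DatasetNotFoundSketch(name) from exc

    def __iter__(self) -> Iterator[str]:
        return iter(self._datasets)

    def __len__(self) -> int:
        return len(self._datasets)


spec = SpecSketch(name="training-data", url="https://github.com/org/data-repo", type="dvc")
lock = LockSketch(name=spec.name, url=spec.url, type=spec.type, rev_lock="0" * 40)
datasets = DatasetsSketch({spec.name: DatasetSketch(spec=spec, lock=lock)})
```

Because the sketch only defines `__getitem__`, `__iter__`, and `__len__`, the remaining read operations (`in`, `keys()`, `items()`, `get()`) come from the `Mapping` base class. The error type subclasses `KeyError` here purely so that those inherited helpers keep working; whether DVC's own exception does the same is not something this page states.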

Dataset types:

| Type | Spec Class | Lock Class | Dataset Class | Description |
| --- | --- | --- | --- | --- |
| `dvc` | `DVCDatasetSpec` | `DVCDatasetLock` | `DVCDataset` | Repository-based dataset. Spec includes `path` and optional `rev`. Lock adds `rev_lock` (resolved commit hash). |
| `dc` | `DatasetSpec` | `DatachainDatasetLock` | `DatachainDataset` | Catalog-based dataset from Datachain. Lock adds `version` (int) and `created_at` (datetime). |
| `url` | `DatasetSpec` | `URLDatasetLock` | `URLDataset` | Direct URL dataset. Lock adds `meta` (Meta) and optional `files` (list[FileInfo]). |

All spec and lock classes are attrs classes declared with @frozen and keyword-only initialization. They inherit from a SerDe mixin that provides to_dict() and from_dict() for serialization, with filtering of default values and ISO-format handling of datetimes.
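
A rough sketch of such a serialization mixin follows. It is written with standard-library dataclasses rather than attrs, and it assumes that "default filtering" means omitting fields that still hold their default value. `SerDeSketch`, `LockSketch`, the `note` field, and the `dc://dogs` URL are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import Any


class SerDeSketch:
    """Hypothetical mixin: dict round-trip with default filtering."""

    def to_dict(self) -> dict[str, Any]:
        out: dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value == f.default:
                continue  # drop fields still at their default value
            if isinstance(value, datetime):
                value = value.isoformat()  # datetimes become ISO 8601 strings
            out[f.name] = value
        return out

    @classmethod
    def from_dict(cls, data: dict[str, Any]):
        kwargs = dict(data)
        for f in fields(cls):
            raw = kwargs.get(f.name)
            # Parse ISO strings back into datetimes for datetime-typed fields.
            if f.type is datetime and isinstance(raw, str):
                kwargs[f.name] = datetime.fromisoformat(raw)
        return cls(**kwargs)


@dataclass(frozen=True)
class LockSketch(SerDeSketch):
    name: str
    url: str
    version: int
    created_at: datetime
    note: str = ""  # hypothetical optional field, filtered when left empty


lock = LockSketch(
    name="dogs",
    url="dc://dogs",
    version=3,
    created_at=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
data = lock.to_dict()
restored = LockSketch.from_dict(data)
```

Here `data` contains only the four populated fields, with `created_at` stored as a string, and `restored` compares equal to `lock`. Freezing the classes keeps a loaded lock from being mutated in place, so any change has to produce a new object that can be compared against the old one.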

Invalidation logic: When the lock data does not match the current spec (detected by converting the lock back to a spec via to_spec()), the lock is invalidated and set to None, with the _invalidated flag set to True. This signals that dvc repro or dvc status should re-resolve the dataset.
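
The invalidation check can be sketched as follows. This is an illustration under stated assumptions, not DVC's code: `SpecSketch`, `LockSketch`, and `resolve_lock` are hypothetical names, and standard-library dataclasses stand in for the attrs classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecSketch:
    name: str
    url: str
    type: str
    path: str
    rev: Optional[str] = None


@dataclass(frozen=True)
class LockSketch(SpecSketch):
    rev_lock: str = ""  # lock-only field: the resolved commit hash

    def to_spec(self) -> SpecSketch:
        # Drop the lock-only field to recover the spec this lock was built from.
        return SpecSketch(self.name, self.url, self.type, self.path, self.rev)


def resolve_lock(spec: SpecSketch, lock: Optional[LockSketch]):
    """Return (lock, invalidated); the lock survives only if it matches the spec."""
    if lock is not None and lock.to_spec() != spec:
        return None, True
    return lock, False


common = dict(name="training-data", url="https://github.com/org/data-repo",
              type="dvc", path="datasets/train")
lock = LockSketch(**common, rev="main", rev_lock="0" * 40)

# Spec and lock agree: the lock is kept.
kept, invalidated_same = resolve_lock(SpecSketch(**common, rev="main"), lock)

# The spec was edited to point at another revision: the lock is discarded.
dropped, invalidated_edit = resolve_lock(SpecSketch(**common, rev="v2.0"), lock)
```

Comparing `lock.to_spec()` with the current spec means any edit to the definition, not just a changed revision, is enough to discard the stale lock.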

Signature

class Datasets(Mapping[str, Dataset]):
    def __init__(self, repo: "Repo") -> None: ...

    def __getitem__(self, name: str) -> Dataset: ...

    def __iter__(self) -> Iterator[str]: ...

    def __len__(self) -> int: ...

    def add(
        self,
        name: str,
        url: str,
        type: str,
        manifest_path: StrPath = "dvc.yaml",
        **kwargs: Any,
    ) -> Dataset: ...

    def update(self, name, **kwargs) -> tuple[Dataset, Dataset]: ...

    def dump(self, dataset: Dataset, old: Optional[Dataset] = None) -> None: ...

Dataset update methods:

# DVCDataset.update resolves the revision lock via RepoDependency
class DVCDataset:
    def update(self, repo, rev: Optional[str] = None, **kwargs) -> "Self": ...

# DatachainDataset.update fetches version info from the catalog
class DatachainDataset:
    def update(self, repo, record=None, version=None, **kwargs) -> "Self": ...

# URLDataset.update saves dependency metadata and file info
class URLDataset:
    def update(self, repo, **kwargs) -> "Self": ...

Import

from dvc.repo.datasets import Datasets

Input/Output

| Method | Input | Output |
| --- | --- | --- |
| `add()` | `name: str`, `url: str`, `type: str` (one of `"dvc"`, `"dc"`, `"url"`), `manifest_path: StrPath` (default `"dvc.yaml"`), plus type-specific kwargs | `Dataset` -- the newly created and locked dataset (one of `DVCDataset`, `DatachainDataset`, `URLDataset`) |
| `update()` | `name: str`, optional `version: int` or `rev: str` depending on type | `tuple[Dataset, Dataset]` -- (old dataset, new dataset) |
| `dump()` | `dataset: Dataset`, optional `old: Dataset` for change detection | `None` -- writes spec to `dvc.yaml` and lock to `dvc.lock` |
| `__getitem__()` | `name: str` | `Dataset` -- raises `DatasetNotFoundError` if missing |
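
One plausible reading of the `old` parameter of `dump()` is that each file is rewritten only when its part of the dataset differs from the previous version. The sketch below illustrates that reading only; `DatasetSketch` and `plan_dump` are hypothetical, and tuples stand in for the real spec and lock objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSketch:
    spec: tuple            # stand-in for the dvc.yaml entry
    lock: Optional[tuple]  # stand-in for the dvc.lock entry; None if unresolved


def plan_dump(dataset: DatasetSketch, old: Optional[DatasetSketch] = None) -> list:
    """Return the files a dump would need to rewrite (assumed behaviour)."""
    targets = []
    if old is None or old.spec != dataset.spec:
        targets.append("dvc.yaml")
    if dataset.lock is not None and (old is None or old.lock != dataset.lock):
        targets.append("dvc.lock")
    return targets


old = DatasetSketch(spec=("training-data", "main"), lock=("training-data", "main", "aaa"))
new = DatasetSketch(spec=("training-data", "main"), lock=("training-data", "main", "bbb"))

first_write = plan_dump(new)        # no previous version: both files
lock_only = plan_dump(new, old)     # same spec, new lock: only the lock file
nothing = plan_dump(old, old)       # nothing changed: no writes
```

Under this assumption, an update that only moves the pinned revision leaves `dvc.yaml` untouched, which keeps hand-edited manifests free of spurious diffs.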

Example

from dvc.repo import Repo
from dvc.repo.datasets import Datasets

with Repo() as repo:
    datasets = Datasets(repo)

    # Add a DVC-type dataset
    ds = datasets.add(
        name="training-data",
        url="https://github.com/org/data-repo",
        type="dvc",
        path="datasets/train",
        rev="main",
    )
    print(f"Locked at revision: {ds.lock.rev_lock}")

    # Update an existing dataset
    old, new = datasets.update("training-data", rev="v2.0")
    print(f"Updated from {old.lock.rev_lock} to {new.lock.rev_lock}")

    # Iterate all datasets
    for name in datasets:
        print(f"Dataset: {name}, type: {datasets[name].type}")
