Implementation:Iterative Dvc Repo Datasets

From Leeroopedia


Domains

Dataset_Management, Data_Versioning

Overview

Concrete tool for managing versioned dataset definitions, locks, and types in DVC projects. The module dvc/repo/datasets.py provides the Datasets class, which implements the Mapping protocol for named datasets, along with three dataset type families: DVC (repo-based), Datachain (catalog-based), and URL.

Description

The Datasets class acts as a mapping over all datasets defined in a DVC project. Lookup, iteration, and length follow the Mapping protocol, while add(), update(), and dump() handle writes. Each dataset has two components:

  • Spec -- the dataset definition as written in dvc.yaml (name, url, type, and type-specific fields).
  • Lock -- a frozen version snapshot stored in dvc.lock that pins the dataset to a specific revision, version, or file metadata.

Dataset types:

| Type | Spec Class | Lock Class | Dataset Class | Description |
|------|------------|------------|---------------|-------------|
| dvc | DVCDatasetSpec | DVCDatasetLock | DVCDataset | Repository-based dataset. Spec includes path and optional rev. Lock adds rev_lock (resolved commit hash). |
| dc | DatasetSpec | DatachainDatasetLock | DatachainDataset | Catalog-based dataset from Datachain. Lock adds version (int) and created_at (datetime). |
| url | DatasetSpec | URLDatasetLock | URLDataset | Direct URL dataset. Lock adds meta (Meta) and optional files (list[FileInfo]). |

All spec and lock classes use @frozen attrs dataclasses with keyword-only initialization. They inherit from a SerDe mixin that provides to_dict() and from_dict() serialization methods with default filtering and datetime ISO-format handling.

Invalidation logic: When the lock data does not match the current spec (detected by converting the lock back to a spec via to_spec()), the lock is invalidated and set to None, with the _invalidated flag set to True. This signals that dvc repro or dvc status should re-resolve the dataset.

Signature

class Datasets(Mapping[str, Dataset]):
    def __init__(self, repo: "Repo") -> None: ...

    def __getitem__(self, name: str) -> Dataset: ...

    def __iter__(self) -> Iterator[str]: ...

    def __len__(self) -> int: ...

    def add(
        self,
        name: str,
        url: str,
        type: str,
        manifest_path: StrPath = "dvc.yaml",
        **kwargs: Any,
    ) -> Dataset: ...

    def update(self, name, **kwargs) -> tuple[Dataset, Dataset]: ...

    def dump(self, dataset: Dataset, old: Optional[Dataset] = None) -> None: ...

Dataset update methods:

# DVCDataset.update resolves the revision lock via RepoDependency
class DVCDataset:
    def update(self, repo, rev: Optional[str] = None, **kwargs) -> "Self": ...

# DatachainDataset.update fetches version info from the catalog
class DatachainDataset:
    def update(self, repo, record=None, version=None, **kwargs) -> "Self": ...

# URLDataset.update saves dependency metadata and file info
class URLDataset:
    def update(self, repo, **kwargs) -> "Self": ...

Import

from dvc.repo.datasets import Datasets

Input/Output

| Method | Input | Output |
|--------|-------|--------|
| add() | name: str, url: str, type: str (one of "dvc", "dc", "url"), manifest_path: StrPath (default "dvc.yaml"), plus type-specific kwargs | Dataset -- the newly created and locked dataset (one of DVCDataset, DatachainDataset, URLDataset) |
| update() | name: str, optional version: int or rev: str depending on type | tuple[Dataset, Dataset] -- (old dataset, new dataset) |
| dump() | dataset: Dataset, optional old: Dataset for change detection | None -- writes spec to dvc.yaml and lock to dvc.lock |
| __getitem__() | name: str | Dataset -- raises DatasetNotFoundError if missing |
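
The Mapping behavior of the lookup methods can be sketched with a small in-memory class. SketchDatasets and SketchNotFoundError are hypothetical; in particular, making the error a KeyError subclass is an assumption chosen here so that the inherited get() and "in" helpers work, and the real DatasetNotFoundError may have a different base class.

```python
from collections.abc import Iterator, Mapping
from typing import Any


class SketchNotFoundError(KeyError):
    """Hypothetical error; subclassing KeyError is an assumption."""


class SketchDatasets(Mapping):
    def __init__(self, entries: dict[str, Any]) -> None:
        self._entries = dict(entries)

    def __getitem__(self, name: str) -> Any:
        try:
            return self._entries[name]
        except KeyError:
            # Re-raise under the domain-specific error type.
            raise SketchNotFoundError(name) from None

    def __iter__(self) -> Iterator[str]:
        return iter(self._entries)

    def __len__(self) -> int:
        return len(self._entries)


datasets = SketchDatasets({"training-data": {"type": "dvc"}})
```

Defining only __getitem__, __iter__, and __len__ is enough for Mapping to supply keys(), items(), get(), and membership tests.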

Example

from dvc.repo import Repo
from dvc.repo.datasets import Datasets

with Repo() as repo:
    datasets = Datasets(repo)

    # Add a DVC-type dataset
    ds = datasets.add(
        name="training-data",
        url="https://github.com/org/data-repo",
        type="dvc",
        path="datasets/train",
        rev="main",
    )
    print(f"Locked at revision: {ds.lock.rev_lock}")

    # Update an existing dataset
    old, new = datasets.update("training-data", rev="v2.0")
    print(f"Updated from {old.lock.rev_lock} to {new.lock.rev_lock}")

    # Iterate all datasets
    for name in datasets:
        print(f"Dataset: {name}, type: {datasets[name].type}")
