Implementation:Iterative Dvc Api Dataset

From Leeroopedia


Knowledge Sources
Domains API, Dataset_Management
Last Updated 2026-02-10 10:00 GMT

Overview

Public API function for retrieving dataset definitions with lock information from a DVC repository, provided by the DVC library.

Description

The dvc/api/dataset.py module (67 lines) exposes a single public function, get, that retrieves a named dataset from the current DVC repository and returns a typed dictionary containing the dataset's resolved location and version details. The function supports three dataset types, each represented by a dedicated TypedDict.

DatachainDataset (type "dc") represents datasets managed by DataChain and carries the dataset name and version number. DVCDataset (type "dvc") represents datasets tracked in a DVC repository and carries the repository URL, the path within it, and the locked Git SHA. URLDataset (type "url") represents datasets referenced by URL and carries a list of versioned file URLs together with the base path.

The get function opens the current DVC repository, looks up the dataset by name in repo.datasets, validates that the dataset is in sync and has lock information, and dispatches to the appropriate return type based on dataset.type. If the dataset name is not found, the function suggests close matches using difflib.get_close_matches. For URL-type datasets, the function resolves cloud filesystem classes and applies version IDs to construct fully-qualified URLs for each file.
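
The close-match suggestion described above can be illustrated with the standard library alone. The following is a minimal sketch, not DVC's actual code: the helper name suggest_dataset_name, the message wording, and the sample dataset names are hypothetical. Only difflib.get_close_matches is a real API here.

```python
from difflib import get_close_matches
from typing import Iterable


def suggest_dataset_name(name: str, known_names: Iterable[str]) -> str:
    """Build a 'not found' message, appending the closest known name if any.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    message = f"dataset {name!r} not found"
    # get_close_matches returns candidates ordered best-first
    # (default similarity cutoff is 0.6); n=1 keeps only the best one.
    matches = get_close_matches(name, list(known_names), n=1)
    if matches:
        message += f"; did you mean {matches[0]!r}?"
    return message


# Sample names are made up for the demonstration.
known = ["training-data", "validation-data", "my-dc-dataset"]
print(suggest_dataset_name("trainning-data", known))
print(suggest_dataset_name("zzz", known))
```

With the default cutoff, a name that shares nothing with the known names produces no suggestion, so the message is returned without a hint.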

Usage

Use get when you need to programmatically resolve a dataset's location and version for downstream consumption. Typical cases are passing dataset coordinates to a training script, verifying that a dataset is properly locked before a production run, and constructing direct download URLs for URL-type datasets.
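
The "verify before a production run" case could be sketched as a small pre-flight check. This is an illustration under stated assumptions: preflight_check and fake_get are hypothetical names, and the resolver is passed in as a parameter so the sketch runs without a DVC repository. The exceptions are caught broadly because the exact exception classes raised by get are not assumed here.

```python
from typing import Any, Callable, Mapping


def preflight_check(
    names: list[str],
    resolver: Callable[[str], Mapping[str, Any]],
) -> dict[str, str]:
    """Try to resolve every dataset; return {name: error message} for failures."""
    failures: dict[str, str] = {}
    for name in names:
        try:
            resolver(name)
        except Exception as exc:  # broad on purpose: exact error types are an assumption
            failures[name] = str(exc)
    return failures


# Stand-in resolver (hypothetical) that mimics an out-of-sync dataset.
def fake_get(name: str) -> Mapping[str, Any]:
    if name == "stale-data":
        raise ValueError("dataset not in sync")
    return {"type": "dc", "name": name, "version": 3}


print(preflight_check(["training-data", "stale-data"], fake_get))
```

Inside a real DVC repository, the get function imported from dvc.api.dataset would be passed as the resolver in place of fake_get. An empty result means every dataset resolved.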

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/api/dataset.py
  • Lines: L1-67

Signature

from typing import Literal, TypedDict, Union


class DatachainDataset(TypedDict):
    type: Literal["dc"]
    name: str
    version: int


class DVCDataset(TypedDict):
    type: Literal["dvc"]
    url: str
    path: str
    sha: str


class URLDataset(TypedDict):
    type: Literal["url"]
    files: list[str]
    path: str


def get(name: str) -> Union[DatachainDataset, DVCDataset, URLDataset]:
    """Retrieve a dataset by name with lock info.

    Args:
        name (str): name of the dataset to retrieve.

    Returns:
        One of DatachainDataset, DVCDataset, or URLDataset.

    Raises:
        DatasetNotFoundError: If no dataset matches the given name.
        ValueError: If the dataset is not in sync or missing lock info.
    """
    ...

Import

from dvc.api.dataset import get

I/O Contract

Inputs

Name | Type | Required | Description
name | str | Yes | Name of the dataset to retrieve from the current DVC repository. If the name is not found, the error message will suggest close matches.

Outputs

Return Type | Condition | Fields
DatachainDataset | dataset.type == "dc" | type ("dc"), name (str), version (int)
DVCDataset | dataset.type == "dvc" | type ("dvc"), url (str), path (str), sha (str)
URLDataset | dataset.type == "url" | type ("url"), files (list[str]), path (str)
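
Because every return type carries a type field, a caller can branch on it to reach the remaining fields. The sketch below shows one way to flatten a resolved dataset into arguments for a training script. The function name dataset_to_cli_args, the flag names, and the sample values are all hypothetical; only the field names come from the table above.

```python
from typing import Any, Mapping


def dataset_to_cli_args(ds: Mapping[str, Any]) -> list[str]:
    """Flatten a resolved dataset dict into (hypothetical) command-line flags."""
    kind = ds["type"]
    if kind == "dc":
        return ["--dc-name", ds["name"], "--dc-version", str(ds["version"])]
    if kind == "dvc":
        return ["--repo-url", ds["url"], "--path", ds["path"], "--rev", ds["sha"]]
    if kind == "url":
        args = ["--base-path", ds["path"]]
        # One flag per versioned file URL.
        for file_url in ds["files"]:
            args += ["--file", file_url]
        return args
    raise ValueError(f"unsupported dataset type: {kind!r}")


# Made-up sample shaped like a DVCDataset result.
sample = {
    "type": "dvc",
    "url": "https://example.com/repo.git",
    "path": "data/train",
    "sha": "abc123",
}
print(dataset_to_cli_args(sample))
```

Raising on an unknown type keeps the helper from silently dropping a dataset if a new type is introduced later.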

Usage Examples

Basic Usage

from dvc.api.dataset import get

# Retrieve a DVC-tracked dataset
ds = get("training-data")
if ds["type"] == "dvc":
    print(f"URL: {ds['url']}")
    print(f"Path: {ds['path']}")
    print(f"Locked SHA: {ds['sha']}")

# Retrieve a DataChain dataset
ds = get("my-dc-dataset")
if ds["type"] == "dc":
    print(f"Name: {ds['name']}, Version: {ds['version']}")

# Retrieve a URL-based dataset
ds = get("external-dataset")
if ds["type"] == "url":
    print(f"Base path: {ds['path']}")
    for file_url in ds["files"]:
        print(f"  File: {file_url}")
