Implementation:Iterative Dvc Api Dataset
| Knowledge Sources | |
|---|---|
| Domains | API, Dataset_Management |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Public API function for retrieving dataset definitions with lock information from a DVC repository, provided by the DVC library.
Description
The dvc/api/dataset.py module (67 lines) exposes a single public function, get, that retrieves a named dataset from the current DVC repository and returns a typed dictionary containing the dataset's resolved location and version details. The function supports three dataset types, each represented by a dedicated TypedDict.
DatachainDataset (type "dc") represents datasets managed by DataChain and carries the dataset name and version number. DVCDataset (type "dvc") represents datasets tracked in a DVC (Git) repository and carries the repository URL, the path inside it, and the locked Git SHA. URLDataset (type "url") represents datasets referenced by URL and carries the base path plus a list of versioned file URLs.
The get function opens the current DVC repository, looks up the dataset by name in repo.datasets, validates that the dataset is in sync and has lock information, and dispatches to the appropriate return type based on dataset.type. If the dataset name is not found, the function suggests close matches using difflib.get_close_matches. For URL-type datasets, the function resolves cloud filesystem classes and applies version IDs to construct fully-qualified URLs for each file.
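The close-match suggestion can be illustrated in isolation. The following is a minimal sketch, not the code in dvc/api/dataset.py: the helper name suggest_dataset_name and the sample dataset names are hypothetical, and only difflib.get_close_matches comes from the description above.

```python
from difflib import get_close_matches
from typing import Iterable, Optional


def suggest_dataset_name(name: str, known_names: Iterable[str]) -> Optional[str]:
    """Return the closest known dataset name, or None if nothing is similar.

    Hypothetical helper: it mirrors the idea of suggesting a close match,
    not the exact logic used inside dvc.api.dataset.get.
    """
    # n=1 keeps only the best candidate; the default similarity cutoff is 0.6.
    matches = get_close_matches(name, list(known_names), n=1)
    return matches[0] if matches else None


# A misspelled name is mapped to the most similar known name.
print(suggest_dataset_name("trainng-data", ["training-data", "eval-data"]))
```

A name with no sufficiently similar candidate yields None, in which case no suggestion would be shown.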
Usage
Use get when you need to programmatically resolve a dataset's location and version for downstream consumption -- for example, to pass dataset coordinates to a training script, to verify that a dataset is properly locked before a production run, or to construct direct download URLs for URL-type datasets.
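As one way to pass dataset coordinates to a training script, the returned dictionary can be flattened into command-line arguments. This is a sketch under stated assumptions: the function dataset_to_cli_args, every flag name, and the sample values are hypothetical, and a hand-written dictionary stands in for the result of get so the example runs without a DVC repository.

```python
from typing import Any, Mapping


def dataset_to_cli_args(ds: Mapping[str, Any]) -> list[str]:
    """Flatten a dataset dict into CLI arguments (all flag names are made up)."""
    if ds["type"] == "dvc":
        return ["--repo-url", ds["url"], "--path", ds["path"], "--rev", ds["sha"]]
    if ds["type"] == "dc":
        return ["--dc-name", ds["name"], "--dc-version", str(ds["version"])]
    if ds["type"] == "url":
        # The base path goes first, followed by each versioned file URL.
        return ["--base-path", ds["path"], *ds["files"]]
    raise ValueError(f"unexpected dataset type: {ds['type']!r}")


# Hand-written sample standing in for the result of get("training-data").
sample = {
    "type": "dvc",
    "url": "https://example.com/repo.git",
    "path": "data/train",
    "sha": "abc123",
}
print(dataset_to_cli_args(sample))
```

Branching on the "type" key works because each of the three return shapes carries a distinct literal value for it.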
Code Reference
Source Location
- Repository: DVC
- File: dvc/api/dataset.py
- Lines: L1-67
Signature
from typing import Literal, TypedDict, Union


class DatachainDataset(TypedDict):
    type: Literal["dc"]
    name: str
    version: int


class DVCDataset(TypedDict):
    type: Literal["dvc"]
    url: str
    path: str
    sha: str


class URLDataset(TypedDict):
    type: Literal["url"]
    files: list[str]
    path: str


def get(name: str) -> Union[DatachainDataset, DVCDataset, URLDataset]:
    """Retrieve a dataset by name with lock info.

    Args:
        name (str): name of the dataset to retrieve.

    Returns:
        One of DatachainDataset, DVCDataset, or URLDataset.

    Raises:
        DatasetNotFoundError: If no dataset matches the given name.
        ValueError: If the dataset is not in sync or missing lock info.
    """
    ...
Import
from dvc.api.dataset import get
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the dataset to retrieve from the current DVC repository. If the name is not found, the error message will suggest close matches. |
Outputs
| Return Type | Condition | Fields |
|---|---|---|
| DatachainDataset | dataset.type == "dc" | type ("dc"), name (str), version (int) |
| DVCDataset | dataset.type == "dvc" | type ("dvc"), url (str), path (str), sha (str) |
| URLDataset | dataset.type == "url" | type ("url"), files (list[str]), path (str) |
Usage Examples
Basic Usage
from dvc.api.dataset import get

# Retrieve a DVC-tracked dataset
ds = get("training-data")
if ds["type"] == "dvc":
    print(f"URL: {ds['url']}")
    print(f"Path: {ds['path']}")
    print(f"Locked SHA: {ds['sha']}")

# Retrieve a DataChain dataset
ds = get("my-dc-dataset")
if ds["type"] == "dc":
    print(f"Name: {ds['name']}, Version: {ds['version']}")

# Retrieve a URL-based dataset
ds = get("external-dataset")
if ds["type"] == "url":
    print(f"Base path: {ds['path']}")
    for file_url in ds["files"]:
        print(f"  File: {file_url}")
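Checking Lock Status

Because get is documented to raise ValueError when a dataset is out of sync or has no lock information, a caller can treat that exception as a "not ready" signal before a production run. The sketch below is illustrative only: locked_dataset_or_none and fake_get are hypothetical names, and fake_get stands in for dvc.api.dataset.get so the example runs without a repository.

```python
from typing import Any, Callable, Mapping, Optional


def locked_dataset_or_none(
    name: str, getter: Callable[[str], Mapping[str, Any]]
) -> Optional[Mapping[str, Any]]:
    """Return the resolved dataset, or None when the getter raises ValueError."""
    try:
        return getter(name)
    except ValueError as exc:
        # Out-of-sync or unlocked datasets are reported and skipped.
        print(f"dataset {name!r} is not ready: {exc}")
        return None


def fake_get(name: str) -> Mapping[str, Any]:
    # Stand-in for dvc.api.dataset.get (hypothetical behaviour for the demo).
    if name == "stale":
        raise ValueError("missing lock information")
    return {"type": "dc", "name": name, "version": 1}


print(locked_dataset_or_none("my-dc-dataset", fake_get))
print(locked_dataset_or_none("stale", fake_get))
```

In real use, the actual get function would be passed as getter. A missing dataset name raises a different exception (DatasetNotFoundError), which this sketch deliberately lets propagate.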