Implementation:Iterative Dvc Dependency Dataset
| Knowledge Sources | |
|---|---|
| Domains | Dependency_Management, Dataset_Management |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
DatasetDependency is a class defined in dvc/dependency/dataset.py (84 lines) that extends AbstractDependency. It manages dataset dependencies for DVC pipeline stages, allowing stages to declare dependencies on named datasets referenced via ds:// URIs.
from dvc.dependency.dataset import DatasetDependency
Source File
| Property | Value |
|---|---|
| File | dvc/dependency/dataset.py |
| Lines | 84 |
| Class | DatasetDependency |
| Extends | AbstractDependency |
Class: DatasetDependency
DatasetDependency inherits from AbstractDependency (imported from dvc.dependency.db) and represents a dependency on a DVC dataset. The dataset is identified by a ds:// URI scheme, where the netloc portion of the URL becomes the dataset name.
Class Attributes
| Attribute | Value | Description |
|---|---|---|
| PARAM_DATASET | "dataset" | Key used in hash info and schema |
| DATASET_SCHEMA | {PARAM_DATASET: dict} | Schema definition for dataset dependencies |
Constructor
def __init__(self, stage: "Stage", p, info, *args, **kwargs)
Initializes the dataset dependency by:
- Calling super().__init__(stage, info) on AbstractDependency
- Setting self.def_path to the raw path string p
- Extracting the dataset name from the URL's netloc via urlparse(p).netloc
- Constructing a HashInfo object using the "dataset" hash name and the dataset info dictionary
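The constructor steps above can be sketched without DVC installed. This is a minimal illustration, not the class itself: HashInfoSketch and DatasetDependencySketch are hypothetical stand-ins, the stage argument and the super().__init__ call are left out, and the fallback to an empty dict when no dataset info is present is an assumption.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class HashInfoSketch:
    """Hypothetical stand-in for HashInfo: just a hash name and a value."""

    name: str
    value: dict = field(default_factory=dict)


class DatasetDependencySketch:
    """Illustrative only: mirrors the constructor steps, minus the stage."""

    PARAM_DATASET = "dataset"

    def __init__(self, p: str, info: dict):
        self.def_path = p               # raw path string, e.g. "ds://mydataset"
        self.name = urlparse(p).netloc  # dataset name is the URL's netloc
        # Assumption: missing dataset info falls back to an empty dict.
        dataset_info = info.get(self.PARAM_DATASET) or {}
        self.hash_info = HashInfoSketch(self.PARAM_DATASET, dataset_info)


dep = DatasetDependencySketch("ds://mydataset", {"dataset": {"type": "dvc"}})
print(dep.name)            # mydataset
print(dep.hash_info.name)  # dataset
```

The dataset name "mydataset" and the info dictionary contents are made up for the example.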
Methods
is_dataset (classmethod)
@classmethod
def is_dataset(cls, p: str) -> bool
Returns True if the given path string p uses the ds URI scheme, identifying it as a dataset dependency.
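A scheme check of this kind can be reproduced with the standard library alone. The function below is a standalone stand-in rather than the DVC classmethod, and it assumes the check compares the parsed scheme against "ds"; the example paths are invented.

```python
from urllib.parse import urlparse


def is_dataset(p: str) -> bool:
    """Standalone stand-in for the classmethod (assumed logic)."""
    return urlparse(p).scheme == "ds"


print(is_dataset("ds://mydataset"))   # True
print(is_dataset("s3://bucket/key"))  # False
print(is_dataset("data/train.csv"))   # False
```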
workspace_status
def workspace_status(self) -> dict
Checks synchronization status between the dependency's recorded hash info and the dataset's current lock state in the repository:
- Returns {str(self): "not in sync"} if the dataset has no lock
- Returns {str(self): "new"} if no lock can be derived from the stored info
- Returns {str(self): "modified"} if the derived lock differs from the dataset lock
- Returns an empty dict {} if everything is in sync
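The four outcomes above amount to a short decision chain. The sketch below expresses it as a pure function so it runs without a repository; modelling locks as plain dicts and the function name workspace_status_sketch are assumptions made for illustration.

```python
from typing import Optional


def workspace_status_sketch(
    key: str,
    dataset_lock: Optional[dict],
    derived_lock: Optional[dict],
) -> dict:
    """Map the two lock states onto the status values described above.

    key          -- stands in for str(self), e.g. "ds://mydataset"
    dataset_lock -- the dataset's current lock in the repository (or None)
    derived_lock -- the lock derived from the stored hash info (or None)
    """
    if not dataset_lock:
        return {key: "not in sync"}
    if not derived_lock:
        return {key: "new"}
    if derived_lock != dataset_lock:
        return {key: "modified"}
    return {}


current = {"rev": "v2"}
print(workspace_status_sketch("ds://mydataset", current, {"rev": "v1"}))
# {'ds://mydataset': 'modified'}
```

The order of the checks matters: a missing dataset lock is reported before the stored info is examined at all.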
get_hash
def get_hash(self) -> HashInfo
Retrieves the current hash for the dataset dependency from the repository's dataset registry. Raises DvcException if the dataset lock information is missing or invalidated, prompting the user to run dvc ds update.
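The failure modes can be sketched as follows. Everything here is a stand-in: DvcExceptionSketch replaces the real exception class, the invalidated flag and the lock-as-dict model are assumptions, the return value is a plain tuple rather than a HashInfo, and the message wording is illustrative rather than DVC's actual text.

```python
from typing import Optional


class DvcExceptionSketch(Exception):
    """Hypothetical stand-in for dvc.exceptions.DvcException."""


def get_hash_sketch(name: str, lock: Optional[dict], invalidated: bool = False):
    """Return a ("dataset", lock) pair, or raise if the lock is unusable."""
    if not lock:
        if invalidated:
            # Illustrative wording only; points the user at the update command.
            raise DvcExceptionSketch(
                f"dataset {name!r} is out of sync; run 'dvc ds update {name}'"
            )
        raise DvcExceptionSketch(f"no lock information for dataset {name!r}")
    return ("dataset", dict(lock))


print(get_hash_sketch("mydataset", {"rev": "v1"}))
# ('dataset', {'rev': 'v1'})
```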
save
def save(self) -> None
Saves the dependency by updating self.hash_info with the result of get_hash().
dumpd
def dumpd(self, **kwargs) -> dict
Serializes the dependency to a dictionary containing the path and hash info, using funcy.compact to drop entries whose values are falsy (such as None or an empty dict).
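The effect of compacting the serialized dictionary can be shown with a standard-library stand-in. compact_sketch imitates funcy.compact on a dict by dropping falsy values; the "path" key name and the sample hash dictionary are assumptions for illustration.

```python
def compact_sketch(d: dict) -> dict:
    """Stdlib imitation of funcy.compact for dicts: drop falsy values."""
    return {k: v for k, v in d.items() if v}


def dumpd_sketch(def_path: str, hash_dict: dict) -> dict:
    """Combine the path with the hash info, then drop empty entries."""
    return compact_sketch({"path": def_path, **hash_dict})


print(dumpd_sketch("ds://mydataset", {"dataset": {"rev": "v1"}}))
# {'path': 'ds://mydataset', 'dataset': {'rev': 'v1'}}
print(dumpd_sketch("ds://mydataset", {"dataset": None}))
# {'path': 'ds://mydataset'}
```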
fill_values
def fill_values(self, values=None) -> None
Dynamically merges additional values into the existing hash info using funcy.merge. This allows parameter values to be loaded and combined at runtime.
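The merge step behaves like a left-to-right dictionary update, where later values win on key collisions. The sketch below imitates funcy.merge for dicts using only the standard library; the function names and sample values are hypothetical.

```python
from typing import Optional


def merge_sketch(*dicts: dict) -> dict:
    """Stdlib imitation of funcy.merge for dicts: later dicts take priority."""
    merged: dict = {}
    for d in dicts:
        merged.update(d)
    return merged


def fill_values_sketch(current: dict, values: Optional[dict] = None) -> dict:
    """Return the hash info value with any extra values merged in."""
    return merge_sketch(current, values or {})


stored = {"type": "dvc", "rev": "v1"}
print(fill_values_sketch(stored, {"rev": "v2"}))
# {'type': 'dvc', 'rev': 'v2'}
print(fill_values_sketch(stored))
# {'type': 'dvc', 'rev': 'v1'}
```

Because a new dictionary is built each time, the stored value is not mutated in this sketch.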
download / update
Both download() and update() raise NotImplementedError, as dataset dependencies do not support direct downloading or updating through this interface.
Dependency Hierarchy
Dependency (dvc.dependency.base)
+-- AbstractDependency (dvc.dependency.db)
    +-- DatasetDependency (dvc.dependency.dataset)
Key Dependencies
| Module | Usage |
|---|---|
| dvc_data.hashfile.hash_info.HashInfo | Hash representation for dataset state |
| funcy.compact | Removes falsy values (such as None) from dicts during serialization |
| funcy.merge | Merges dictionaries for fill_values() |
| urllib.parse.urlparse | Parses ds:// URIs to extract dataset names |
| dvc.exceptions.DvcException | Error handling for missing or invalid dataset info |
See Also
- Implementation:Dependency_Db -- AbstractDependency base class and DbDependency
- Implementation:Dependency_Repo -- Cross-repository dependency handling