Implementation:Iterative Dvc Dependency Dataset

From Leeroopedia


Knowledge Sources
Domains: Dependency_Management, Dataset_Management
Last Updated: 2026-02-10 10:00 GMT

Overview

DatasetDependency is a class defined in dvc/dependency/dataset.py (84 lines) that extends AbstractDependency. It manages dataset dependencies for DVC pipeline stages, allowing stages to declare dependencies on named datasets referenced via ds:// URIs.

from dvc.dependency.dataset import DatasetDependency

Source File

Property  Value
File      dvc/dependency/dataset.py
Lines     84
Class     DatasetDependency
Extends   AbstractDependency

Class: DatasetDependency

DatasetDependency inherits from AbstractDependency (imported from dvc.dependency.db) and represents a dependency on a DVC dataset. The dataset is identified by a ds:// URI scheme, where the netloc portion of the URL becomes the dataset name.

Class Attributes

Attribute       Value                  Description
PARAM_DATASET   "dataset"              Key used in hash info and schema
DATASET_SCHEMA  {PARAM_DATASET: dict}  Schema definition for dataset dependencies

Constructor

def __init__(self, stage: "Stage", p, info, *args, **kwargs)

Initializes the dataset dependency by:

  • Calling super().__init__(stage, info) on AbstractDependency
  • Setting self.def_path to the raw path string p
  • Extracting the dataset name from the URL's netloc via urlparse(p).netloc
  • Constructing a HashInfo object using the "dataset" hash name and the dataset info dictionary
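The constructor steps can be sketched without importing DVC. The class below is a hypothetical stand-in (SketchDatasetDep is not a DVC name), a plain (hash name, value) tuple stands in for HashInfo, and both the example URI and the info dictionary are illustrative values. Falling back to an empty dict when the "dataset" key is absent is an assumption.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

PARAM_DATASET = "dataset"


@dataclass
class SketchDatasetDep:
    """Hypothetical stand-in for illustration; not the DVC class."""

    def_path: str
    info: dict = field(default_factory=dict)

    def __post_init__(self):
        # The netloc of the ds:// URI becomes the dataset name.
        self.name = urlparse(self.def_path).netloc
        # A (hash name, value) tuple stands in for HashInfo here.
        # Assumption: a missing "dataset" entry falls back to an empty dict.
        self.hash_info = (PARAM_DATASET, self.info.get(PARAM_DATASET) or {})


# Illustrative values only.
dep = SketchDatasetDep("ds://training-images", {"dataset": {"type": "dvc"}})
print(dep.name)
print(dep.hash_info)
```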

Methods

is_dataset (classmethod)

@classmethod
def is_dataset(cls, p: str) -> bool

Returns True if the given path string p uses the ds URI scheme, identifying it as a dataset dependency.
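A minimal sketch of such a scheme check, assuming it is a comparison on the scheme returned by urllib.parse.urlparse. The function name looks_like_dataset and the sample paths are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_dataset(p: str) -> bool:
    """Return True when the path string uses the ds URI scheme."""
    return urlparse(p).scheme == "ds"


for candidate in ("ds://training-images", "data/train.csv", "s3://bucket/key"):
    print(candidate, looks_like_dataset(candidate))
```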

workspace_status

def workspace_status(self) -> dict

Checks synchronization status between the dependency's recorded hash info and the dataset's current lock state in the repository:

  • Returns {str(self): "not in sync"} if the dataset has no lock
  • Returns {str(self): "new"} if no lock can be derived from the stored info
  • Returns {str(self): "modified"} if the derived lock differs from the dataset lock
  • Returns an empty dict {} if everything is in sync
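The four outcomes above follow a fixed order of checks, which can be sketched as a standalone function. Here sketch_status is a hypothetical name, and plain dicts (or None) stand in for the lock objects that DVC would read from the repository and derive from the stored info.

```python
from typing import Optional


def sketch_status(
    name: str, dataset_lock: Optional[dict], derived_lock: Optional[dict]
) -> dict:
    """Mirror the documented order of checks using plain dicts as locks."""
    if not dataset_lock:
        # The dataset itself has no lock recorded.
        return {name: "not in sync"}
    if not derived_lock:
        # No lock could be derived from the dependency's stored info.
        return {name: "new"}
    if derived_lock != dataset_lock:
        return {name: "modified"}
    return {}


print(sketch_status("ds://training-images", {"rev": "b"}, {"rev": "a"}))
```

Because the missing dataset lock is checked first, a dependency with no stored info still reports "not in sync" rather than "new" when the dataset has no lock.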

get_hash

def get_hash(self) -> HashInfo

Retrieves the current hash for the dataset dependency from the repository's dataset registry. Raises DvcException if the dataset lock information is missing or invalidated, prompting the user to run dvc ds update.
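The failure path can be illustrated with a rough sketch. SketchDatasetError is a hypothetical stand-in for DvcException, sketch_get_hash is a hypothetical name, the error message is illustrative rather than DVC's actual wording, and a (hash name, value) tuple again stands in for HashInfo.

```python
from typing import Optional


class SketchDatasetError(Exception):
    """Hypothetical stand-in for dvc.exceptions.DvcException."""


def sketch_get_hash(name: str, lock: Optional[dict]) -> tuple:
    """Return a (hash name, value) pair, or raise when no lock is recorded."""
    if not lock:
        # Illustrative message only; not DVC's actual wording.
        raise SketchDatasetError(
            f"no lock information for dataset {name!r}; try 'dvc ds update'"
        )
    return ("dataset", dict(lock))


print(sketch_get_hash("training-images", {"rev": "abc123"}))
```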

save

def save(self) -> None

Saves the dependency by updating self.hash_info with the result of get_hash().

dumpd

def dumpd(self, **kwargs) -> dict

Serializes the dependency to a dictionary containing the path and hash info. The result is passed through funcy.compact, which drops entries whose values are falsy (such as None or an empty dict).
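The effect of compacting the serialized dictionary can be sketched as follows. To keep the snippet free of third-party dependencies, a dict comprehension stands in for funcy.compact, which drops dict items whose values are falsy. The function names and the "path" key are assumptions made for illustration.

```python
def compact_dict(d: dict) -> dict:
    """Drop items with falsy values, as funcy.compact does for a dict."""
    return {k: v for k, v in d.items() if v}


def sketch_dumpd(def_path: str, hash_name: str, hash_value: dict) -> dict:
    """Combine the path and hash info, then drop empty entries."""
    # Assumption: the path is stored under a "path" key.
    return compact_dict({"path": def_path, hash_name: hash_value})


print(sketch_dumpd("ds://training-images", "dataset", {"rev": "abc123"}))
print(sketch_dumpd("ds://training-images", "dataset", {}))
```

With an empty hash value, only the path entry survives, so a dependency that has not been saved yet serializes to its path alone in this sketch.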

fill_values

def fill_values(self, values=None) -> None

Dynamically merges additional values into the existing hash info using funcy.merge. This allows parameter values to be loaded and combined at runtime.
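The merge behaves like a left-to-right dictionary update, where later values override earlier ones. A minimal sketch, using dict unpacking as a stand-in for funcy.merge; the function name is hypothetical, and treating values=None as an empty dict is an assumption.

```python
from typing import Optional


def sketch_fill_values(current: dict, values: Optional[dict] = None) -> dict:
    """Merge new values over the current hash info value."""
    # Assumption: None is treated as "nothing to merge".
    return {**current, **(values or {})}


print(sketch_fill_values({"type": "dvc"}, {"rev": "abc123"}))
```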

download / update

Both download() and update() raise NotImplementedError, as dataset dependencies do not support direct downloading or updating through this interface.

Dependency Hierarchy

Dependency (dvc.dependency.base)
  +-- AbstractDependency (dvc.dependency.db)
        +-- DatasetDependency (dvc.dependency.dataset)

Key Dependencies

Module                                Usage
dvc_data.hashfile.hash_info.HashInfo  Hash representation for dataset state
funcy.compact                         Removes entries with falsy values (such as None) from dicts during serialization
funcy.merge                           Merges dictionaries for fill_values()
urllib.parse.urlparse                 Parses ds:// URIs to extract dataset names
dvc.exceptions.DvcException           Error handling for missing or invalid dataset info
