Implementation:Huggingface Datasets AbstractDatasetReader

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_AbstractDatasetReader.md)
Knowledge Sources
Domains: IO, Architecture
Last Updated: 2026-02-14 18:00 GMT

Overview

Abstract base classes defining the interface for all dataset readers and input streams, provided by the HuggingFace Datasets library.

Description

AbstractDatasetReader and AbstractDatasetInputStream are abstract base classes (ABCs) that establish the contract for reading datasets from file paths and from input streams, respectively. The concrete file-based readers in the datasets.io package (such as CsvDatasetReader, JsonDatasetReader, and ParquetDatasetReader) inherit from AbstractDatasetReader.

AbstractDatasetReader accepts one or more file paths along with configuration for splits, features, caching, streaming, and multiprocessing. When no split is given and path_or_paths is not a dict, the split defaults to "train". Its abstract read() method returns a Dataset, DatasetDict, IterableDataset, or IterableDatasetDict.
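
The split-defaulting rule can be sketched in isolation. The function below is a hypothetical stand-in written for illustration, not the library's code: the name resolve_split is invented, and it assumes that an explicit split always wins, that a dict of paths leaves the split unset, and that a list of paths is treated like a single path.

```python
from typing import Optional, Union


def resolve_split(
    path_or_paths: Union[str, list, dict, None],
    split: Optional[str] = None,
) -> Optional[str]:
    """Illustrative stand-in for the split-defaulting rule (assumed behavior)."""
    # An explicit split wins; a dict of paths is assumed to carry its own split names.
    if split or isinstance(path_or_paths, dict):
        return split
    # Otherwise fall back to the "train" split.
    return "train"


single = resolve_split("data/train.csv")
several = resolve_split(["data/a.csv", "data/b.csv"])
mapping = resolve_split({"train": "data/a.csv", "test": "data/b.csv"})
explicit = resolve_split("data/dev.csv", split="validation")
print(single, several, mapping, explicit)
```

Under these assumptions, only the dict case leaves the split unset, because the dict keys already name the splits.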

AbstractDatasetInputStream is a simpler variant that does not accept path arguments, intended for stream-based inputs. Its abstract read() method returns a Dataset or IterableDataset.

Usage

End users do not instantiate these ABCs directly; they are the foundation for every format-specific reader class in the library. They are worth knowing when implementing a custom reader or tracing how the reader architecture fits together.
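
To illustrate why the base classes cannot be used on their own, here is a minimal sketch built only on Python's abc module. ReaderSketch and InMemoryReader are hypothetical names that mimic the pattern; they are not part of the datasets library.

```python
from abc import ABC, abstractmethod


class ReaderSketch(ABC):
    """Hypothetical stand-in for a reader base class; not the library's code."""

    def __init__(self, path_or_paths=None, split=None, **kwargs):
        # Store the configuration so subclasses can use it inside read().
        self.path_or_paths = path_or_paths
        self.split = split
        self.kwargs = kwargs

    @abstractmethod
    def read(self):
        """Subclasses must return the loaded data."""


class InMemoryReader(ReaderSketch):
    """Toy subclass that returns fixed rows instead of parsing a file."""

    def read(self):
        return {"text": ["first row", "second row"]}


def try_instantiate_base() -> str:
    # A class with an unimplemented abstract method raises TypeError on instantiation.
    try:
        ReaderSketch("data/file.dat")
    except TypeError:
        return "TypeError"
    return "instantiated"


base_result = try_instantiate_base()
rows = InMemoryReader("data/file.dat", split="train").read()
print(base_result, rows)
```

The same mechanism applies to any ABC with an abstract read() method: only subclasses that implement read() can be constructed.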

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/io/abc.py
  • Lines: 1-53

Signature

class AbstractDatasetReader(ABC):
    def __init__(
        self,
        path_or_paths: Optional[NestedDataStructureLike[PathLike]] = None,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    @abstractmethod
    def read(self) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
        pass


class AbstractDatasetInputStream(ABC):
    def __init__(
        self,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    @abstractmethod
    def read(self) -> Union[Dataset, IterableDataset]:
        pass

Import

from datasets.io.abc import AbstractDatasetReader, AbstractDatasetInputStream

I/O Contract

AbstractDatasetReader Inputs

Name | Type | Required | Description
path_or_paths | Optional[NestedDataStructureLike[PathLike]] | No | Path(s) to the data file(s). Can be a single path, list of paths, or dict mapping split names to paths.
split | Optional[NamedSplit] | No | Name of the dataset split. Defaults to "train" when path_or_paths is not a dict.
features | Optional[Features] | No | Explicit feature schema to apply to the loaded data.
cache_dir | str | No | Directory for caching the processed dataset.
keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False.
streaming | bool | No | If True, returns an iterable dataset for streaming access. Defaults to False.
num_proc | Optional[int] | No | Number of processes for parallel dataset preparation.
**kwargs | (untyped) | No | Additional keyword arguments forwarded to the underlying builder.

AbstractDatasetInputStream Inputs

Name | Type | Required | Description
features | Optional[Features] | No | Explicit feature schema to apply to the loaded data.
cache_dir | str | No | Directory for caching the processed dataset.
keep_in_memory | bool | No | Whether to keep the dataset in memory. Defaults to False.
streaming | bool | No | If True, returns an iterable dataset. Defaults to False.
num_proc | Optional[int] | No | Number of processes for parallel dataset preparation.
**kwargs | (untyped) | No | Additional keyword arguments forwarded to the underlying builder.

Outputs

Source | Type | Description
AbstractDatasetReader.read() | Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict] | The loaded dataset, whose concrete type depends on whether streaming is enabled and whether multiple splits are present.
AbstractDatasetInputStream.read() | Union[Dataset, IterableDataset] | The loaded dataset, either map-style or iterable depending on the streaming parameter.
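
The dependence of the return type on streaming and on the number of splits can be written as a small lookup. This is only a sketch of the description above: expected_return_type is a hypothetical helper, and the exact conditions each concrete reader uses to decide between the types are an assumption.

```python
def expected_return_type(streaming: bool, multiple_splits: bool) -> str:
    """Name the return type implied by the Outputs description (illustrative only)."""
    # Streaming selects the iterable flavor; otherwise the map-style flavor.
    base = "IterableDataset" if streaming else "Dataset"
    # Multiple splits are assumed to produce the dict-of-splits variant.
    return (base + "Dict") if multiple_splits else base


for streaming in (False, True):
    for multiple_splits in (False, True):
        print(streaming, multiple_splits, expected_return_type(streaming, multiple_splits))
```

AbstractDatasetInputStream.read() covers only the single-split row of this lookup, since its declared return type has no dict variants.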

Usage Examples

Implementing a Custom Reader

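from datasets import Dataset
from datasets.io.abc import AbstractDatasetReader


def load_my_format(path):
    # Hypothetical parser, for illustration only: one record per line,
    # returned as a column mapping that Dataset.from_dict can consume.
    with open(path, encoding="utf-8") as f:
        return {"text": [line.rstrip("\n") for line in f]}


class MyCustomReader(AbstractDatasetReader):
    def read(self):
        # Custom logic to load data from self.path_or_paths
        data = load_my_format(self.path_or_paths)
        return Dataset.from_dict(data, features=self.features)


# Usage (assumes data/custom_file.dat exists)
reader = MyCustomReader("data/custom_file.dat", split="train")
dataset = reader.read()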

Concrete Subclasses in the Library

# CsvDatasetReader, JsonDatasetReader, ParquetDatasetReader, etc.
# all extend AbstractDatasetReader with format-specific read() logic.
from datasets.io.csv import CsvDatasetReader

reader = CsvDatasetReader("data/train.csv", split="train")
dataset = reader.read()

Related Pages

Implements Principle
