Implementation:Huggingface Datasets AbstractDatasetReader

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	IO, Architecture
Last Updated	2026-02-14 18:00 GMT

Overview

Abstract base classes defining the interface for all dataset readers and input streams, provided by the HuggingFace Datasets library.

Description

AbstractDatasetReader and AbstractDatasetInputStream are abstract base classes (ABCs) that establish the contract for reading datasets from file paths or input streams respectively. All concrete reader implementations in the datasets.io package (such as CsvDatasetReader, JsonDatasetReader, ParquetDatasetReader) inherit from AbstractDatasetReader.

AbstractDatasetReader accepts file paths along with configuration for splits, features, caching, streaming, and multiprocessing. When no split is given and path_or_paths is not a dict, it defaults the split to "train". Its abstract read() method returns a Dataset, DatasetDict, IterableDataset, or IterableDatasetDict.
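
The split-defaulting rule described above can be sketched without importing the library. This is a minimal stand-alone illustration, assuming the rule is "keep an explicit split, impose none for a dict of paths, otherwise use "train""; resolve_split and PathsLike are hypothetical names, not part of datasets.

```python
from typing import Optional, Union

# Hypothetical alias for "a path, a list of paths, or a dict of split -> path".
PathsLike = Union[str, list, dict, None]


def resolve_split(path_or_paths: PathsLike, split: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper mirroring the split default described above."""
    if split:
        # An explicit split always wins.
        return split
    if isinstance(path_or_paths, dict):
        # A dict already maps split names to paths, so no single split is imposed.
        return None
    # A single path or a list of paths falls back to "train".
    return "train"


print(resolve_split("data/train.csv"))                      # train
print(resolve_split(["a.csv", "b.csv"], split="test"))      # test
print(resolve_split({"train": "a.csv", "test": "b.csv"}))   # None
```

Leaving the split unset for a dict is what allows a reader to return one dataset per split rather than a single one.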

AbstractDatasetInputStream is a simpler variant that does not accept path arguments, intended for stream-based inputs. Its abstract read() method returns a Dataset or IterableDataset.
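
To make the path-free contract concrete, here is a rough stand-alone sketch of a stream-style reader. InputStreamSketch and GeneratorStreamSketch are hypothetical stand-ins written for this page; a plain list stands in for a map-style Dataset and an iterator for an IterableDataset, so none of this is the library's own code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator, List, Union


class InputStreamSketch(ABC):
    """Stand-in with the same shape as the stream ABC: no path arguments."""

    def __init__(self, streaming: bool = False, **kwargs):
        self.streaming = streaming
        self.kwargs = kwargs

    @abstractmethod
    def read(self) -> Union[List[dict], Iterator[dict]]:
        ...


class GeneratorStreamSketch(InputStreamSketch):
    """Hypothetical subclass whose source is a generator function, not a file."""

    def __init__(self, generator: Callable[[], Iterable[dict]], **kwargs):
        super().__init__(**kwargs)
        self.generator = generator

    def read(self):
        rows = self.generator()
        # Streaming keeps the rows lazy; otherwise they are materialized.
        return iter(rows) if self.streaming else list(rows)


def rows():
    yield {"id": 0}
    yield {"id": 1}


eager = GeneratorStreamSketch(rows).read()
lazy = GeneratorStreamSketch(rows, streaming=True).read()
```

The subclass adds its own source argument (a callable here) because the base constructor deliberately carries no notion of a file path.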

Usage

These ABCs are not used directly by end users. They serve as the foundation for all format-specific reader classes in the library, so they matter mainly when implementing a custom reader or studying the reader architecture.
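
The reason the base classes cannot be used directly is standard Python ABC behavior: a class with an unimplemented abstract method cannot be instantiated. The sketch below shows this with stand-in classes (ReaderSketch, ListReader, and can_instantiate are hypothetical names), so it runs without the library installed.

```python
from abc import ABC, abstractmethod


class ReaderSketch(ABC):
    """Stand-in for an abstract reader; not the library class."""

    def __init__(self, path_or_paths=None, **kwargs):
        self.path_or_paths = path_or_paths
        self.kwargs = kwargs

    @abstractmethod
    def read(self):
        ...


class ListReader(ReaderSketch):
    """Hypothetical concrete reader that returns its paths as a list."""

    def read(self):
        paths = self.path_or_paths
        return list(paths) if isinstance(paths, (list, tuple)) else [paths]


def can_instantiate(cls) -> bool:
    try:
        cls("data/file.dat")
        return True
    except TypeError:
        # Raised when read() is still abstract on cls.
        return False


print(can_instantiate(ReaderSketch))  # False
print(can_instantiate(ListReader))    # True
```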

Code Reference

Source Location

Repository: datasets
File: src/datasets/io/abc.py
Lines: 1-53

Signature

class AbstractDatasetReader(ABC):
    def __init__(
        self,
        path_or_paths: Optional[NestedDataStructureLike[PathLike]] = None,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    @abstractmethod
    def read(self) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
        pass

class AbstractDatasetInputStream(ABC):
    def __init__(
        self,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    @abstractmethod
    def read(self) -> Union[Dataset, IterableDataset]:
        pass

Import

from datasets.io.abc import AbstractDatasetReader, AbstractDatasetInputStream

I/O Contract

AbstractDatasetReader Inputs

Name	Type	Required	Description
path_or_paths	`Optional[NestedDataStructureLike[PathLike]]`	No	Path(s) to the data file(s). Can be a single path, list of paths, or dict mapping split names to paths.
split	`Optional[NamedSplit]`	No	Name of the dataset split. Defaults to `"train"` when no split is given and path_or_paths is not a dict.
features	`Optional[Features]`	No	Explicit feature schema to apply to the loaded data.
cache_dir	`str`	No	Directory for caching the processed dataset.
keep_in_memory	`bool`	No	Whether to keep the dataset in memory instead of memory-mapping. Defaults to `False`.
streaming	`bool`	No	If `True`, returns an iterable dataset for streaming access. Defaults to `False`.
num_proc	`Optional[int]`	No	Number of processes for parallel dataset preparation.
**kwargs		No	Additional keyword arguments, stored on the instance as `self.kwargs`; concrete readers typically forward them to the underlying builder.

AbstractDatasetInputStream Inputs

Name	Type	Required	Description
features	`Optional[Features]`	No	Explicit feature schema to apply to the loaded data.
cache_dir	`str`	No	Directory for caching the processed dataset.
keep_in_memory	`bool`	No	Whether to keep the dataset in memory. Defaults to `False`.
streaming	`bool`	No	If `True`, returns an iterable dataset. Defaults to `False`.
num_proc	`Optional[int]`	No	Number of processes for parallel dataset preparation.
**kwargs		No	Additional keyword arguments, stored on the instance as `self.kwargs`; concrete subclasses typically forward them to the underlying builder.

Outputs

Name	Type	Description
(from `AbstractDatasetReader.read()`)	`Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]`	The loaded dataset, whose concrete type depends on whether streaming is enabled and whether multiple splits are present.
(from `AbstractDatasetInputStream.read()`)	`Union[Dataset, IterableDataset]`	The loaded dataset, either map-style or iterable depending on the streaming parameter.
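
As a quick reference, the return-type rule in the table above can be written as a lookup. This is only a sketch of the mapping implied by the descriptions; expected_read_type is a hypothetical helper, and the type names are plain strings so the snippet runs without the library.

```python
def expected_read_type(streaming: bool, multiple_splits: bool) -> str:
    """Hypothetical lookup: which type AbstractDatasetReader.read() is expected to return."""
    table = {
        (False, False): "Dataset",
        (False, True): "DatasetDict",
        (True, False): "IterableDataset",
        (True, True): "IterableDatasetDict",
    }
    return table[(streaming, multiple_splits)]


print(expected_read_type(streaming=False, multiple_splits=False))  # Dataset
print(expected_read_type(streaming=True, multiple_splits=True))    # IterableDatasetDict
```

AbstractDatasetInputStream.read() covers only the single-split cases, that is, the two entries where multiple_splits is False.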

Usage Examples

Implementing a Custom Reader

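from datasets import Dataset
from datasets.io.abc import AbstractDatasetReader


def load_my_format(path):
    # Hypothetical parser for illustration: one text record per line,
    # returned as a column dict that Dataset.from_dict can consume.
    with open(path, encoding="utf-8") as f:
        return {"text": [line.rstrip("\n") for line in f]}


class MyCustomReader(AbstractDatasetReader):
    def read(self):
        # self.path_or_paths and self.features are set by AbstractDatasetReader.__init__
        data = load_my_format(self.path_or_paths)
        return Dataset.from_dict(data, features=self.features)


# Usage (assumes data/custom_file.dat exists)
reader = MyCustomReader("data/custom_file.dat", split="train")
dataset = reader.read()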

Concrete Subclasses in the Library

# CsvDatasetReader, JsonDatasetReader, ParquetDatasetReader, etc.
# all extend AbstractDatasetReader with format-specific read() logic.
from datasets.io.csv import CsvDatasetReader

reader = CsvDatasetReader("data/train.csv", split="train")
dataset = reader.read()

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Abstract_Dataset_IO
