Implementation:Huggingface Datasets AbstractDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | IO, Architecture |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Abstract base classes defining the interface for all dataset readers and input streams, provided by the HuggingFace Datasets library.
Description
AbstractDatasetReader and AbstractDatasetInputStream are abstract base classes (ABCs) that establish the contract for reading datasets from file paths or input streams respectively. All concrete reader implementations in the datasets.io package (such as CsvDatasetReader, JsonDatasetReader, ParquetDatasetReader) inherit from AbstractDatasetReader.
AbstractDatasetReader accepts file paths along with configuration for splits, features, caching, streaming, and multiprocessing. When no split is given and path_or_paths is not a dict, the split defaults to "train". Its abstract read() method returns a Dataset, DatasetDict, IterableDataset, or IterableDatasetDict.
AbstractDatasetInputStream is a simpler variant that does not accept path arguments, intended for stream-based inputs. Its abstract read() method returns a Dataset or IterableDataset.
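The split-defaulting rule described above can be sketched as a small standalone function. This is only an illustration of the stated behavior, not the library's code: the name resolve_split is hypothetical, and the sketch assumes that any non-dict value (a single path or a list of paths) takes the "train" default when no split is passed.

```python
from typing import Any, Optional


def resolve_split(path_or_paths: Any, split: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper illustrating the default-split rule.

    An explicit split is kept as-is. When paths are given as a dict
    (split name -> paths), no default is applied. Otherwise the split
    falls back to "train".
    """
    if split or isinstance(path_or_paths, dict):
        return split
    return "train"
```

Under these assumptions, resolve_split("data.csv") yields "train", while a dict such as {"train": "a.csv", "test": "b.csv"} leaves the split unset so that each key can act as its own split.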
Usage
These ABCs are not used directly by end users. They serve as the foundation for all format-specific reader classes in the library. Knowing their contract helps when implementing a custom reader or studying how the library's readers are structured.
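Why the ABCs cannot be used directly follows from how Python's abc module works: a class with an unimplemented abstract method raises TypeError on instantiation. The sketch below shows this with stand-in classes. ReaderBase, RowsReader, and can_instantiate are hypothetical names chosen for illustration and are not part of the datasets library.

```python
from abc import ABC, abstractmethod


class ReaderBase(ABC):
    """Hypothetical stand-in for AbstractDatasetReader (not the library class)."""

    def __init__(self, path_or_paths=None):
        self.path_or_paths = path_or_paths

    @abstractmethod
    def read(self):
        """Subclasses must return the loaded data."""


class RowsReader(ReaderBase):
    """Hypothetical concrete reader that returns one row describing its path."""

    def read(self):
        return [{"path": self.path_or_paths}]


def can_instantiate(cls) -> bool:
    """Return False if Python refuses to construct cls because read() is abstract."""
    try:
        cls("data/file.dat")
    except TypeError:
        return False
    return True
```

Here can_instantiate(ReaderBase) is False and can_instantiate(RowsReader) is True, which is the same constraint a custom subclass of the real ABCs has to satisfy by implementing read().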
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/io/abc.py
- Lines: 1-53
Signature
class AbstractDatasetReader(ABC):
    def __init__(
        self,
        path_or_paths: Optional[NestedDataStructureLike[PathLike]] = None,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):
        ...  # constructor body omitted

    @abstractmethod
    def read(self) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
        pass

class AbstractDatasetInputStream(ABC):
    def __init__(
        self,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):
        ...  # constructor body omitted

    @abstractmethod
    def read(self) -> Union[Dataset, IterableDataset]:
        pass
Import
from datasets.io.abc import AbstractDatasetReader, AbstractDatasetInputStream
I/O Contract
AbstractDatasetReader Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_paths | Optional[NestedDataStructureLike[PathLike]] | No | Path(s) to the data file(s). Can be a single path, list of paths, or dict mapping split names to paths. |
| split | Optional[NamedSplit] | No | Name of the dataset split. Defaults to "train" when path_or_paths is not a dict. |
| features | Optional[Features] | No | Explicit feature schema to apply to the loaded data. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False. |
| streaming | bool | No | If True, returns an iterable dataset for streaming access. Defaults to False. |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | | No | Additional keyword arguments forwarded to the underlying builder. |
AbstractDatasetInputStream Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Features] | No | Explicit feature schema to apply to the loaded data. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory. Defaults to False. |
| streaming | bool | No | If True, returns an iterable dataset. Defaults to False. |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | | No | Additional keyword arguments forwarded to the underlying builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| (from AbstractDatasetReader.read()) | Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict] | The loaded dataset, whose concrete type depends on whether streaming is enabled and whether multiple splits are present. |
| (from AbstractDatasetInputStream.read()) | Union[Dataset, IterableDataset] | The loaded dataset, either map-style or iterable depending on the streaming parameter. |
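The relationship between the two conditions in the Outputs table and the four possible return types can be written out as a lookup. This is a sketch of the contract as described here, not library code: expected_return_type is a hypothetical helper, and it returns type names as strings so that it runs without the datasets package installed.

```python
def expected_return_type(streaming: bool, multiple_splits: bool) -> str:
    """Hypothetical helper: name of the type read() is expected to return.

    Streaming selects the iterable variant; multiple splits select the
    dict variant that maps split names to datasets.
    """
    base = "IterableDataset" if streaming else "Dataset"
    return f"{base}Dict" if multiple_splits else base
```

For AbstractDatasetInputStream only the single-split cases apply, so the result is either "Dataset" or "IterableDataset".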
Usage Examples
Implementing a Custom Reader
from datasets import Dataset
from datasets.io.abc import AbstractDatasetReader

def load_my_format(path):
    # Hypothetical parser, not part of the library: replace with real logic
    # that returns a dict mapping column names to lists of values.
    return {"path": [str(path)]}

class MyCustomReader(AbstractDatasetReader):
    def read(self):
        # Load the data referenced by self.path_or_paths
        data = load_my_format(self.path_or_paths)
        return Dataset.from_dict(data, features=self.features)

# Usage
reader = MyCustomReader("data/custom_file.dat", split="train")
dataset = reader.read()
Concrete Subclasses in the Library
# CsvDatasetReader, JsonDatasetReader, ParquetDatasetReader, etc.
# all extend AbstractDatasetReader with format-specific read() logic.
from datasets.io.csv import CsvDatasetReader
reader = CsvDatasetReader("data/train.csv", split="train")
dataset = reader.read()