Implementation:Huggingface Datasets ArrowReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
ArrowReader is the HuggingFace Datasets component that reads cached Arrow IPC files into memory-mapped (or, optionally, in-memory) tables for Dataset construction.
Description
ArrowReader is a subclass of BaseReader that reads Apache Arrow IPC files from a cache directory and produces the keyword arguments needed to construct a Dataset object. It sets the file-type suffix to "arrow" and implements _get_table_from_filename, which opens an individual shard file using either memory mapping or in-memory reading and applies skip/take slicing to the resulting table. The inherited read method computes file instructions from the split metadata and the requested read instruction (both percentage-based and absolute slicing are supported). It then calls read_files, which loads all relevant shards in parallel using thread mapping, concatenates the resulting tables, and returns a dictionary containing the Arrow table, the dataset info, and the split name.
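The skip/take slicing described above can be illustrated with a small standalone sketch. It is not the library's code: the FileInstruction record, the percent_to_abs and shard_instructions helpers, the shard file names, and the simple rounding rule are all assumptions made for illustration. It only shows how a global row range could be translated into per-shard skip and take values.

```python
from typing import NamedTuple


class FileInstruction(NamedTuple):
    """Hypothetical record describing which rows to keep from one shard."""

    filename: str
    skip: int  # rows to drop from the start of the shard
    take: int  # rows to keep after skipping


def percent_to_abs(percent: int, num_examples: int) -> int:
    """Convert a percentage boundary to a row index (simple rounding, an assumption)."""
    return round(percent * num_examples / 100)


def shard_instructions(
    shards: list[tuple[str, int]], start: int, stop: int
) -> list[FileInstruction]:
    """Map the global row range [start, stop) onto (filename, skip, take) per shard."""
    instructions = []
    offset = 0  # global index of the first row of the current shard
    for filename, length in shards:
        shard_start, shard_stop = offset, offset + length
        offset = shard_stop
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo >= hi:
            continue  # the requested range does not touch this shard
        instructions.append(FileInstruction(filename, lo - shard_start, hi - lo))
    return instructions


# Hypothetical shard names and lengths (10 rows in total).
shards = [("shard-0.arrow", 4), ("shard-1.arrow", 4), ("shard-2.arrow", 2)]
total = sum(length for _, length in shards)

# Roughly what a "[:50%]" slice would select under the assumptions above.
first_half = shard_instructions(shards, 0, percent_to_abs(50, total))

# An absolute slice covering global rows 3 to 8 inclusive.
middle = shard_instructions(shards, 3, 9)
```

Shards that fall entirely outside the requested range produce no instruction, so they would never need to be opened.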
Usage
ArrowReader is used internally by DatasetBuilder._as_dataset when constructing a Dataset from prepared Arrow files. It is not typically instantiated directly by end users, but understanding its behavior is useful for debugging cache issues or implementing custom dataset builders.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_reader.py
- Lines: L285-L314 (ArrowReader class), with base class at L167-L282
Signature
class ArrowReader(BaseReader):
    """
    Build a Dataset object out of Instruction instance(s).
    This Reader uses either memory mapping or file descriptors (in-memory) on arrow files.
    """

    def __init__(self, path: str, info: Optional["DatasetInfo"]):
Key inherited methods:
def read(
    self,
    name,
    instructions,
    split_infos,
    in_memory=False,
):

def read_files(
    self,
    files: list[dict],
    original_instructions: Union[None, "ReadInstruction", "Split"] = None,
    in_memory=False,
):
Import
from datasets.arrow_reader import ArrowReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the directory where Arrow shard files are stored. |
| info | Optional[DatasetInfo] | Yes | Dataset information including feature schemas and split metadata. The argument must be passed, but the signature allows None. |
read() method inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the dataset (used to locate shard files). |
| instructions | ReadInstruction or str | Yes | Instructions specifying which split/slice to read (e.g., "train", "train[:10%]"). |
| split_infos | list[SplitInfo] | Yes | Available split information for the dataset (filenames, lengths, shard lengths). |
| in_memory | bool | No | Whether to copy the data in-memory. Defaults to False (memory-mapped). |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value from read) | dict | Dictionary with keys "arrow_table" (a concatenated Arrow table), "info" (DatasetInfo), and "split" (NamedSplit). These are passed as keyword arguments to the Dataset constructor. |
Usage Examples
Basic Usage
from datasets import Dataset
from datasets.arrow_reader import ArrowReader

# Typically used internally by DatasetBuilder._as_dataset.
# cache_dir and dataset_info are assumed to be defined already:
# cache_dir is the directory holding the prepared Arrow shards,
# dataset_info is the matching DatasetInfo (including split metadata).
reader = ArrowReader(cache_dir, dataset_info)
dataset_kwargs = reader.read(
    name="rotten_tomatoes",
    instructions="train",
    split_infos=dataset_info.splits.values(),
    in_memory=False,
)

# dataset_kwargs contains: {"arrow_table": ..., "info": ..., "split": ...}
ds = Dataset(**dataset_kwargs)