Implementation:Huggingface Datasets ArrowReader

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for reading cached Arrow IPC files into memory-mapped (or in-memory) Arrow tables used to construct a Dataset.

Description

ArrowReader is a subclass of BaseReader that reads Apache Arrow IPC files from a cache directory and produces the keyword arguments needed to construct a Dataset object. It sets the filetype suffix to "arrow" and implements _get_table_from_filename, which opens an individual shard file using either memory mapping or in-memory reading and then applies skip/take slicing to the resulting table. The inherited read method computes file instructions from the split metadata and the requested read instruction (both percentage-based and absolute slicing are supported). It then calls read_files, which loads all relevant shards in parallel using thread mapping, concatenates the resulting tables, and returns a dictionary containing the Arrow table, the dataset info, and the split name.
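The mapping from a requested row range to per-shard skip/take values can be illustrated with a minimal sketch. This is not the library's code: FileSlice, slice_shards, and the shard file names are hypothetical, and the only input assumed is a list of shard lengths such as the one carried in the split metadata.

```python
from dataclasses import dataclass


@dataclass
class FileSlice:
    """Hypothetical stand-in for one entry of a file-instruction list."""
    filename: str
    skip: int  # rows to drop from the start of the shard
    take: int  # rows to keep after skipping


def slice_shards(filenames, shard_lengths, start, stop):
    """Map the absolute row range [start, stop) onto per-shard skip/take values."""
    slices = []
    offset = 0  # global index of the first row of the current shard
    for filename, length in zip(filenames, shard_lengths):
        shard_start, shard_stop = offset, offset + length
        offset = shard_stop
        # Overlap between the requested range and this shard.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo >= hi:
            continue  # shard lies entirely outside the requested range
        slices.append(FileSlice(filename, skip=lo - shard_start, take=hi - lo))
    return slices


# Hypothetical shard names and lengths, for illustration only.
files = ["shard-0.arrow", "shard-1.arrow", "shard-2.arrow"]
result = slice_shards(files, shard_lengths=[100, 100, 50], start=80, stop=210)
for item in result:
    print(item)
```

In this example the range covers the last 20 rows of the first shard, all of the second, and the first 10 rows of the third, so shards outside the range never need to be opened.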

Usage

ArrowReader is used internally by DatasetBuilder._as_dataset when constructing a Dataset from prepared Arrow files. It is not typically instantiated directly by end users, but understanding its behavior is useful for debugging cache issues or implementing custom dataset builders.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_reader.py
Lines: L285-L314 (ArrowReader class), with base class at L167-L282

Signature

class ArrowReader(BaseReader):
    """
    Build a Dataset object out of Instruction instance(s).
    This Reader uses either memory mapping or file descriptors (in-memory) on arrow files.
    """

    def __init__(self, path: str, info: Optional["DatasetInfo"]):

Key inherited methods:

def read(
    self,
    name,
    instructions,
    split_infos,
    in_memory=False,
):

def read_files(
    self,
    files: list[dict],
    original_instructions: Union[None, "ReadInstruction", "Split"] = None,
    in_memory=False,
):
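The load-in-parallel-then-concatenate pattern behind read_files can be sketched with the standard library. This is an illustration under simplifying assumptions, not the library's implementation: load_shard is a hypothetical loader, and plain Python lists stand in for Arrow tables.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def load_shard(instruction):
    """Hypothetical loader: return the rows selected by skip/take from one shard."""
    rows = instruction["rows"]  # stand-in for data read from disk
    skip, take = instruction["skip"], instruction["take"]
    return rows[skip:skip + take]


instructions = [
    {"rows": list(range(0, 5)), "skip": 3, "take": 2},
    {"rows": list(range(5, 10)), "skip": 0, "take": 5},
    {"rows": list(range(10, 15)), "skip": 0, "take": 1},
]

# Executor.map yields results in input order, so shard order is preserved
# even though the shards are loaded concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    shards = list(pool.map(load_shard, instructions))

# Concatenate the per-shard results into one sequence.
combined = list(chain.from_iterable(shards))
print(combined)
```

Preserving input order matters here because the concatenated result must keep rows in the same order as the shards on disk.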

Import

from datasets.arrow_reader import ArrowReader

I/O Contract

Inputs

Name	Type	Required	Description
path	`str`	Yes	Path to the directory where Arrow shard files are stored.
info	`Optional[DatasetInfo]`	Yes	Dataset information including feature schemas and split metadata. The argument must be supplied; the signature types it as `Optional`.

read() method inputs:

Name	Type	Required	Description
name	`str`	Yes	Name of the dataset (used to locate shard files).
instructions	`ReadInstruction` or `str`	Yes	Instructions specifying which split/slice to read (e.g., `"train"`, `"train[:10%]"`).
split_infos	`list[SplitInfo]`	Yes	Available split information for the dataset (filenames, lengths, shard lengths).
in_memory	`bool`	No	Whether to copy the data in-memory. Defaults to `False` (memory-mapped).
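To make the slicing strings concrete, the following sketch resolves a string such as "train[:10%]" to absolute row boundaries. It is a simplified illustration rather than the library's parser: the regular expression, the helper names, and round-to-nearest for percentages are all assumptions, and 8530 is just an example split length.

```python
import re

# Hypothetical, simplified pattern: split name plus an optional [from:to] slice.
_SPEC = re.compile(r"^(?P<split>\w+)(\[(?P<start>-?\d+%?)?:(?P<stop>-?\d+%?)?\])?$")


def _to_abs(bound, num_examples, default):
    """Convert one boundary ("10%", "-100", or None) to an absolute row index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        # Assumption: percentages are rounded to the nearest row.
        value = round(int(bound[:-1]) * num_examples / 100)
    else:
        value = int(bound)
    if value < 0:
        value += num_examples  # negative bounds count from the end
    return min(max(value, 0), num_examples)


def resolve(spec, num_examples):
    """Return (split, start, stop) for specs like "train" or "train[:10%]"."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"Unrecognized instruction: {spec!r}")
    start = _to_abs(match["start"], num_examples, default=0)
    stop = _to_abs(match["stop"], num_examples, default=num_examples)
    return match["split"], start, stop


print(resolve("train", 8530))
print(resolve("train[:10%]", 8530))
print(resolve("train[100:200]", 8530))
print(resolve("train[-10%:]", 8530))
```

Once a slice is reduced to an absolute [start, stop) range, it can be mapped onto individual shards using the shard lengths from split_infos.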

Outputs

Name	Type	Description
(return value from `read`)	`dict`	Dictionary with keys `"arrow_table"` (a concatenated Arrow table), `"info"` (`DatasetInfo`), and `"split"` (`NamedSplit`). These are passed as keyword arguments to the `Dataset` constructor.

Usage Examples

Basic Usage

from datasets import Dataset
from datasets.arrow_reader import ArrowReader

# Typically used internally by DatasetBuilder._as_dataset.
# cache_dir and dataset_info are placeholders: the directory holding the
# prepared Arrow shards and the matching DatasetInfo (with populated splits).
reader = ArrowReader(cache_dir, dataset_info)
dataset_kwargs = reader.read(
    name="rotten_tomatoes",
    instructions="train",
    split_infos=dataset_info.splits.values(),
    in_memory=False,
)
# dataset_kwargs contains: {"arrow_table": ..., "info": ..., "split": ...}
ds = Dataset(**dataset_kwargs)

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Arrow_File_Reading
