
Implementation:Huggingface Datasets ArrowReader

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool from the HuggingFace Datasets library that reads cached Arrow IPC files into memory-mapped (or, optionally, in-memory) tables for Dataset construction.

Description

ArrowReader is a subclass of BaseReader that reads Apache Arrow IPC files from a cache directory and produces the keyword arguments needed to construct a Dataset object. It sets the filetype suffix to "arrow" and implements _get_table_from_filename, which opens an individual shard file using either memory mapping or in-memory reading and applies skip/take slicing to the resulting table. The inherited read method computes file instructions from the split metadata and the requested read instruction, which may use percentage-based or absolute slicing. It then calls read_files, which loads all relevant shards in parallel using thread mapping, concatenates the resulting tables, and returns a dictionary containing the Arrow table, the dataset info, and the split name.
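The mapping from a requested example range to per-shard skip/take values can be sketched in plain Python. This is a minimal illustration rather than the library's code: the function name shard_file_instructions, the shard file names, and the choice to report take as an explicit row count are all assumptions made for this example.

```python
def shard_file_instructions(filenames, shard_lengths, start, stop):
    """Map the half-open example range [start, stop) onto per-shard skip/take pairs.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    instructions = []
    shard_start = 0  # global index of the first example in the current shard
    for filename, length in zip(filenames, shard_lengths):
        shard_stop = shard_start + length
        # Overlap between the requested range and this shard, in global indices.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo < hi:
            instructions.append(
                {"filename": filename, "skip": lo - shard_start, "take": hi - lo}
            )
        shard_start = shard_stop
    return instructions


# Three shards holding 100, 100, and 50 examples; request examples 80 to 219.
plan = shard_file_instructions(
    ["train-00000.arrow", "train-00001.arrow", "train-00002.arrow"],
    [100, 100, 50],
    start=80,
    stop=220,
)
for item in plan:
    print(item)
```

Shards that do not overlap the requested range produce no instruction at all, so they never need to be opened.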

Usage

ArrowReader is used internally by DatasetBuilder._as_dataset when constructing a Dataset from prepared Arrow files. It is not typically instantiated directly by end users, but understanding its behavior is useful for debugging cache issues or implementing custom dataset builders.
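When debugging a cache problem, a useful first step is to check which Arrow files are actually present in the directory the reader is pointed at. The sketch below uses only the standard library; the helper name list_arrow_shards and the demo file names are hypothetical, and it assumes the shard files sit directly in the cache directory with the "arrow" suffix described above.

```python
import tempfile
from pathlib import Path


def list_arrow_shards(cache_dir):
    """Return (file name, size in bytes) for every `.arrow` file directly in cache_dir.

    Hypothetical debugging helper; it is not part of `datasets`.
    """
    return sorted(
        (path.name, path.stat().st_size) for path in Path(cache_dir).glob("*.arrow")
    )


# Demo on a throwaway directory with one fake shard and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo-train.arrow").write_bytes(b"\x00" * 8)
    (Path(tmp) / "dataset_info.json").write_text("{}")
    shards = list_arrow_shards(tmp)

print(shards)
```

An empty result, or a zero-byte file, suggests the dataset was never fully prepared in that directory.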

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_reader.py
  • Lines: L285-L314 (ArrowReader class), with base class at L167-L282

Signature

class ArrowReader(BaseReader):
    """
    Build a Dataset object out of Instruction instance(s).
    This Reader uses either memory mapping or file descriptors (in-memory) on arrow files.
    """

    def __init__(self, path: str, info: Optional["DatasetInfo"]):

Key inherited methods:

def read(
    self,
    name,
    instructions,
    split_infos,
    in_memory=False,
):

def read_files(
    self,
    files: list[dict],
    original_instructions: Union[None, "ReadInstruction", "Split"] = None,
    in_memory=False,
):
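The parallel loading step performed by read_files can be illustrated with a thread pool from the standard library. This is a sketch under stated assumptions, not the library's implementation: load_shard, load_all, and FAKE_SHARDS are hypothetical stand-ins, and Python lists take the place of Arrow tables.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards" standing in for Arrow files on disk.
FAKE_SHARDS = {
    "shard-0": list(range(0, 5)),
    "shard-1": list(range(5, 10)),
}


def load_shard(file_instruction):
    """Stand-in for reading one shard: return the rows selected by skip/take."""
    rows = FAKE_SHARDS[file_instruction["filename"]]
    skip, take = file_instruction["skip"], file_instruction["take"]
    return rows[skip : skip + take]


def load_all(file_instructions, max_workers=4):
    """Load every shard in a thread pool and concatenate the parts in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so shard order is preserved.
        parts = list(pool.map(load_shard, file_instructions))
    return [row for part in parts for row in part]


rows = load_all(
    [
        {"filename": "shard-0", "skip": 3, "take": 2},
        {"filename": "shard-1", "skip": 0, "take": 3},
    ]
)
print(rows)
```

Preserving input order matters here because the concatenated result must follow the order of the file instructions, regardless of which thread finishes first.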

Import

from datasets.arrow_reader import ArrowReader

I/O Contract

Inputs

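Name | Type | Required | Description
path | str | Yes | Path to the directory where Arrow shard files are stored.
info | Optional[DatasetInfo] | Yes | Dataset information including feature schemas and split metadata. The argument must be passed, although the signature annotates it as Optional.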

read() method inputs:

Name | Type | Required | Description
name | str | Yes | Name of the dataset (used to locate shard files).
instructions | ReadInstruction or str | Yes | Instructions specifying which split/slice to read (e.g., "train", "train[:10%]").
split_infos | list[SplitInfo] | Yes | Available split information for the dataset (filenames, lengths, shard lengths).
in_memory | bool | No | Whether to copy the data in-memory. Defaults to False (memory-mapped).
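A percentage-based instruction such as "train[:10%]" has to be converted into absolute example indices before shards can be selected. The following is a minimal sketch of that conversion. The helper name percent_slice_to_absolute is hypothetical, and rounding each boundary to the closest example is an assumption of this sketch, not a statement about the library's rounding rules.

```python
def percent_slice_to_absolute(num_examples, from_pct=None, to_pct=None):
    """Convert percentage boundaries into absolute [start, stop) indices.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """

    def to_index(pct, default):
        if pct is None:
            return default
        index = int(round(pct * num_examples / 100))
        if index < 0:
            index += num_examples  # negative boundaries count from the end
        # Clamp so the result always lies inside the split.
        return max(0, min(num_examples, index))

    start = to_index(from_pct, 0)
    stop = to_index(to_pct, num_examples)
    return start, stop


# A split with 1000 examples (an arbitrary size chosen for the example).
print(percent_slice_to_absolute(1000, to_pct=10))       # like "train[:10%]"
print(percent_slice_to_absolute(1000, from_pct=-10))    # like "train[-10%:]"
print(percent_slice_to_absolute(1000, 25, 75))          # like "train[25%:75%]"
```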

Outputs

Name | Type | Description
(return value from read) | dict | Dictionary with keys "arrow_table" (a concatenated Arrow table), "info" (DatasetInfo), and "split" (NamedSplit). These are passed as keyword arguments to the Dataset constructor.

Usage Examples

Basic Usage

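from datasets import Dataset, load_dataset_builder
from datasets.arrow_reader import ArrowReader

# Assumption for this example: a prepared builder supplies the cache directory
# and DatasetInfo that DatasetBuilder._as_dataset would normally pass in.
builder = load_dataset_builder("rotten_tomatoes")
builder.download_and_prepare()
cache_dir, dataset_info = builder.cache_dir, builder.info

# Typically used internally by DatasetBuilder._as_dataset:
reader = ArrowReader(cache_dir, dataset_info)
dataset_kwargs = reader.read(
    name=builder.dataset_name,  # "rotten_tomatoes" in this example
    instructions="train",
    split_infos=dataset_info.splits.values(),
    in_memory=False,  # memory-map the shards instead of copying them
)
# dataset_kwargs contains: {"arrow_table": ..., "info": ..., "split": ...}
ds = Dataset(**dataset_kwargs)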

Related Pages

Implements Principle
