Implementation:Huggingface Datasets Arrow Builder

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Loading, Arrow
Last Updated	2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading Apache Arrow IPC format files, provided by the HuggingFace Datasets library.

Description

Arrow is a packaged dataset builder that extends datasets.ArrowBasedBuilder and handles loading data from Apache Arrow IPC files (both streaming and file formats). It is paired with ArrowConfig, a BuilderConfig subclass that optionally accepts a features parameter for explicit schema specification.

If no features are provided, the builder infers the feature schema from the Arrow file's schema. During table generation, it first tries to open each file as an IPC stream and falls back to the IPC file format if that fails. Each record batch is yielded as its own PyArrow table. When features are explicitly specified, the _cast_table step casts each table to that schema, which handles nested features whose keys appear in a different order as well as type coercion (e.g., string to Audio).

Usage

Use this builder via load_dataset("arrow", data_files=...) to load Arrow IPC files. It is also triggered automatically when files with the .arrow extension are detected by the dataset loading pipeline.

Code Reference

Source Location

Repository: datasets
File: src/datasets/packaged_modules/arrow/arrow.py
Lines: 1-77

Signature

from dataclasses import dataclass
from typing import Optional

import datasets

@dataclass
class ArrowConfig(datasets.BuilderConfig):
    """BuilderConfig for Arrow."""
    features: Optional[datasets.Features] = None

class Arrow(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = ArrowConfig

Key methods of Arrow (bodies elided; the comments summarize the behavior):

def _info(self):
    return datasets.DatasetInfo(features=self.config.features)

def _split_generators(self, dl_manager):
    # Downloads the data files and, if no features were given, infers them from the Arrow schema.
    # Returns one SplitGenerator per split, each carrying that split's file list.
    ...

def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    # Casts the table to the explicit features schema using table_cast.
    # Supports nested features whose keys are in a different order.
    ...

def _generate_tables(self, files):
    # Yields (Key, pa.Table) for each record batch in each file.
    # Tries the IPC stream format first, then falls back to the IPC file format.
    ...

Import

# Used via load_dataset
from datasets import load_dataset
ds = load_dataset("arrow", data_files="path/to/file.arrow")

I/O Contract

Inputs

Name	Type	Required	Description
data_files	`str`, `list`, or `dict`	Yes	Path(s) to Arrow IPC files. Can be a single path, a list, or a dict mapping split names to file paths.
features	`Optional[datasets.Features]`	No	Explicit feature schema. If not provided, inferred from the Arrow file schema.

Outputs

Name	Type	Description
(from `_generate_tables`)	`tuple[Key, pa.Table]`	Yields tuples of `(Key(file_idx, batch_idx), pa_table)` for each record batch in each Arrow file.
(from `load_dataset`)	`Dataset` or `DatasetDict`	The loaded dataset with Arrow-backed storage.

Usage Examples

Basic Usage

from datasets import load_dataset

# Load a single Arrow file
ds = load_dataset("arrow", data_files="data/train.arrow", split="train")
print(ds[0])

# Load multiple splits
ds = load_dataset("arrow", data_files={
    "train": "data/train.arrow",
    "test": "data/test.arrow",
})
print(ds["train"][0])

With Explicit Features

from datasets import load_dataset, Features, Value

features = Features({
    "text": Value("string"),
    "label": Value("int64"),
})
ds = load_dataset("arrow", data_files="data/train.arrow", features=features, split="train")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Arrow_Dataset_Building
