Implementation:Huggingface Datasets Arrow Builder

From Leeroopedia
Knowledge Sources
Domains Data_Loading, Arrow
Last Updated 2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading Apache Arrow IPC format files, provided by the HuggingFace Datasets library.

Description

Arrow is a packaged dataset builder that extends datasets.ArrowBasedBuilder and handles loading data from Apache Arrow IPC files (both streaming and file formats). It is paired with ArrowConfig, a BuilderConfig subclass that optionally accepts a features parameter for explicit schema specification.

The builder automatically infers the feature schema from the Arrow file's schema if none is provided. During table generation, it attempts to open each file as an IPC stream first, falling back to the IPC file format if that fails. Each record batch is yielded individually as a PyArrow table, and an optional _cast_table step performs schema casting when features are explicitly specified, supporting nested feature reordering and type coercion (e.g., string to Audio).

Usage

Use this builder via load_dataset("arrow", data_files=...) to load Arrow IPC files. It is also triggered automatically when files with the .arrow extension are detected by the dataset loading pipeline.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/arrow/arrow.py
  • Lines: 1-77

Signature

from dataclasses import dataclass
from typing import Optional

import datasets

@dataclass
class ArrowConfig(datasets.BuilderConfig):
    """BuilderConfig for Arrow."""
    features: Optional[datasets.Features] = None

class Arrow(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = ArrowConfig

Key methods:

# Methods of the Arrow class; bodies are elided and summarized in comments.

def _info(self):
    return datasets.DatasetInfo(features=self.config.features)

def _split_generators(self, dl_manager):
    # Downloads the data files and infers features from the Arrow schema if needed.
    # Returns a SplitGenerator for each split, each carrying its list of files.
    ...

def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    # Casts the table to match the explicit features schema using table_cast.
    # Supports nested features whose keys appear in a different order.
    ...

def _generate_tables(self, files):
    # Yields (Key, pa.Table) for each record batch in each file.
    # Tries the IPC stream format first, then falls back to the IPC file format.
    ...

Import

# Used via load_dataset
from datasets import load_dataset
ds = load_dataset("arrow", data_files="path/to/file.arrow")

I/O Contract

Inputs

Name | Type | Required | Description
data_files | str, list, or dict | Yes | Path(s) to Arrow IPC files. Can be a single path, a list, or a dict mapping split names to file paths.
features | Optional[datasets.Features] | No | Explicit feature schema. If not provided, inferred from the Arrow file schema.
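The three accepted shapes of data_files can be thought of as collapsing into one mapping from split name to a list of paths. The sketch below illustrates that idea only; normalize_data_files is a hypothetical helper, not a function of the Datasets library, and it assumes that paths given without a split name belong to the default "train" split.

```python
def normalize_data_files(data_files):
    """Hypothetical helper: map str / list / dict input to {split: [paths]}."""
    if isinstance(data_files, str):
        # A single path: assumed to belong to the default "train" split.
        return {"train": [data_files]}
    if isinstance(data_files, (list, tuple)):
        # Several paths without split names: all assumed to be "train".
        return {"train": list(data_files)}
    if isinstance(data_files, dict):
        # Explicit split names: each value may be one path or several.
        return {
            split: [paths] if isinstance(paths, str) else list(paths)
            for split, paths in data_files.items()
        }
    raise TypeError(f"Unsupported data_files type: {type(data_files).__name__}")


single = normalize_data_files("data/train.arrow")
several = normalize_data_files(["data/a.arrow", "data/b.arrow"])
by_split = normalize_data_files({"train": "data/train.arrow", "test": ["data/test.arrow"]})
print(single, several, by_split, sep="\n")
```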

Outputs

Name Type Description
(from _generate_tables) tuple[Key, pa.Table] Yields tuples of (Key(file_idx, batch_idx), pa_table) for each record batch in each Arrow file.
(from load_dataset) Dataset or DatasetDict The loaded dataset with Arrow-backed storage.

Usage Examples

Basic Usage

from datasets import load_dataset

# Load a single Arrow file
ds = load_dataset("arrow", data_files="data/train.arrow", split="train")
print(ds[0])

# Load multiple splits
ds = load_dataset("arrow", data_files={
    "train": "data/train.arrow",
    "test": "data/test.arrow",
})
print(ds["train"][0])

With Explicit Features

from datasets import load_dataset, Features, Value

features = Features({
    "text": Value("string"),
    "label": Value("int64"),
})
ds = load_dataset("arrow", data_files="data/train.arrow", features=features, split="train")

Related Pages

Implements Principle
