Implementation:Huggingface Datasets Arrow Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Arrow |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder for loading Apache Arrow IPC format files, provided by the HuggingFace Datasets library.
Description
Arrow is a packaged dataset builder that extends datasets.ArrowBasedBuilder and handles loading data from Apache Arrow IPC files (both streaming and file formats). It is paired with ArrowConfig, a BuilderConfig subclass that optionally accepts a features parameter for explicit schema specification.
If no features are provided, the builder infers the feature schema from the Arrow file's schema. During table generation, it first attempts to open each file as an IPC stream and falls back to the IPC file format if that fails. Each record batch is yielded individually as a PyArrow table. Before a table is yielded, the _cast_table step casts it to the feature schema when one is set; this cast supports nested features whose keys appear in a different order, as well as type coercion (e.g., string to Audio).
Usage
Use this builder via load_dataset("arrow", data_files=...) to load Arrow IPC files. It is also triggered automatically when files with the .arrow extension are detected by the dataset loading pipeline.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/arrow/arrow.py
- Lines: 1-77
Signature
@dataclass
class ArrowConfig(datasets.BuilderConfig):
    """BuilderConfig for Arrow."""

    features: Optional[datasets.Features] = None


class Arrow(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = ArrowConfig
Key methods:
def _info(self):
    return datasets.DatasetInfo(features=self.config.features)

def _split_generators(self, dl_manager):
    # Downloads data files and infers features from Arrow schema if needed
    # Returns SplitGenerator for each split with file lists
    ...

def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    # Casts table to match explicit features schema using table_cast
    # Supports nested features with keys in different order
    ...

def _generate_tables(self, files):
    # Yields (Key, pa.Table) for each record batch in each file
    # Tries IPC stream format first, falls back to IPC file format
    ...
Import
# Used via load_dataset
from datasets import load_dataset
ds = load_dataset("arrow", data_files="path/to/file.arrow")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_files | str, list, or dict | Yes | Path(s) to Arrow IPC files. Can be a single path, a list, or a dict mapping split names to file paths. |
| features | Optional[datasets.Features] | No | Explicit feature schema. If not provided, inferred from the Arrow file schema. |
Outputs
| Name | Type | Description |
|---|---|---|
| (from _generate_tables) | tuple[Key, pa.Table] | Yields tuples of (Key(file_idx, batch_idx), pa_table) for each record batch in each Arrow file. |
| (from load_dataset) | Dataset or DatasetDict | The loaded dataset with Arrow-backed storage. |
Usage Examples
Basic Usage
from datasets import load_dataset
# Load a single Arrow file
ds = load_dataset("arrow", data_files="data/train.arrow", split="train")
print(ds[0])
# Load multiple splits
ds = load_dataset("arrow", data_files={
    "train": "data/train.arrow",
    "test": "data/test.arrow",
})
print(ds["train"][0])
With Explicit Features
from datasets import load_dataset, Features, Value
features = Features({
    "text": Value("string"),
    "label": Value("int64"),
})
ds = load_dataset("arrow", data_files="data/train.arrow", features=features, split="train")