Principle:Huggingface Datasets Arrow File Reading

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Arrow File Reading is the process of loading cached Apache Arrow IPC files into memory-mapped or in-memory tables for constructing Dataset objects.

Description

Apache Arrow IPC (Inter-Process Communication) files are the primary on-disk format used by the HuggingFace Datasets library to store prepared dataset shards. Arrow File Reading handles the low-level mechanics of opening these files and producing Arrow tables that back the high-level Dataset API.

The reading process addresses several concerns:

Memory mapping vs. in-memory: By default, Arrow files are memory-mapped, meaning the operating system pages data from disk into RAM on demand. This allows datasets larger than available memory to be used. Alternatively, files can be read entirely into memory for maximum access speed at the cost of higher memory usage.
Sharded reading: Datasets may be split across multiple shard files. The reader handles instructions specifying which files to read and which rows to skip/take within each file, enabling efficient partial reads (e.g., reading only the first 10% of a split).
Table concatenation: When a read spans multiple shards, the resulting Arrow tables are concatenated into a single logical table while preserving zero-copy semantics where possible.
Thread-parallel loading: When many shard files exist, loading is parallelized across threads with a progress bar to provide feedback.
Format abstraction: The reading layer is abstracted behind a base class, allowing different on-disk formats (Arrow IPC, Parquet) to share the same interface for split resolution and file instruction computation.

Usage

Apply Arrow File Reading when:

Constructing a Dataset from on-disk Arrow files after the download-and-prepare phase.
Loading a subset of a split using percentage-based or absolute index slicing.
Reading from a cache directory where dataset shards have been previously written.
Deciding between memory-mapped and in-memory access based on available resources and access patterns.
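For the slicing case, a percentage-based or absolute request first has to be resolved to absolute row indices. The sketch below shows one plausible way to do that; the function name to_absolute, its parameters, and the choice to clamp out-of-range boundaries (mirroring Python slice semantics) are assumptions for illustration and may differ from the library's own rounding and validation rules.

```python
def to_absolute(start, stop, num_rows, unit="abs"):
    """Resolve a [start:stop] slice over a split of num_rows rows.

    unit is "abs" for row indices or "%" for percentages. None means
    open-ended, and negative values count from the end of the split.
    Returns a (first_row, end_row) pair with end_row exclusive.
    """
    def resolve(boundary, default):
        if boundary is None:
            return default
        if unit == "%":
            if not -100 <= boundary <= 100:
                raise ValueError("percent boundaries must lie in [-100, 100]")
            # Round to the nearest row (an assumption of this sketch).
            boundary = round(boundary * num_rows / 100)
        if boundary < 0:
            boundary += num_rows
        # Clamp to the split, as Python slices do (an assumption of this sketch).
        return min(max(boundary, 0), num_rows)

    first = resolve(start, 0)
    end = resolve(stop, num_rows)
    return first, max(first, end)


first_tenth = to_absolute(None, 10, 250, unit="%")   # first 10% of 250 rows
last_fifth = to_absolute(-20, None, 250, unit="%")   # last 20% of 250 rows
clamped = to_absolute(100, 1000, 250)                # absolute slice past the end
```

The resulting pair is what the reader then turns into per-shard skip/take values.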

Theoretical Basis

The reading pipeline operates as follows:

READ(name, instructions, split_infos, in_memory):
  1. COMPUTE file instructions from split_infos and read instruction:
     - Map split name to shard filenames
     - Compute skip/take for each shard based on requested slice
  2. For each file instruction (possibly in parallel):
     a. OPEN Arrow IPC file:
        - If in_memory=False: memory-map the file
        - If in_memory=True: read entire file into memory
     b. APPLY skip/take by slicing the Arrow table
  3. FILTER out empty tables
  4. CONCATENATE all non-empty tables into a single Arrow table
  5. Return {arrow_table, info, split} as kwargs for Dataset construction

The skip/take mechanism supports both contiguous reads (for non-sharded splits) and shard-aware reads (where the skip/take window may span multiple shard boundaries).
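Step 1 of the pipeline, for a window that spans shard boundaries, can be sketched in pure Python as follows. The FileInstruction class and file_instructions function are hypothetical names for illustration, and the shard row counts are assumed to be known from the split metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInstruction:
    filename: str
    skip: int  # rows to drop from the start of the shard
    take: int  # rows to keep after skipping


def file_instructions(shards, start, stop):
    """Map the absolute row window [start, stop) onto per-shard skip/take.

    shards is an ordered list of (filename, num_rows) pairs.
    Shards that fall entirely outside the window are not read at all.
    """
    instructions = []
    offset = 0  # absolute index of the current shard's first row
    for filename, num_rows in shards:
        shard_start, shard_stop = offset, offset + num_rows
        offset = shard_stop
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo >= hi:
            continue  # no overlap between the window and this shard
        instructions.append(FileInstruction(filename, lo - shard_start, hi - lo))
    return instructions


shards = [("shard-0.arrow", 100), ("shard-1.arrow", 100), ("shard-2.arrow", 50)]
window = file_instructions(shards, 50, 220)
```

Here the window covers rows 50 to 219, so the first shard is read from its midpoint, the second in full, and only the first 20 rows of the third are taken. A non-sharded split is the special case of a single (filename, num_rows) pair.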

Related Pages

Implemented By

Implementation:Huggingface_Datasets_ArrowReader
