Principle:Huggingface Datasets Arrow File Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Arrow File Reading is the process of loading cached Apache Arrow IPC files into memory-mapped or in-memory tables for constructing Dataset objects.
Description
Apache Arrow IPC (Inter-Process Communication) files are the primary on-disk format used by the HuggingFace Datasets library to store prepared dataset shards. Arrow File Reading handles the low-level mechanics of opening these files and producing Arrow tables that back the high-level Dataset API.
The reading process addresses several concerns:
- Memory mapping vs. in-memory: By default, Arrow files are memory-mapped, meaning the operating system pages data from disk into RAM on demand. This allows datasets larger than available memory to be used. Alternatively, files can be read entirely into memory for maximum access speed at the cost of higher memory usage.
- Sharded reading: Datasets may be split across multiple shard files. The reader handles instructions specifying which files to read and which rows to skip/take within each file, enabling efficient partial reads (e.g., reading only the first 10% of a split).
- Table concatenation: When a read spans multiple shards, the resulting Arrow tables are concatenated into a single logical table while preserving zero-copy semantics where possible.
- Thread-parallel loading: When many shard files exist, loading is parallelized across threads with a progress bar to provide feedback.
- Format abstraction: The reading layer is abstracted behind a base class, allowing different on-disk formats (Arrow IPC, Parquet) to share the same interface for split resolution and file instruction computation.
Usage
Apply Arrow File Reading when:
- Constructing a Dataset from on-disk Arrow files after the download-and-prepare phase.
- Loading a subset of a split using percentage-based or absolute index slicing.
- Reading from a cache directory where dataset shards have been previously written.
- Deciding between memory-mapped and in-memory access based on available resources and access patterns.
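As a rough illustration of percentage-based slicing, the hypothetical helper below converts percent boundaries into an absolute row range by rounding each boundary to the nearest row. The rounding rule is an assumption made for this sketch; the library may apply a different one.

```python
def pct_slice_to_rows(num_rows, start_pct=0, stop_pct=100):
    """Convert percent boundaries of a split into an absolute [start, stop) row range."""
    if not 0 <= start_pct <= stop_pct <= 100:
        raise ValueError("expected 0 <= start_pct <= stop_pct <= 100")
    # Round each boundary to the nearest row (assumed rounding rule).
    start = int(round(start_pct * num_rows / 100))
    stop = int(round(stop_pct * num_rows / 100))
    return start, stop


# The first 10% of a 1,000-row split is the half-open row range [0, 100).
print(pct_slice_to_rows(1000, 0, 10))  # (0, 100)
```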
Theoretical Basis
The reading pipeline operates as follows:
READ(name, instructions, split_infos, in_memory):
    1. COMPUTE file instructions from split_infos and read instruction:
        - Map split name to shard filenames
        - Compute skip/take for each shard based on requested slice
    2. For each file instruction (possibly in parallel):
        a. OPEN Arrow IPC file:
            - If in_memory=False: memory-map the file
            - If in_memory=True: read entire file into memory
        b. APPLY skip/take by slicing the Arrow table
    3. FILTER out empty tables
    4. CONCATENATE all non-empty tables into a single Arrow table
    5. Return {arrow_table, info, split} as kwargs for Dataset construction
The skip/take mechanism supports both contiguous reads (for non-sharded splits) and shard-aware reads (where the skip/take window may span multiple shard boundaries).
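A shard-aware skip/take computation can be sketched as follows. `FileInstruction` and `compute_file_instructions` are hypothetical names for this illustration, and the sketch assumes the per-shard row counts are known in advance (for example, from split metadata).

```python
from dataclasses import dataclass


@dataclass
class FileInstruction:
    filename: str
    skip: int  # rows to skip at the start of the shard
    take: int  # rows to read after skipping


def compute_file_instructions(filenames, shard_lengths, start, stop):
    """Map an absolute [start, stop) row range onto per-shard skip/take windows."""
    instructions = []
    shard_start = 0
    for filename, length in zip(filenames, shard_lengths):
        shard_stop = shard_start + length
        # Intersect the requested range with the rows held by this shard.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo < hi:
            instructions.append(
                FileInstruction(filename, skip=lo - shard_start, take=hi - lo)
            )
        shard_start = shard_stop
    return instructions


# Rows 150..219 of a split stored as shards of 100, 100, and 50 rows:
# the first shard is skipped entirely, the second contributes its last
# 50 rows, and the third contributes its first 20 rows.
plan = compute_file_instructions(
    ["train-0.arrow", "train-1.arrow", "train-2.arrow"], [100, 100, 50], 150, 220
)
```

Each resulting instruction can then be applied to its opened table with a zero-copy slice such as `table.slice(skip, take)` in `pyarrow`.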