Implementation:Huggingface Datatrove IpcReader
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
IpcReader is a pipeline reader that loads documents from Apache Arrow IPC (Inter-Process Communication) files, in either the file format or the stream format.
Description
IpcReader extends BaseDiskReader to provide ingestion of data stored in Apache Arrow's IPC format. Arrow IPC is a high-performance binary columnar format designed for zero-copy data exchange between processes. The reader supports two distinct reading modes: file mode (the default), which reads record batches from a random-access IPC file, and stream mode, which reads from an Arrow IPC stream that must be consumed sequentially.
In file mode, the reader uses pa.ipc.open_file and iterates over record batches by index. In stream mode, it uses pa.ipc.open_stream and iterates over batches as they arrive. Each batch is converted to a Python list of dictionaries using batch.to_pylist(), and each dictionary is then transformed into a Document via get_document_from_dict.
The class requires the pyarrow dependency, which is declared via the _requires_dependencies class attribute. Like all Datatrove readers, it supports file sharding, progress tracking, document limiting, skipping, custom adapters, and configurable text/ID key mappings.
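As an illustration of the adapter hook, the function below reshapes a raw row dict into the fields a Document needs. It is a sketch under assumptions: the column names (`title`, `body`, `doc_id`) are hypothetical, and the four-argument call convention `(self, data, path, id_in_file)` is assumed from Datatrove's base reader and should be checked against the installed version.

```python
def title_body_adapter(self, data: dict, path: str, id_in_file) -> dict:
    """Hypothetical adapter: join two text columns and keep the rest as metadata."""
    data = dict(data)  # work on a copy so the caller's row is left untouched
    title = data.pop("title", "")
    body = data.pop("body", "")
    return {
        "text": f"{title}\n\n{body}".strip(),
        # Fall back to a path-based ID when the row carries none.
        "id": data.pop("doc_id", f"{path}/{id_in_file}"),
        "metadata": data,  # whatever columns remain
    }


row = {"doc_id": "x1", "title": "Hello", "body": "World", "lang": "en"}
adapted = title_body_adapter(None, row, "part-0.arrow", 0)
```

Such a function would be handed to the reader as `adapter=title_body_adapter`.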
Usage
Use IpcReader when your data is stored in Apache Arrow IPC format (.arrow or .ipc files). This is common in data engineering workflows that use Arrow for efficient columnar data interchange, or when data has been serialized from Arrow-based frameworks such as Polars or PyArrow.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/readers/ipc.py
- Lines: 1-96
Signature
class IpcReader(BaseDiskReader):
    name = "🪶 Ipc"
    _requires_dependencies = ["pyarrow"]

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        limit: int = -1,
        skip: int = 0,
        stream: bool = False,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )
Import
from datatrove.pipeline.readers.ipc import IpcReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | DataFolderLike | Yes | Path or filesystem object pointing to the folder containing IPC files |
| paths_file | DataFileLike or None | No | Optional file listing specific paths to read, one per line (default: None) |
| limit | int | No | Maximum number of documents to read; -1 means no limit (default: -1) |
| skip | int | No | Number of initial rows to skip (default: 0) |
| stream | bool | No | If True, reads using IPC stream format instead of file format (default: False) |
| file_progress | bool | No | Whether to show a progress bar for files (default: False) |
| doc_progress | bool | No | Whether to show a progress bar for documents (default: False) |
| adapter | Callable | No | Custom function to transform raw data dicts into Document-compatible dicts (default: None) |
| text_key | str | No | Column name containing document text (default: "text") |
| id_key | str | No | Column name containing document ID (default: "id") |
| default_metadata | dict | No | Default metadata added to all documents (default: None) |
| recursive | bool | No | Whether to search for files recursively (default: True) |
| glob_pattern | str or None | No | Glob pattern to filter which files are included (default: None) |
| shuffle_files | bool | No | Whether to shuffle files within the returned shard (default: False) |
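
The `skip` and `limit` semantics in the table can be pictured with plain Python. This is an analogy over a single sequence of rows, not Datatrove's implementation, and the helper name `apply_skip_and_limit` is hypothetical.

```python
from itertools import islice


def apply_skip_and_limit(rows, skip=0, limit=-1):
    """Drop the first `skip` rows, then yield at most `limit` rows (-1 = no cap)."""
    stop = None if limit == -1 else skip + limit
    return islice(rows, skip, stop)


first = list(apply_skip_and_limit(range(10), skip=2, limit=3))
everything = list(apply_skip_and_limit(range(5)))
```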
Outputs
| Name | Type | Description |
|---|---|---|
| documents | Generator[Document] | Yields Document objects extracted from Arrow IPC record batches |
Usage Examples
Basic Usage
from datatrove.pipeline.readers.ipc import IpcReader

# Read Arrow IPC files from a local directory
reader = IpcReader(
    data_folder="path/to/arrow/files",
    text_key="text",
    id_key="id",
)
Stream Mode
from datatrove.pipeline.readers.ipc import IpcReader

# Read IPC stream files
reader = IpcReader(
    data_folder="path/to/stream/files",
    stream=True,
    glob_pattern="*.arrows",
)