Implementation:Huggingface Datatrove IpcReader

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, ETL
Last Updated	2026-02-14 17:00 GMT

Overview

IpcReader is a pipeline reader that reads data from Apache Arrow IPC (Inter-Process Communication) files, supporting both file and stream formats.

Description

IpcReader extends BaseDiskReader to ingest data stored in Apache Arrow's IPC format, a binary columnar format designed for zero-copy data exchange between processes. The reader supports two reading modes. File mode (the default) reads record batches from a random-access IPC file. Stream mode reads from an Arrow IPC stream, which must be consumed sequentially.

In file mode, the reader uses pa.ipc.open_file and iterates over record batches by index. In stream mode, it uses pa.ipc.open_stream and iterates over batches as they arrive. Each batch is converted to a Python list of dictionaries using batch.to_pylist(), and each dictionary is then transformed into a Document via get_document_from_dict.

The class requires the pyarrow dependency, which is declared via the _requires_dependencies class attribute. Like all Datatrove readers, it supports file sharding, progress tracking, document limiting, skipping, custom adapters, and configurable text/ID key mappings.
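A custom adapter is useful when the Arrow columns do not map cleanly onto text and ID fields. The sketch below assumes the adapter is called as adapter(self, data, path, id_in_file) and returns a dict with "text", "id", and "metadata" keys; this call signature and return shape are assumptions modeled on Datatrove's default adapter and should be checked against the BaseReader source. The function name simple_adapter and the SimpleNamespace stand-in for a reader instance are hypothetical.

```python
from types import SimpleNamespace


def simple_adapter(self, data, path, id_in_file):
    """Hypothetical custom adapter: map a raw row dict to Document fields."""
    row = dict(data)  # copy so the caller's dict is not mutated
    return {
        "text": row.pop(self.text_key, ""),
        # Fall back to a path-based ID when the row has no ID column.
        "id": row.pop(self.id_key, f"{path}/{id_in_file}"),
        # Any remaining columns become metadata.
        "metadata": row,
    }


# Stand-in for a reader instance; only the two key attributes are used here.
fake_reader = SimpleNamespace(text_key="content", id_key="uid")
doc_fields = simple_adapter(
    fake_reader, {"content": "hello", "lang": "en"}, "shard0.arrow", 3
)
```

Under these assumptions, such a function would be passed as adapter=simple_adapter when constructing the reader, alongside text_key="content" and id_key="uid".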

Usage

Use IpcReader when your data is stored in Apache Arrow IPC format (.arrow or .ipc files). This is common in data engineering workflows that use Arrow for efficient columnar data interchange, or when data has been serialized from Arrow-based frameworks such as Polars or PyArrow.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/readers/ipc.py
Lines: 1-96

Signature

class IpcReader(BaseDiskReader):
    name = "🪶 Ipc"
    _requires_dependencies = ["pyarrow"]

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        limit: int = -1,
        skip: int = 0,
        stream: bool = False,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )

Import

from datatrove.pipeline.readers.ipc import IpcReader

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	DataFolderLike	Yes	Path or filesystem object pointing to the folder containing IPC files
paths_file	DataFileLike or None	No	Optional file listing specific paths to read (one per line)
limit	int	No	Maximum number of documents to read; -1 means no limit (default: -1)
skip	int	No	Number of initial rows to skip (default: 0)
stream	bool	No	If True, reads using IPC stream format instead of file format (default: False)
file_progress	bool	No	Whether to show a progress bar for files (default: False)
doc_progress	bool	No	Whether to show a progress bar for documents (default: False)
adapter	Callable	No	Custom function to transform raw data dicts into Document-compatible dicts (default: None)
text_key	str	No	Column name containing document text (default: "text")
id_key	str	No	Column name containing document ID (default: "id")
default_metadata	dict	No	Default metadata added to all documents (default: None)
recursive	bool	No	Whether to search for files recursively (default: True)
glob_pattern	str or None	No	Glob pattern to filter which files are included (default: None)
shuffle_files	bool	No	Whether to shuffle files within the returned shard (default: False)
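
The interaction between skip and limit can be sketched with plain Python. This is an illustration only, assuming that skip is applied first and limit then caps the rows that remain; the helper name apply_skip_and_limit is hypothetical and is not part of Datatrove.

```python
from itertools import islice


def apply_skip_and_limit(rows, skip=0, limit=-1):
    """Drop the first `skip` rows, then yield at most `limit` rows (-1 = all)."""
    stop = None if limit == -1 else skip + limit
    return islice(rows, skip, stop)


rows = [{"id": i} for i in range(10)]
selected = list(apply_skip_and_limit(rows, skip=2, limit=3))
everything = list(apply_skip_and_limit(rows))
```

With skip=2 and limit=3, rows 2 through 4 are kept; with the defaults, every row passes through.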

Outputs

Name	Type	Description
documents	Generator[Document]	Yields Document objects extracted from Arrow IPC record batches

Usage Examples

Basic Usage

from datatrove.pipeline.readers.ipc import IpcReader

# Read Arrow IPC files from a local directory
reader = IpcReader(
    data_folder="path/to/arrow/files",
    text_key="text",
    id_key="id",
)

Stream Mode

from datatrove.pipeline.readers.ipc import IpcReader

# Read IPC stream files
reader = IpcReader(
    data_folder="path/to/stream/files",
    stream=True,
    glob_pattern="*.arrows",
)

Related Pages

Principle:Huggingface_Datatrove_IPC_Data_Reading
