Implementation:Huggingface Datatrove IpcReader

From Leeroopedia
Knowledge Sources
Domains: Data Processing, ETL
Last Updated: 2026-02-14 17:00 GMT

Overview

IpcReader is a pipeline reader for Apache Arrow IPC (Inter-Process Communication) data. It supports both the IPC file format and the IPC stream format.

Description

IpcReader extends BaseDiskReader to provide ingestion of data stored in Apache Arrow's IPC format. Arrow IPC is a high-performance binary columnar format designed for zero-copy data exchange between processes. The reader supports two distinct reading modes: file mode (the default), which reads record batches from a random-access IPC file, and stream mode, which reads from an Arrow IPC stream that must be consumed sequentially.

In file mode, the reader uses pa.ipc.open_file and iterates over record batches by index. In stream mode, it uses pa.ipc.open_stream and iterates over batches as they arrive. Each batch is converted to a Python list of dictionaries using batch.to_pylist(), and each dictionary is then transformed into a Document via get_document_from_dict.

The class requires the pyarrow dependency, which is declared via the _requires_dependencies class attribute. Like all Datatrove readers, it supports file sharding, progress tracking, document limiting, skipping, custom adapters, and configurable text/ID key mappings.
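To make the text/ID key mapping concrete, here is a rough sketch of how a row dictionary could be turned into Document-style fields. It is an illustration under stated assumptions, not the library's adapter: the function name row_to_document_dict, the fallback ID built from the file path and row index, and the choice to move leftover columns into metadata are all assumptions.

```python
from typing import Any, Optional


def row_to_document_dict(
    row: dict,
    path: str,
    index: int,
    text_key: str = "text",
    id_key: str = "id",
    default_metadata: Optional[dict] = None,
) -> dict:
    """Hypothetical mapping from one Arrow row to Document-style fields."""
    remaining: dict = dict(row)  # copy so the caller's row is left untouched
    text: Any = remaining.pop(text_key, "")
    # Assumption: rows without an ID column get one derived from path and position.
    doc_id: Any = remaining.pop(id_key, f"{path}/{index}")
    # Assumption: columns that are neither text nor ID are kept as metadata,
    # layered on top of the defaults.
    metadata = {**(default_metadata or {}), **remaining}
    return {"text": text, "id": doc_id, "metadata": metadata}


doc = row_to_document_dict(
    {"content": "hello", "uid": "doc-1", "lang": "en"},
    path="part-0.arrow",
    index=0,
    text_key="content",
    id_key="uid",
    default_metadata={"source": "example"},
)
```

A custom adapter passed to the reader would play the same role as this function when the columns need more than a rename.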

Usage

Use IpcReader when your data is stored in Apache Arrow IPC format (.arrow or .ipc files). This is common in data engineering workflows that use Arrow for efficient columnar data interchange, or when data has been serialized from Arrow-based frameworks such as Polars or PyArrow.

Code Reference

Source Location

Signature

class IpcReader(BaseDiskReader):
    name = "🪶 Ipc"
    _requires_dependencies = ["pyarrow"]

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        limit: int = -1,
        skip: int = 0,
        stream: bool = False,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )

Import

from datatrove.pipeline.readers.ipc import IpcReader

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | DataFolderLike | Yes | Path or filesystem object pointing to the folder containing IPC files
paths_file | DataFileLike or None | No | Optional file listing specific paths to read, one per line (default: None)
limit | int | No | Maximum number of documents to read; -1 means no limit (default: -1)
skip | int | No | Number of initial rows to skip (default: 0)
stream | bool | No | If True, reads the IPC stream format instead of the file format (default: False)
file_progress | bool | No | Whether to show a progress bar for files (default: False)
doc_progress | bool | No | Whether to show a progress bar for documents (default: False)
adapter | Callable | No | Custom function to transform raw data dicts into Document-compatible dicts (default: None)
text_key | str | No | Column name containing document text (default: "text")
id_key | str | No | Column name containing document ID (default: "id")
default_metadata | dict | No | Default metadata added to all documents (default: None)
recursive | bool | No | Whether to search for files recursively (default: True)
glob_pattern | str or None | No | Glob pattern to filter which files are included (default: None)
shuffle_files | bool | No | Whether to shuffle files within the returned shard (default: False)
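The interaction between skip and limit can be sketched with the standard library. This is an assumption-labelled illustration of the semantics described in the table (skip the first rows, then stop after limit documents, with -1 meaning no limit). The helper name apply_skip_and_limit is hypothetical, and this page does not state whether the real reader counts per file, per shard, or per task.

```python
from itertools import islice
from typing import Iterable, Iterator


def apply_skip_and_limit(rows: Iterable, skip: int = 0, limit: int = -1) -> Iterator:
    """Drop the first `skip` rows, then yield at most `limit` rows (-1 = unlimited)."""
    stop = None if limit == -1 else skip + limit
    return islice(rows, skip, stop)


window = list(apply_skip_and_limit(range(10), skip=2, limit=3))
everything = list(apply_skip_and_limit(range(5)))
```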

Outputs

Name | Type | Description
documents | Generator[Document] | Yields Document objects extracted from Arrow IPC record batches

Usage Examples

Basic Usage

from datatrove.pipeline.readers.ipc import IpcReader

# Read Arrow IPC files from a local directory
reader = IpcReader(
    data_folder="path/to/arrow/files",
    text_key="text",
    id_key="id",
)

Stream Mode

from datatrove.pipeline.readers.ipc import IpcReader

# Read IPC stream files
reader = IpcReader(
    data_folder="path/to/stream/files",
    stream=True,
    glob_pattern="*.arrows",
)
