Principle: Apache Paimon Multi-Format Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Columnar_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for reading Paimon table data into multiple output formats (PyArrow, pandas, Ray, DuckDB) from a single scan plan.
Description
Paimon's read pipeline supports materializing data into multiple formats from the same set of splits. TableRead provides to_arrow(), to_pandas(), to_ray(), to_arrow_batch_reader(), and to_duckdb() methods. For Lance-format tables, FormatLanceReader handles the Lance-specific file reading, while the TableRead adapter converts the result to the requested output format. This flexibility lets users choose the best format for their downstream processing needs.
The read pipeline follows a layered architecture:
- Scan planning: The scan produces a list of Split objects describing which data files to read
- Format-specific reading: FormatLanceReader (or other format readers) reads the raw data files into Arrow RecordBatches
- Format conversion: TableRead converts the Arrow RecordBatches into the requested output format
Usage
Use when data needs to be consumed in different formats for different processing stages. Common scenarios include:
- pandas for exploration: Interactive data analysis and visualization in notebooks
- PyArrow for interop: Passing data to other Arrow-compatible systems (DuckDB, Polars)
- Ray for distributed processing: Large-scale parallel analytics and ML training
- Arrow batch reader for streaming: Processing data in a streaming fashion without loading all into memory
Theoretical Basis
The adapter pattern allows a single read implementation to produce multiple output representations. The internal Arrow RecordBatch format serves as the universal intermediate representation.
Apache Arrow's columnar in-memory format provides a zero-copy interoperability layer between different processing engines. By using Arrow as the canonical intermediate format, Paimon avoids redundant serialization and deserialization when converting between output formats:
- to_pandas() delegates to Arrow's built-in to_pandas() conversion
- to_ray() wraps Arrow data in Ray's distributed dataset abstraction
- to_duckdb() leverages DuckDB's native Arrow integration for zero-copy queries
This design follows the universal intermediate representation pattern, where a single well-defined format serves as the bridge between multiple systems.