Principle: Apache Paimon Multi-Format Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Columnar_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for reading Paimon table data into multiple output formats (PyArrow, pandas, Ray, DuckDB) from a single scan plan.
Description
Paimon's read pipeline supports materializing data into multiple formats from the same set of splits. TableRead provides to_arrow(), to_pandas(), to_ray(), to_arrow_batch_reader(), and to_duckdb() methods. For Lance-format tables, FormatLanceReader handles the Lance-specific file reading, while the TableRead adapter converts the result to the requested output format. This flexibility lets users choose the best format for their downstream processing needs.
The read pipeline follows a layered architecture:
- Scan planning: The scan produces a list of Split objects describing which data files to read
- Format-specific reading: FormatLanceReader (or other format readers) reads the raw data files into Arrow RecordBatches
- Format conversion: TableRead converts the Arrow RecordBatches into the requested output format
Usage
Use when data needs to be consumed in different formats for different processing stages. Common scenarios include:
- pandas for exploration: Interactive data analysis and visualization in notebooks
- PyArrow for interop: Passing data to other Arrow-compatible systems (DuckDB, Polars)
- Ray for distributed processing: Large-scale parallel analytics and ML training
- Arrow batch reader for streaming: Processing data in a streaming fashion without loading all into memory
Theoretical Basis
The adapter pattern allows a single read implementation to produce multiple output representations. The internal Arrow RecordBatch format serves as the universal intermediate representation.
Apache Arrow's columnar in-memory format provides a zero-copy interoperability layer between different processing engines. By using Arrow as the canonical intermediate format, Paimon avoids redundant serialization and deserialization when converting between output formats:
- to_pandas() delegates to Arrow's built-in to_pandas() conversion
- to_ray() wraps Arrow data in Ray's distributed dataset abstraction
- to_duckdb() leverages DuckDB's native Arrow integration for zero-copy queries
This design follows the universal intermediate representation pattern, where a single well-defined format serves as the bridge between multiple systems.