Principle: Apache Paimon Data Retrieval
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Lake, Table_Format |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for reading planned splits into in-memory data structures such as PyArrow Tables and pandas DataFrames.
Description
Data retrieval is the final step in the read pipeline where planned splits are materialized into usable data. TableRead converts splits into Arrow RecordBatches, which can be collected into PyArrow Tables, pandas DataFrames, streaming RecordBatchReaders, or row-level iterators. The read process handles data file format differences (Avro, Parquet, ORC, Lance), applies deletion vectors for merge-on-read semantics, and supports data evolution (schema changes across snapshots).
Each split is processed independently, meaning the read operation is inherently parallelizable. The choice of output format determines the memory characteristics of the read: to_arrow() and to_pandas() materialize all data into memory at once, while to_arrow_batch_reader() and to_iterator() provide streaming access suitable for large datasets that do not fit in memory.
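The memory trade-off can be illustrated with a plain-Python analogy (the `batches` generator below is a hypothetical stand-in for Arrow RecordBatches, not a Paimon API): materializing pulls every batch into one in-memory structure, while streaming holds only one batch at a time.

```python
from typing import Iterator, List

# Hypothetical stand-in for a stream of Arrow RecordBatches:
# each "batch" is a small list of rows.
def batches() -> Iterator[List[dict]]:
    for i in range(3):
        yield [{"id": i * 2}, {"id": i * 2 + 1}]

# to_arrow()/to_pandas() analogy: collect all batches, then combine.
def materialize() -> List[dict]:
    rows: List[dict] = []
    for batch in batches():
        rows.extend(batch)  # peak memory ~ whole dataset
    return rows

# to_arrow_batch_reader()/to_iterator() analogy: one batch resident at a time.
def stream() -> Iterator[dict]:
    for batch in batches():
        yield from batch    # peak memory ~ one batch

print(len(materialize()))        # → 6 (all rows resident at once)
print(sum(1 for _ in stream()))  # → 6 (same rows, visited incrementally)
```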
Usage
Use this principle after scan planning to retrieve actual data. Choose the output format based on downstream processing needs: PyArrow for in-memory analytics, pandas for data science workflows, RecordBatchReader for streaming large datasets, and iterator for row-level processing. The typical workflow involves: (1) obtaining splits from scan planning, (2) creating a TableRead from the ReadBuilder, and (3) calling the appropriate conversion method.
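The three-step workflow above can be sketched as follows. The import path, the `CatalogFactory` entry point, and the builder method names (`new_read_builder`, `new_scan`, `plan`, `splits`, `new_read`) are assumptions based on Paimon's usual API conventions; check them against your installed pypaimon version.

```python
def read_table_to_pandas(warehouse: str, table_name: str):
    """Sketch of the plan-then-read workflow; assumes the pypaimon
    package and a CatalogFactory-style entry point are available."""
    # Hypothetical import path; adjust to your pypaimon install.
    from pypaimon import CatalogFactory

    catalog = CatalogFactory.create({"warehouse": warehouse})
    table = catalog.get_table(table_name)
    read_builder = table.new_read_builder()

    # (1) obtain splits from scan planning
    splits = read_builder.new_scan().plan().splits()

    # (2) create a TableRead from the ReadBuilder
    table_read = read_builder.new_read()

    # (3) call the conversion method matching the downstream need:
    # to_arrow / to_pandas / to_arrow_batch_reader / to_iterator
    return table_read.to_pandas(splits)
```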
Theoretical Basis
Implements materialization of logical scan plans into physical data. Key concepts include:
- Format abstraction: The read layer abstracts over multiple underlying file formats (Parquet, ORC, Avro, Lance), presenting a uniform Arrow-based interface to consumers.
- Multiple output adapters: Adapter methods (to_arrow(), to_pandas(), to_arrow_batch_reader(), to_iterator()) convert the internal representation to the format most suitable for the downstream consumer.
- Split-level independence: Each split can be read independently, enabling parallel processing. Results from individual splits are concatenated into the final output.
- Merge-on-read: For tables with primary keys, the read process applies deletion vectors and merges multiple versions of the same row to present a consistent, deduplicated view.
- Schema evolution: The read process handles schema differences across snapshots, applying column additions, removals, and type promotions transparently.
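Split-level independence can be demonstrated without Paimon itself. In this sketch, the `SPLITS` list and `read_split` function are hypothetical stand-ins: each split is read on its own worker, and the per-split results are concatenated in split order, mirroring how per-split Arrow batches are combined into the final table.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical splits: each carries the row ids its data files contain.
SPLITS = [range(0, 3), range(3, 6), range(6, 9)]

def read_split(split: range) -> List[int]:
    # Stand-in for reading one split's data files into rows.
    return list(split)

# Splits share no state, so reads can run in parallel;
# pool.map preserves split order in the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_split = list(pool.map(read_split, SPLITS))

# Concatenate per-split results into the final output.
rows = [row for chunk in per_split for row in chunk]
print(rows)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8]
```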