Implementation:Apache Paimon TableRead To Pandas Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for reading Paimon table data as pandas DataFrame for post-ingestion verification.
Description
Uses the standard Paimon read pipeline: table.new_read_builder() creates a read builder, new_scan() creates a scanner, plan() generates a scan plan, and splits() returns the list of data splits. The new_read().to_pandas(splits) call then reads all splits and returns a single pandas DataFrame.
In the verification context, this confirms that the Ray-written data is correctly visible after commit. The to_pandas() method:
- Reads each split from the file store.
- Converts the data from Paimon's internal format to Arrow record batches.
- Concatenates all record batches into a single Arrow Table.
- Converts the Arrow Table to a pandas DataFrame using to_pandas().
The resulting DataFrame can be inspected for row counts, column types, and data values to validate the ingestion.
Usage
Use as the final step in a Ray-based Paimon ingestion pipeline to verify that data was correctly written and committed. Call after write_ray() completes successfully.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/read/table_read.py:L119-121
Signature
class TableRead:
def to_pandas(self, splits: List[Split]) -> pandas.DataFrame:
Import
from pypaimon.read.table_read import TableRead
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| splits | List[Split] | Yes | List of data splits obtained from scan.plan().splits(). Each split represents a range of data in the file store to read. |
Outputs
| Name | Type | Description |
|---|---|---|
| df | pandas.DataFrame | A pandas DataFrame containing all rows from the specified splits. Columns correspond to the Paimon table's schema fields, with types converted from Paimon/Arrow to pandas types. |
Usage Examples
Basic Usage
# Verify ingestion
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
plan = scan.plan()
splits = plan.splits()
reader = read_builder.new_read()
df = reader.to_pandas(splits)
print(f"Verified: {len(df)} rows ingested")
print(df.head())
print(df.dtypes)
Comprehensive Verification
# Full verification with assertions
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
df = reader.to_pandas(splits)
# Validate row count
expected_rows = dataset.count()
actual_rows = len(df)
assert actual_rows == expected_rows, (
f"Row count mismatch: expected {expected_rows}, got {actual_rows}"
)
# Validate schema
expected_columns = ['user_id', 'event', 'timestamp', 'value']
assert list(df.columns) == expected_columns, (
f"Column mismatch: expected {expected_columns}, got {list(df.columns)}"
)
# Validate no null primary keys
assert df['user_id'].notna().all(), "Found null values in primary key column"
print(f"Verification passed: {actual_rows} rows, {len(df.columns)} columns")