Principle: Apache Paimon Distributed Dataset Creation
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for converting Paimon table splits into distributed Ray Datasets for parallel processing.
Description
Distributed dataset creation bridges Paimon's scan-then-read pipeline with Ray's distributed data processing framework. After scan planning produces a list of splits, TableRead.to_ray() wraps each split as a Ray read task via RayDatasource. Ray then schedules these tasks across available workers, creating a distributed Dataset that can be processed in parallel. The RayDatasource adapter handles split distribution, Arrow schema propagation, and block management to ensure efficient parallelism.
The pipeline consists of:
- Scan planning - produces a list of splits describing the data to read
- Datasource creation - wraps splits into a RayDatasource adapter
- Task scheduling - Ray distributes read tasks across workers
- Block assembly - each worker produces Arrow RecordBatches that form Dataset blocks
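The four stages above can be sketched in plain Python. This is a toy simulation, not the pypaimon or Ray API: a thread pool stands in for Ray's worker scheduling, and lists of dicts stand in for Arrow RecordBatches; all function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for scan planning: each split describes a slice of the table.
def plan_splits(rows, split_size):
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]

# Stand-in for one read task: one split -> one block
# (a list of dicts in place of an Arrow RecordBatch).
def read_split(split):
    return [{"id": r, "value": r * 10} for r in split]

rows = list(range(10))
splits = plan_splits(rows, split_size=4)           # 1. scan planning
with ThreadPoolExecutor(max_workers=4) as pool:    # 3. task scheduling
    blocks = list(pool.map(read_split, splits))    # 4. block assembly
# Step 2 (datasource creation) corresponds to handing `splits` to the pool.
```

Because each split is read by an independent task, the blocks can be produced in any order by any worker, which is exactly what lets Ray parallelize the read.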
Usage
Use this principle when processing large Paimon tables that benefit from distributed execution. The resulting Ray Dataset supports map, filter, groupby, and other parallel transformations.
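Those parallel transformations all follow the same pattern: each operator runs per block, independently, and only aggregations reduce across blocks. A stdlib sketch of that pattern (lists of dicts again standing in for Arrow blocks; not the Ray Dataset API):

```python
from concurrent.futures import ThreadPoolExecutor

# Blocks as produced by the read stage.
blocks = [
    [{"user": "a", "amount": 5}, {"user": "b", "amount": 7}],
    [{"user": "a", "amount": 3}, {"user": "c", "amount": 9}],
]

# Map + filter applied per block, in parallel, mirroring how a
# distributed Dataset transforms data block by block.
def transform(block):
    return [{**row, "amount": row["amount"] * 2}
            for row in block if row["amount"] >= 5]

with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(transform, blocks))

# A groupby-style aggregation then reduces across all blocks.
totals = {}
for block in transformed:
    for row in block:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
```

The map and filter stages never communicate between blocks; only the final aggregation needs to see results from every worker.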
Theoretical Basis
Follows the data-parallel processing model where independent data partitions (splits) are processed concurrently across workers. The Datasource abstraction in Ray Data provides a pluggable interface for custom data sources. Paimon's split-based architecture maps naturally to Ray's block-based parallel execution model.
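The pluggable-datasource idea can be modeled as a small interface: a datasource turns its splits into zero-argument read tasks, each producing one block. This is a simplified toy loosely patterned on the concept, not Ray Data's actual Datasource/ReadTask classes:

```python
from abc import ABC, abstractmethod

# Toy model of a pluggable datasource interface (illustrative only).
class Datasource(ABC):
    @abstractmethod
    def get_read_tasks(self):
        """Return zero-argument callables, one per split; each yields a block."""

class SplitDatasource(Datasource):
    def __init__(self, splits):
        self.splits = splits

    def get_read_tasks(self):
        # Each task closes over exactly one split, so tasks are
        # independent by construction and can run on any worker.
        return [lambda s=s: [{"value": v} for v in s] for s in self.splits]

src = SplitDatasource([[1, 2], [3]])
tasks = src.get_read_tasks()
blocks = [task() for task in tasks]
```

Because the executor only sees opaque callables, any split-based source, Paimon included, can plug into the same block-parallel execution model.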
Key properties of the data-parallel model:
- Independence - each split can be read without coordination with other splits
- Locality - the Ray scheduler can co-locate read tasks with their data when possible
- Scalability - adding workers increases aggregate read throughput, roughly linearly until storage bandwidth saturates
- Fault tolerance - failed read tasks can be retried independently
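The fault-tolerance property follows directly from independence: a failed read task can be re-run on its own without touching any other split. A minimal sketch of per-task retry (all names illustrative; Ray performs this retry internally):

```python
# Attempt counter per split index, to simulate one transient failure.
attempts = {0: 0, 1: 0}

def flaky_read(idx, split):
    attempts[idx] += 1
    if idx == 0 and attempts[idx] == 1:   # split 0 fails on its first try
        raise IOError("transient read failure")
    return [{"value": v} for v in split]

def read_with_retry(idx, split, max_attempts=3):
    # Retry only this split; other splits are unaffected.
    for _ in range(max_attempts):
        try:
            return flaky_read(idx, split)
        except IOError:
            continue
    raise RuntimeError(f"split {idx} failed after {max_attempts} attempts")

splits = [[1, 2], [3, 4]]
blocks = [read_with_retry(i, s) for i, s in enumerate(splits)]
```

Note that split 0 is read twice while split 1 is read exactly once: the retry is scoped to the failing task, not the whole job.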