Principle:Apache Paimon Split Based Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Reading, Query_Execution |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Split-based reading is a parallelizable data access pattern that divides large datasets into independent units called splits, which can be processed concurrently by multiple readers.
Description
The split-based reading principle addresses the challenge of efficiently processing large datasets by decomposing them into smaller, independently processable units. Rather than reading an entire table sequentially, the system first generates a plan that divides the data into splits based on logical boundaries such as file boundaries, partition boundaries, or byte ranges. Each split represents a self-contained unit of work that can be assigned to a separate processing thread or worker.
This approach enables horizontal scalability in query execution. The split generation phase analyzes the table structure and produces a collection of splits that collectively cover all relevant data. The planning phase may apply filters, projections, and other optimizations to eliminate unnecessary splits or reduce the data read from each split. Different table types require different split generation strategies: append-only tables can generate splits directly from data files, while tables with update semantics may need to merge multiple versions of data during split processing.
The merge and read execution phase processes splits according to their characteristics. Simple splits may be read directly from storage, while complex scenarios involve merging data from multiple files, applying deletion vectors, or concatenating results from multiple sources. Sort-merge readers handle ordered data streams, while batch readers optimize for throughput. Format-specific readers handle different storage formats, ensuring that the split abstraction works uniformly across diverse storage layouts.
Usage
Apply split-based reading when implementing parallel query execution engines, distributed table scans, or any system that needs to process large datasets with multiple concurrent workers. This pattern is essential when you need to balance workload across processing resources or when data is naturally partitioned into logical units.
Theoretical Basis
The split-based reading pattern follows a three-phase approach:
Phase 1: Split Generation
- Analyze table structure and identify logical boundaries
- Create split descriptors containing metadata about data location and boundaries
- Each split descriptor includes: file paths, byte ranges, partition information, schema version
Phase 2: Split Planning
- Apply predicate pushdown to filter splits based on partition values
- Apply projection pushdown to eliminate unnecessary column reads
- Estimate split sizes and assign priorities for scheduling
- Generate optimized execution plan for each split
Phase 3: Split Execution
- Instantiate appropriate reader for each split type
- Apply format-specific decoding and deserialization
- Merge multiple data sources if split requires combining data
- Stream results to downstream operators
Pseudo-code for split-based reading:
function executeSplitBasedQuery(table, filter, projection):
splits = generateSplits(table)
filteredSplits = applySplitFilters(splits, filter)
results = parallelMap(filteredSplits, split => {
reader = createReaderForSplit(split, projection)
return reader.readAll()
})
return mergeResults(results)
function generateSplits(table):
if table.isAppendOnly():
return createFileLevelSplits(table.dataFiles())
else if table.hasPrimaryKey():
return createMergingSplits(table.dataFiles(), table.changelogFiles())
else:
return createCustomSplits(table.specificLogic())
Related Pages
- Implementation:Apache_Paimon_SplitRead
- Implementation:Apache_Paimon_SplitGenerator
- Implementation:Apache_Paimon_AppendTableSplitGenerator
- Implementation:Apache_Paimon_PrimaryKeyTableSplitGenerator
- Implementation:Apache_Paimon_DataEvolutionSplitGenerator
- Implementation:Apache_Paimon_SlicedSplit
- Implementation:Apache_Paimon_IntervalPartition
- Implementation:Apache_Paimon_PushDownUtils
- Implementation:Apache_Paimon_ConcatBatchReader
- Implementation:Apache_Paimon_DataFileBatchReader
- Implementation:Apache_Paimon_FieldBunch
- Implementation:Apache_Paimon_SortMergeReader
- Implementation:Apache_Paimon_FormatAvroReader
- Implementation:Apache_Paimon_FormatPyarrowReader