Principle:Apache Paimon Split Based Reading

Knowledge Sources	Apache_Paimon
Domains	Data_Reading, Query_Execution
Last Updated	2026-02-08 00:00 GMT

Overview

Split-based reading is a parallelizable data access pattern that divides large datasets into independent units called splits, which can be processed concurrently by multiple readers.

Description

The split-based reading principle addresses the challenge of efficiently processing large datasets by decomposing them into smaller, independently processable units. Rather than reading an entire table sequentially, the system first generates a plan that divides the data into splits based on logical boundaries such as file boundaries, partition boundaries, or byte ranges. Each split represents a self-contained unit of work that can be assigned to a separate processing thread or worker.

This approach enables horizontal scalability in query execution. The split generation phase analyzes the table structure and produces a collection of splits that collectively cover all relevant data. The planning phase may apply filters, projections, and other optimizations to eliminate unnecessary splits or reduce the data read from each split. Different table types require different split generation strategies: append-only tables can generate splits directly from data files, while tables with update semantics may need to merge multiple versions of data during split processing.

The merge and read execution phase processes splits according to their characteristics. Simple splits may be read directly from storage, while complex scenarios involve merging data from multiple files, applying deletion vectors, or concatenating results from multiple sources. Sort-merge readers handle ordered data streams, while batch readers optimize for throughput. Format-specific readers handle different storage formats, ensuring that the split abstraction works uniformly across diverse storage layouts.

Usage

Apply split-based reading when implementing parallel query execution engines, distributed table scans, or any system that needs to process large datasets with multiple concurrent workers. This pattern is essential when you need to balance workload across processing resources or when data is naturally partitioned into logical units.

Theoretical Basis

The split-based reading pattern follows a three-phase approach:

Phase 1: Split Generation

Analyze table structure and identify logical boundaries
Create split descriptors containing metadata about data location and boundaries
Each split descriptor includes: file paths, byte ranges, partition information, schema version

Phase 2: Split Planning

Apply predicate pushdown to filter splits based on partition values
Apply projection pushdown to eliminate unnecessary column reads
Estimate split sizes and assign priorities for scheduling
Generate optimized execution plan for each split

Phase 3: Split Execution

Instantiate appropriate reader for each split type
Apply format-specific decoding and deserialization
Merge multiple data sources if split requires combining data
Stream results to downstream operators

Pseudo-code for split-based reading:

function executeSplitBasedQuery(table, filter, projection):
    splits = generateSplits(table)
    filteredSplits = applySplitFilters(splits, filter)

    results = parallelMap(filteredSplits, split => {
        reader = createReaderForSplit(split, projection)
        return reader.readAll()
    })

    return mergeResults(results)

function generateSplits(table):
    if table.isAppendOnly():
        return createFileLevelSplits(table.dataFiles())
    else if table.hasPrimaryKey():
        return createMergingSplits(table.dataFiles(), table.changelogFiles())
    else:
        return createCustomSplits(table.specificLogic())

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment