Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Apache Paimon Split Based Reading

From Leeroopedia


Knowledge Sources
Domains Data_Reading, Query_Execution
Last Updated 2026-02-08 00:00 GMT

Overview

Split-based reading is a parallelizable data access pattern that divides large datasets into independent units called splits, which can be processed concurrently by multiple readers.

Description

The split-based reading principle addresses the challenge of efficiently processing large datasets by decomposing them into smaller, independently processable units. Rather than reading an entire table sequentially, the system first generates a plan that divides the data into splits based on logical boundaries such as file boundaries, partition boundaries, or byte ranges. Each split represents a self-contained unit of work that can be assigned to a separate processing thread or worker.

This approach enables horizontal scalability in query execution. The split generation phase analyzes the table structure and produces a collection of splits that collectively cover all relevant data. The planning phase may apply filters, projections, and other optimizations to eliminate unnecessary splits or reduce the data read from each split. Different table types require different split generation strategies: append-only tables can generate splits directly from data files, while tables with update semantics may need to merge multiple versions of data during split processing.

The merge and read execution phase processes splits according to their characteristics. Simple splits may be read directly from storage, while complex scenarios involve merging data from multiple files, applying deletion vectors, or concatenating results from multiple sources. Sort-merge readers handle ordered data streams, while batch readers optimize for throughput. Format-specific readers handle different storage formats, ensuring that the split abstraction works uniformly across diverse storage layouts.

Usage

Apply split-based reading when implementing parallel query execution engines, distributed table scans, or any system that needs to process large datasets with multiple concurrent workers. This pattern is essential when you need to balance workload across processing resources or when data is naturally partitioned into logical units.

Theoretical Basis

The split-based reading pattern follows a three-phase approach:

Phase 1: Split Generation

  • Analyze table structure and identify logical boundaries
  • Create split descriptors containing metadata about data location and boundaries
  • Each split descriptor includes: file paths, byte ranges, partition information, schema version

Phase 2: Split Planning

  • Apply predicate pushdown to filter splits based on partition values
  • Apply projection pushdown to eliminate unnecessary column reads
  • Estimate split sizes and assign priorities for scheduling
  • Generate optimized execution plan for each split

Phase 3: Split Execution

  • Instantiate appropriate reader for each split type
  • Apply format-specific decoding and deserialization
  • Merge multiple data sources if split requires combining data
  • Stream results to downstream operators

Pseudo-code for split-based reading:

function executeSplitBasedQuery(table, filter, projection):
    splits = generateSplits(table)
    filteredSplits = applySplitFilters(splits, filter)

    results = parallelMap(filteredSplits, split => {
        reader = createReaderForSplit(split, projection)
        return reader.readAll()
    })

    return mergeResults(results)

function generateSplits(table):
    if table.isAppendOnly():
        return createFileLevelSplits(table.dataFiles())
    else if table.hasPrimaryKey():
        return createMergingSplits(table.dataFiles(), table.changelogFiles())
    else:
        return createCustomSplits(table.specificLogic())

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment