Principle:Apache Paimon Scan Planning

From Leeroopedia


Knowledge Sources
Domains: Data_Lake, Table_Format
Last Updated: 2026-02-07 00:00 GMT

Overview

Scan planning is the mechanism that prepares a table read: it discovers data files, applies predicate pushdown, and generates splits for parallel reading.

Description

Scan planning is the bridge between a read request and actual data retrieval. It discovers relevant data files from the latest snapshot's manifests, applies predicate filters to prune partitions and files, and generates a list of Split objects that represent units of work for parallel reading. The ReadBuilder provides a fluent API for configuring filters, projections, and limits before creating a scan. PredicateBuilder constructs type-safe filter predicates that can be pushed down to the file level.

The planning phase is intentionally separated from the reading phase to allow the plan to be serialized, distributed, or inspected before execution. Each Split in the resulting plan is self-contained and can be read independently, enabling parallel and distributed processing.
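The separation described above can be sketched with a toy model. This is not the Paimon API: `ToySplit`, `STORAGE`, and every function name below are hypothetical, and JSON stands in for whatever serialization a real engine uses. The sketch only illustrates that a self-contained split can be serialized, handed to any worker, and read without reference to the rest of the plan.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from typing import List, Tuple

# Hypothetical in-memory "storage": file name -> rows in that file.
STORAGE = {
    "part-0": list(range(0, 10)),
    "part-1": list(range(10, 20)),
}


@dataclass(frozen=True)
class ToySplit:
    """Hypothetical stand-in for a Split: all a reader needs to fetch its rows."""
    file_name: str
    row_start: int
    row_end: int


def plan_splits(rows_per_split: int) -> List[ToySplit]:
    """Planning phase: decide what to read without touching any row data."""
    splits = []
    for name in sorted(STORAGE):
        total = len(STORAGE[name])
        for start in range(0, total, rows_per_split):
            splits.append(ToySplit(name, start, min(start + rows_per_split, total)))
    return splits


def read_split(split: ToySplit) -> List[int]:
    """Reading phase: each split is read independently of the others."""
    return STORAGE[split.file_name][split.row_start:split.row_end]


def read_serialized(payload: str) -> List[int]:
    """What a remote worker would do: deserialize one split, then read it."""
    return read_split(ToySplit(**json.loads(payload)))


def plan_ship_and_read(rows_per_split: int = 5) -> Tuple[List[ToySplit], List[int]]:
    plan = plan_splits(rows_per_split)
    # The plan can be inspected or serialized before any read happens.
    payloads = [json.dumps(asdict(s)) for s in plan]
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = list(pool.map(read_serialized, payloads))  # map keeps plan order
    return plan, [row for chunk in chunks for row in chunk]
```

Because each payload carries everything a reader needs, the thread pool here could be replaced by separate processes or machines without changing the planning code.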

Usage

Apply scan planning before reading data so that the read touches only what the query needs. Predicate pushdown and column projection reduce I/O by skipping irrelevant partitions, files, and columns. The typical workflow is: (1) obtain a ReadBuilder from the table, (2) optionally configure filters (built with PredicateBuilder) and projections, (3) create a TableScan, and (4) call plan() to generate the list of splits.
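The four workflow steps can be walked through with a small self-contained model. Everything here is a hypothetical stand-in (`ToyReadBuilder`, `ToyScan`, `ToyRead`, and their method names and signatures are assumptions for illustration, not confirmed Paimon signatures), and the toy plan emits one split per file without any pruning.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class ToySplit:
    """Hypothetical unit of work: one whole file."""
    file_name: str


@dataclass(frozen=True)
class ToyReadBuilder:
    """Hypothetical fluent builder; each with_* call returns a new builder."""
    files: Dict[str, List[Row]]  # toy table: file name -> rows
    row_filter: Optional[Callable[[Row], bool]] = None
    columns: Optional[Tuple[str, ...]] = None

    def with_filter(self, row_filter):
        return replace(self, row_filter=row_filter)

    def with_projection(self, columns):
        return replace(self, columns=tuple(columns))

    def new_scan(self):
        return ToyScan(self)

    def new_read(self):
        return ToyRead(self)


@dataclass(frozen=True)
class ToyScan:
    builder: ToyReadBuilder

    def plan(self) -> List[ToySplit]:
        # Toy planning: one split per file, in a stable order, no pruning.
        return [ToySplit(name) for name in sorted(self.builder.files)]


@dataclass(frozen=True)
class ToyRead:
    builder: ToyReadBuilder

    def read(self, split: ToySplit) -> List[Row]:
        b = self.builder
        out = []
        for row in b.files[split.file_name]:
            if b.row_filter is not None and not b.row_filter(row):
                continue  # row rejected by the configured filter
            out.append(row if b.columns is None else {c: row[c] for c in b.columns})
        return out


def scan_and_read(files, row_filter=None, columns=None) -> List[Row]:
    builder = ToyReadBuilder(files)                  # (1) obtain a builder
    if row_filter is not None:
        builder = builder.with_filter(row_filter)    # (2) configure the filter
    if columns is not None:
        builder = builder.with_projection(columns)   # (2) configure the projection
    splits = builder.new_scan().plan()               # (3) create scan, (4) plan
    reader = builder.new_read()
    return [row for split in splits for row in reader.read(split)]
```

Making the builder immutable is a design choice of this sketch: a configured builder can be reused to create both the scan and the reader without one affecting the other.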

Theoretical Basis

Scan planning implements the scan-then-read pattern common in data lake formats. Key optimization techniques include:

  • Predicate pushdown: Filter predicates are pushed down to the manifest level, leveraging file-level statistics (min/max values, null counts) to skip entire data files that cannot contain matching rows.
  • Partition pruning: Partition key predicates eliminate entire partition directories from the scan, avoiding manifest and file reads for irrelevant partitions.
  • Column projection: Only the requested columns are read from columnar file formats (Parquet, ORC), reducing I/O bandwidth and memory usage.
  • Split-based parallelism: The Plan contains a list of Split objects that can be distributed across parallel readers, enabling scalable read throughput.
  • Predicate composition: PredicateBuilder supports and_predicates() and or_predicates() for composing complex filter expressions from simple comparisons.
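To make the pruning techniques above concrete, here is a minimal sketch of skipping files from metadata alone. It is a toy model under stated assumptions: `FileMeta`, `SAMPLE_FILES`, and the helper functions are hypothetical. The names `and_predicates` and `or_predicates` are borrowed from the text, but their signatures here are invented for this example and are not the PredicateBuilder signatures. Each predicate answers only "could this file contain a matching row?", so pruning is conservative: a kept file may still hold no matches, but a skipped file cannot hold any.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical manifest entry: path, partition value, per-column min/max."""
    path: str
    partition: str
    min_max: Dict[str, Tuple[int, int]]  # column -> (min, max)


# A predicate evaluated on metadata only: True means "may contain matches".
MayMatch = Callable[[FileMeta], bool]


def greater_than(column: str, value: int) -> MayMatch:
    # Some row can exceed `value` only if the file's max does.
    return lambda f: f.min_max[column][1] > value


def less_than(column: str, value: int) -> MayMatch:
    # Some row can be below `value` only if the file's min is.
    return lambda f: f.min_max[column][0] < value


def partition_equals(value: str) -> MayMatch:
    # Partition pruning: the whole partition is kept or dropped.
    return lambda f: f.partition == value


def and_predicates(predicates: List[MayMatch]) -> MayMatch:
    return lambda f: all(p(f) for p in predicates)


def or_predicates(predicates: List[MayMatch]) -> MayMatch:
    return lambda f: any(p(f) for p in predicates)


def prune(files: List[FileMeta], predicate: MayMatch) -> List[str]:
    """Return the paths of files that survive metadata-based pruning."""
    return [f.path for f in files if predicate(f)]


SAMPLE_FILES = [
    FileMeta("dt=01/a", "01", {"amount": (0, 50)}),
    FileMeta("dt=01/b", "01", {"amount": (60, 200)}),
    FileMeta("dt=02/c", "02", {"amount": (150, 300)}),
]
```

For example, `prune(SAMPLE_FILES, and_predicates([partition_equals("01"), greater_than("amount", 100)]))` keeps only `dt=01/b`: file `a` is skipped because its maximum is 50, and file `c` is skipped because it lies in another partition.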

Related Pages

Implemented By
