Principle:Apache Paimon Scan Planning
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Table_Format |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for planning table reads by discovering data files, applying predicate pushdown, and generating splits for parallel reading.
Description
Scan planning is the bridge between a read request and actual data retrieval. It discovers relevant data files from the latest snapshot's manifests, applies predicate filters to prune partitions and files, and generates a list of Split objects that represent units of work for parallel reading. The ReadBuilder provides a fluent API for configuring filters, projections, and limits before creating a scan. PredicateBuilder constructs type-safe filter predicates that can be pushed down to the file level.
The planning phase is intentionally separated from the reading phase to allow the plan to be serialized, distributed, or inspected before execution. Each Split in the resulting plan is self-contained and can be read independently, enabling parallel and distributed processing.
Usage
Use this principle before reading data to optimize query performance. Predicate pushdown and column projection reduce I/O by skipping irrelevant partitions, files, and columns. The typical workflow involves: (1) obtaining a ReadBuilder from the table, (2) optionally configuring filters via PredicateBuilder and projections, (3) creating a TableScan, and (4) calling plan() to generate the list of splits.
Theoretical Basis
Implements the scan-then-read pattern common in data lake formats. Key optimization techniques include:
- Predicate pushdown: Filter predicates are pushed down to the manifest level, leveraging file-level statistics (min/max values, null counts) to skip entire data files that cannot contain matching rows.
- Partition pruning: Partition key predicates eliminate entire partition directories from the scan, avoiding manifest and file reads for irrelevant partitions.
- Column projection: Only the requested columns are read from columnar file formats (Parquet, ORC), reducing I/O bandwidth and memory usage.
- Split-based parallelism: The
Plancontains a list ofSplitobjects that can be distributed across parallel readers, enabling scalable read throughput. - Predicate composition:
PredicateBuildersupportsand_predicates()andor_predicates()for composing complex filter expressions from simple comparisons.