Principle:Apache Paimon Lance Predicate Pushdown

Knowledge Sources	Apache Paimon
Domains	Data_Lake, Columnar_Storage
Last Updated	2026-02-07 00:00 GMT

Overview

A mechanism that pushes filter predicates down to the Lance file reader, so that non-matching data is skipped during the scan instead of being filtered after it has been read.

Description

Predicate pushdown on Lance files allows the Lance reader to skip data that does not match the filter conditions without reading it into memory. When a predicate is configured via ReadBuilder.with_filter(), it is converted to a PyArrow dataset filter expression and passed to the FormatLanceReader. The Lance reader applies this expression while reading the file, and can draw on Lance's built-in statistics and indexing to evaluate it efficiently. Compared with filtering after the read, this reduces I/O at the storage level, because data that cannot match is never loaded.

The predicate pushdown pipeline consists of the following stages:

Predicate construction: The user builds a predicate using PredicateBuilder methods such as greater_than(), equal(), or between()
Predicate propagation: The predicate is passed through the ReadBuilder to the scan and read pipeline
Filter conversion: The predicate is converted to a PyArrow dataset filter expression compatible with Lance
Storage-level filtering: The Lance reader applies the filter during file reading, skipping non-matching data
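
The four stages above can be modeled in a few lines of plain Python. This is a minimal sketch and not Paimon's implementation: the Predicate class, the helper functions, and read_with_filter are hypothetical stand-ins, and a plain Python callable takes the place of the PyArrow dataset filter expression so that the example runs without dependencies. The between helper is assumed to be inclusive on both ends.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate node: a leaf comparison or an 'and' of children."""
    method: str            # "equal", "greater_than", "between", or "and"
    field: str = ""
    literals: tuple = ()
    children: tuple = ()


# Stage 1: predicate construction (names mirror the builder methods in the text).
def equal(field: str, value: Any) -> Predicate:
    return Predicate("equal", field, (value,))


def greater_than(field: str, value: Any) -> Predicate:
    return Predicate("greater_than", field, (value,))


def between(field: str, low: Any, high: Any) -> Predicate:
    return Predicate("between", field, (low, high))


def and_(*predicates: Predicate) -> Predicate:
    return Predicate("and", children=predicates)


# Stage 3: filter conversion. A callable stands in for the reader-side expression.
def to_filter(pred: Predicate) -> Callable[[Row], bool]:
    if pred.method == "and":
        parts = [to_filter(child) for child in pred.children]
        return lambda row: all(part(row) for part in parts)
    field, lits = pred.field, pred.literals
    if pred.method == "equal":
        return lambda row: row[field] == lits[0]
    if pred.method == "greater_than":
        return lambda row: row[field] > lits[0]
    if pred.method == "between":
        return lambda row: lits[0] <= row[field] <= lits[1]
    raise ValueError(f"unsupported predicate method: {pred.method}")


# Stages 2 and 4: the predicate travels to the reader, which filters while reading.
def read_with_filter(rows: List[Row], pred: Predicate) -> List[Row]:
    keep = to_filter(pred)
    return [row for row in rows if keep(row)]
```

Keeping the predicate as a small, engine-neutral tree until the conversion step is what lets the same user-facing predicate be translated into whatever expression form a given file format reader expects.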

Usage

Use predicate pushdown when querying Lance-format tables with known filter conditions, to minimize the data read from storage. It is most effective with:

High-selectivity filters: The predicate eliminates a large fraction of rows
Large datasets: The I/O savings are proportional to the amount of data skipped
Indexed columns: Lance can leverage column statistics and indexes for faster predicate evaluation
Chained filters: Multiple predicates can be combined for compound filtering

Theoretical Basis

Predicate pushdown is a query optimization technique from database systems. By evaluating predicates at the storage layer, unnecessary data never enters the processing pipeline, reducing I/O, memory usage, and CPU time.

The optimization follows a general principle of pushing computation closer to data. In a traditional query pipeline without pushdown:

All data is read from storage into memory
Filter predicates are evaluated on the in-memory data
Non-matching rows are discarded

With predicate pushdown:

The storage layer evaluates predicates against stored statistics (such as min/max values or bloom filters) before reading the data itself
Only data pages or row groups that may contain matching rows are read
Fine-grained filtering is applied during deserialization
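
The difference between the two pipelines can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not Lance's reader: the Chunk class, write_chunks, and both scan functions are hypothetical, a chunk stands in for a data page or row group, and only a "value > threshold" predicate with min/max statistics is modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a data page or row group with min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def write_chunks(values: List[int], chunk_size: int) -> List[Chunk]:
    # Statistics are computed once, at write time, and stored next to the data.
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_without_pushdown(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    # Every chunk is read, then rows are filtered in memory.
    rows_read = 0
    matches: List[int] = []
    for chunk in chunks:
        rows_read += len(chunk.values)
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, rows_read


def scan_with_pushdown(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    # Chunks whose max cannot satisfy "value > threshold" are skipped unread.
    rows_read = 0
    matches: List[int] = []
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue
        rows_read += len(chunk.values)
        # Fine-grained filtering still runs on the chunks that were read.
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, rows_read
```

With the values 0 to 99 written in chunks of 10 and a threshold of 84, both scans return the same 15 rows, but the pushdown scan reads 20 rows instead of 100. The statistics only rule chunks out; they never prove that a chunk matches, which is why the per-row check remains.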

Lance's columnar format supports this optimization particularly well because:

Each column stores statistics (min, max, null count) per data page
The columnar layout allows evaluating predicates on individual columns without reading other columns
Lance's indexing structures enable sub-linear predicate evaluation for indexed columns
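
To make "sub-linear predicate evaluation" concrete, the sketch below uses a generic sorted index with binary search. It is an illustration of the idea only and does not describe Lance's actual index structures; build_sorted_index and rows_between are hypothetical names, and the range predicate is assumed to be inclusive on both ends.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def build_sorted_index(column: List[int]) -> Tuple[List[int], List[int]]:
    """Return (sorted keys, row ids in the same order) for one column."""
    order = sorted(range(len(column)), key=column.__getitem__)
    keys = [column[row_id] for row_id in order]
    return keys, order


def rows_between(keys: List[int], row_ids: List[int], low: int, high: int) -> List[int]:
    """Row ids whose value lies in [low, high], located by binary search."""
    start = bisect_left(keys, low)    # first position with key >= low
    end = bisect_right(keys, high)    # first position with key > high
    return sorted(row_ids[start:end])
```

Locating the matching range costs two binary searches, O(log n), instead of a comparison against every value; only the matching row ids are then materialized. Because the index covers a single column, the other columns of the table are not touched while the predicate is evaluated.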

Related Pages

Implemented By

Implementation:Apache_Paimon_FormatLanceReader_Predicate_Pushdown
