

Principle:Apache Paimon Lance Predicate Pushdown

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Columnar_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

A mechanism that pushes filter predicates down to the Lance file reader, so that non-matching data is filtered out during the scan rather than after it.

Description

Predicate pushdown on Lance files allows the Lance reader to skip data that does not match filter conditions without reading it into memory. When a predicate is configured via ReadBuilder.with_filter(), it is converted to a PyArrow dataset filter expression and passed to the FormatLanceReader. The Lance reader uses this expression to filter data during file reading, leveraging Lance's built-in statistics and indexing for efficient predicate evaluation. This is more efficient than post-read filtering because it reduces I/O at the storage level.

The predicate pushdown pipeline consists of the following stages:

  1. Predicate construction: The user builds a predicate using PredicateBuilder methods such as greater_than(), equal(), or between()
  2. Predicate propagation: The predicate is passed through the ReadBuilder to the scan and read pipeline
  3. Filter conversion: The predicate is converted to a PyArrow dataset filter expression compatible with Lance
  4. Storage-level filtering: The Lance reader applies the filter during file reading, skipping non-matching data
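The four stages above can be sketched with a small, self-contained model. This is not the Paimon or Lance API: Leaf, ToyReadBuilder, and to_row_filter are hypothetical names invented for illustration, and plain Python callables stand in for the PyArrow filter expression. Only the method names greater_than(), equal(), and with_filter() mirror the ones mentioned in this page.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]

# Comparison operators supported by this toy model.
_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "equal": operator.eq,
    "greater_than": operator.gt,
    "less_than": operator.lt,
}


@dataclass(frozen=True)
class Leaf:
    """Hypothetical single-column predicate: <field> <op> <literal>."""
    field: str
    op: str
    literal: Any


# Stage 1: predicate construction.
def greater_than(field: str, literal: Any) -> Leaf:
    return Leaf(field, "greater_than", literal)


def equal(field: str, literal: Any) -> Leaf:
    return Leaf(field, "equal", literal)


# Stage 3: filter conversion into a form the reader can evaluate.
def to_row_filter(pred: Leaf) -> Callable[[Row], bool]:
    compare = _OPS[pred.op]

    def row_filter(row: Row) -> bool:
        value = row[pred.field]
        # Null values never satisfy a comparison in this sketch.
        return value is not None and compare(value, pred.literal)

    return row_filter


@dataclass(frozen=True)
class ToyReadBuilder:
    """Hypothetical builder that carries the predicate to the reader."""
    predicate: Optional[Leaf] = None

    # Stage 2: predicate propagation.
    def with_filter(self, pred: Leaf) -> "ToyReadBuilder":
        return ToyReadBuilder(predicate=pred)

    # Stage 4: the reader applies the filter while producing rows.
    def read(self, stored_rows: Iterable[Row]) -> Iterator[Row]:
        row_filter = None
        if self.predicate is not None:
            row_filter = to_row_filter(self.predicate)
        for row in stored_rows:
            if row_filter is None or row_filter(row):
                yield row
```

In this toy version every stored row is still visited; the point is only to show where the predicate travels. A real reader would use the converted expression to avoid fetching non-matching data in the first place.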

Usage

Use when querying Lance-format tables with known filter conditions to minimize data read from storage. Predicate pushdown is most effective when:

  • Highly selective filters: The predicate eliminates a large fraction of rows
  • Large datasets: The I/O savings are proportional to the amount of data skipped
  • Indexed columns: Lance can leverage column statistics and indexes for faster predicate evaluation
  • Chained filters: Multiple predicates can be combined for compound filtering
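To make the selectivity and chaining points concrete, the following sketch combines row filters with AND and OR and measures the fraction of rows each compound filter keeps. The helpers and_, or_, and fraction_kept are hypothetical and are not part of any Paimon or Lance API; the lower the kept fraction, the more data a pushed-down filter has a chance to skip.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
RowFilter = Callable[[Row], bool]


def and_(*filters: RowFilter) -> RowFilter:
    """Keep a row only if every filter accepts it."""
    return lambda row: all(f(row) for f in filters)


def or_(*filters: RowFilter) -> RowFilter:
    """Keep a row if at least one filter accepts it."""
    return lambda row: any(f(row) for f in filters)


def fraction_kept(rows: List[Row], row_filter: RowFilter) -> float:
    """Fraction of rows that survive the filter (0.0 for an empty input)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row_filter(row)) / len(rows)


# Illustrative data: 100 rows alternating between two regions.
rows = [{"region": "eu" if i % 2 == 0 else "us", "amount": i} for i in range(100)]


def is_eu(row: Row) -> bool:
    return row["region"] == "eu"


def is_large(row: Row) -> bool:
    return row["amount"] >= 90


narrow = and_(is_eu, is_large)   # highly selective: few rows survive
broad = or_(is_eu, is_large)     # weakly selective: most savings are lost
```

Here the AND filter keeps 5 of 100 rows while the OR filter keeps 55, so only the first is a strong candidate for meaningful I/O savings.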

Theoretical Basis

Predicate pushdown is a query optimization technique from database systems. By evaluating predicates at the storage layer, unnecessary data never enters the processing pipeline, reducing I/O, memory usage, and CPU time.

The optimization follows a general principle of pushing computation closer to data. In a traditional query pipeline without pushdown:

  1. All data is read from storage into memory
  2. Filter predicates are evaluated on the in-memory data
  3. Non-matching rows are discarded

With predicate pushdown:

  1. The storage layer first evaluates predicates against statistics stored with the data (for example min/max values or bloom filters) rather than against the data itself
  2. Only data pages or row groups that may contain matching rows are read
  3. Fine-grained filtering is applied during deserialization
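The three steps above can be illustrated with a minimal min/max pruning sketch. Chunk, make_chunk, and scan_greater_than are hypothetical names; a chunk stands in for whatever unit (page, row group, file) carries statistics, and the layout is an assumption rather than a description of Lance internals.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    """A block of values plus the statistics recorded when it was written."""
    values: Tuple[int, ...]
    min_value: int
    max_value: int


def make_chunk(values: List[int]) -> Chunk:
    # Statistics are computed once at write time, not at query time.
    return Chunk(tuple(values), min(values), max(values))


def may_contain_greater_than(chunk: Chunk, literal: int) -> bool:
    # If even the largest value fails the predicate, no row can match.
    return chunk.max_value > literal


def scan_greater_than(chunks: List[Chunk], literal: int) -> Tuple[List[int], int]:
    """Return matching values and the number of chunks actually read."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Step 1: consult statistics only.
        if not may_contain_greater_than(chunk, literal):
            continue
        # Step 2: read only chunks that may contain matches.
        chunks_read += 1
        # Step 3: fine-grained filtering on the values that were read.
        matches.extend(v for v in chunk.values if v > literal)
    return matches, chunks_read
```

Note that statistics can only rule a chunk out. A chunk whose range overlaps the predicate may still contain non-matching rows, which is why the third step is needed.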

Lance's columnar format supports this optimization particularly well because:

  • Each column stores statistics (min, max, null count) per data page
  • The columnar layout allows evaluating predicates on individual columns without reading other columns
  • Lance's indexing structures enable sub-linear predicate evaluation for indexed columns
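The second point, evaluating a predicate on one column without touching the others, can be shown with a toy column store. The ColumnStore type and the select helper are hypothetical and only model the idea; they do not reflect how Lance lays out or reads data on disk.

```python
from typing import Any, Callable, Dict, List

# Each column is stored as its own list, mirroring a columnar layout.
ColumnStore = Dict[str, List[Any]]


def select(
    table: ColumnStore,
    filter_column: str,
    keep: Callable[[Any], bool],
    projection: List[str],
) -> ColumnStore:
    """Filter on one column, then fetch only the projected columns."""
    # Only the filter column is scanned to find matching row positions.
    matching = [i for i, value in enumerate(table[filter_column]) if keep(value)]
    # Columns outside the projection (and the filter) are never accessed.
    return {name: [table[name][i] for i in matching] for name in projection}
```

For a table with columns id, price, and payload, filtering on price and projecting id leaves the payload column untouched, which is where a columnar format saves the most when payload values are large.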
