Implementation:Apache Paimon Predicate
| Knowledge Sources | |
|---|---|
| Domains | Query Optimization, Data Filtering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
`Predicate` is the row-filtering abstraction applied during table scans. It supports value-level testing, statistics-based pruning, and PyArrow filter expression generation, with each operator implemented as a pluggable tester.
Description
The `Predicate` dataclass encapsulates a filter operation as a method (the operation type), a field index, a field name, and literal values. It supports three distinct evaluation paths:
- `test()` evaluates a single row, given as an `InternalRow` object, and is used for in-memory filtering.
- `test_by_simple_stats()` prunes files or splits using min/max values and null counts, which enables data skipping at the file level.
- `to_arrow()` generates a PyArrow dataset filter expression, which enables push-down to Parquet/ORC readers.
Compound predicates (AND/OR) are supported via recursive composition: the literals of a compound predicate are themselves sub-predicates.
Individual operations are implemented as `Tester` subclasses (Equal, LessThan, GreaterThan, In, Between, StartsWith, IsNull, etc.) that auto-register via the `RegisterMeta` metaclass into the `Predicate.testers` dictionary. This gives a plugin-like architecture in which adding a new operator requires only implementing a new Tester subclass. Each Tester provides three methods: `test_by_value()` for row-level evaluation, `test_by_stats()` for statistics pruning, and `test_by_arrow()` for PyArrow expression generation. String operations (startsWith, endsWith, contains) use PyArrow compute functions when generating arrow expressions, with fallback behavior.
This multi-level predicate evaluation enables efficient query optimization at every stage: files are skipped via statistics, Arrow readers apply filters natively, and any remaining filtering happens at the row level.
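The auto-registration pattern described above can be sketched in plain Python. This is a minimal illustration and not pypaimon's code: `ToyRegisterMeta`, `ToyTester`, `TOY_TESTERS`, and `toy_test` are hypothetical names, and storing one tester instance per `name` is an assumption about how such a registry might be laid out.

```python
from abc import ABCMeta, abstractmethod

# Hypothetical stand-in for a registry such as Predicate.testers.
TOY_TESTERS = {}


class ToyRegisterMeta(ABCMeta):
    """Records every subclass that declares a non-None `name` in its body."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        op_name = namespace.get("name")
        if op_name is not None:
            # Assumption: the registry maps operator name -> tester instance.
            TOY_TESTERS[op_name] = cls()


class ToyTester(metaclass=ToyRegisterMeta):
    name = None  # the abstract base itself is never registered

    @abstractmethod
    def test_by_value(self, val, literals) -> bool:
        """Evaluate the operator against one value."""


class ToyEqual(ToyTester):
    name = "equal"

    def test_by_value(self, val, literals) -> bool:
        return val == literals[0]


class ToyIsNull(ToyTester):
    name = "isNull"

    def test_by_value(self, val, literals) -> bool:
        return val is None


def toy_test(method, val, literals=None):
    """Dispatch to the registered tester for `method`."""
    tester = TOY_TESTERS.get(method)
    if tester is None:
        raise ValueError(f"Unsupported predicate method: {method}")
    return tester.test_by_value(val, literals or [])
```

With this layout, defining a new subclass is enough to make its operator available to `toy_test`; no dispatch table has to be edited by hand.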
Usage
Predicates are constructed by the query planning layer and applied throughout the read pipeline for filtering and optimization.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/common/predicate.py
Signature
@dataclass
class Predicate:
    method: str
    index: Optional[int]
    field: Optional[str]
    literals: Optional[List[Any]] = None

    def test(self, record: InternalRow) -> bool: ...
    def test_by_simple_stats(self, stat: SimpleStats, row_count: int) -> bool: ...
    def to_arrow(self) -> Any: ...

class Tester(ABC, metaclass=RegisterMeta):
    name = None

    @abstractmethod
    def test_by_value(self, val, literals) -> bool: ...

    @abstractmethod
    def test_by_stats(self, min_v, max_v, literals) -> bool: ...

    @abstractmethod
    def test_by_arrow(self, val, literals) -> bool: ...

# Concrete testers: Equal, NotEqual, LessThan, GreaterThan, In, Between, etc.
Import
from pypaimon.common.predicate import Predicate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| method | str | yes | Predicate operation (e.g., "equal", "lessThan", "and", "or") |
| index | int | no | Field index in the row (required for non-compound predicates) |
| field | str | no | Field name (required for Arrow expression generation) |
| literals | List[Any] | no | Literal values or sub-predicates for compound operations |
Outputs
| Name | Type | Description |
|---|---|---|
| test result | bool | True if row/stats match the predicate |
| arrow expression | pyarrow.Expression | Filter expression for PyArrow dataset |
Usage Examples
Row-Level Filtering
from pypaimon.common.predicate import Predicate
from pypaimon.table.row.generic_row import GenericRow
from pypaimon.schema.data_types import DataField, AtomicType

# Create predicate: age >= 18
predicate = Predicate(
    method="greaterOrEqual",
    index=1,
    field="age",
    literals=[18]
)

# Test against a row
fields = [DataField(0, "name", AtomicType("STRING")),
          DataField(1, "age", AtomicType("INT"))]
row = GenericRow(["Alice", 25], fields)
result = predicate.test(row)  # True
Statistics-Based Pruning
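from pypaimon.common.predicate import Predicate
from pypaimon.manifest.schema.simple_stats import SimpleStats
from pypaimon.schema.data_types import DataField, AtomicType
from pypaimon.table.row.generic_row import GenericRow

# File statistics: age min=10, max=30, nulls=0
stats = SimpleStats(
    min_values=GenericRow([10], [DataField(1, "age", AtomicType("INT"))]),
    max_values=GenericRow([30], [DataField(1, "age", AtomicType("INT"))]),
    null_counts=[0]
)

# The stats rows above hold only the age column, so both predicates address position 0
predicate1 = Predicate(method="greaterOrEqual", index=0, field="age", literals=[18])

# Check if file contains rows matching age >= 18
can_contain = predicate1.test_by_simple_stats(stats, row_count=1000)  # True

# Check age >= 50 (will skip file)
predicate2 = Predicate(method="greaterOrEqual", index=0, field="age", literals=[50])
can_contain = predicate2.test_by_simple_stats(stats, row_count=1000)  # False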
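The two outcomes in this example follow from interval reasoning over the file's min/max values and null count. The sketch below is a simplified illustration of that reasoning under stated assumptions, not the library's implementation: `may_contain_greater_or_equal` and `may_contain_equal` are hypothetical helper names, and treating missing statistics as "keep the file" is an assumed conservative policy.

```python
def may_contain_greater_or_equal(min_v, max_v, null_count, row_count, literal):
    """Return False only when no row in the file can satisfy `col >= literal`."""
    if null_count == row_count:
        # Every value is null, so no comparison can succeed.
        return False
    if min_v is None or max_v is None:
        # Assumption: without usable statistics the file cannot be pruned.
        return True
    # Some value >= literal can exist only if the maximum reaches the literal.
    return max_v >= literal


def may_contain_equal(min_v, max_v, null_count, row_count, literal):
    """Return False only when `literal` lies outside the [min, max] range."""
    if null_count == row_count:
        return False
    if min_v is None or max_v is None:
        return True
    return min_v <= literal <= max_v


# Mirrors the statistics used above: min=10, max=30, nulls=0, 1000 rows.
keep_for_18 = may_contain_greater_or_equal(10, 30, 0, 1000, 18)
keep_for_50 = may_contain_greater_or_equal(10, 30, 0, 1000, 50)
```

A `True` result means only that the file might contain matching rows; the rows still have to be filtered by the reader or at the row level.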
PyArrow Filter Generation
import pyarrow.dataset as ds
# Generate PyArrow filter expression
arrow_filter = predicate.to_arrow()
# Use in PyArrow dataset scan
dataset = ds.dataset("/path/to/parquet", format="parquet")
filtered_data = dataset.to_table(filter=arrow_filter)
Compound Predicates
# Create AND predicate: age >= 18 AND age <= 65
predicate_and = Predicate(
    method="and",
    index=None,
    field=None,
    literals=[
        Predicate(method="greaterOrEqual", index=1, field="age", literals=[18]),
        Predicate(method="lessOrEqual", index=1, field="age", literals=[65])
    ]
)
result = predicate_and.test(row)  # True for age=25
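How a compound predicate recurses into its sub-predicates can be illustrated with a small standalone model. This is a sketch only: `ToyPredicate` is a hypothetical class that mirrors the four fields of `Predicate`, rows are plain Python sequences rather than `InternalRow` objects, and only two leaf operators are implemented.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class ToyPredicate:
    method: str
    index: Optional[int]
    field: Optional[str]
    literals: Optional[List[Any]] = None

    def test(self, row: Sequence[Any]) -> bool:
        # Compound predicates: the literals are themselves predicates.
        if self.method == "and":
            return all(child.test(row) for child in self.literals)
        if self.method == "or":
            return any(child.test(row) for child in self.literals)

        # Leaf predicates: compare the addressed value with the first literal.
        value = row[self.index]
        if value is None:
            return False  # assumption: comparisons against null never match
        if self.method == "greaterOrEqual":
            return value >= self.literals[0]
        if self.method == "lessOrEqual":
            return value <= self.literals[0]
        raise ValueError(f"Unsupported predicate method: {self.method}")


working_age = ToyPredicate(
    method="and",
    index=None,
    field=None,
    literals=[
        ToyPredicate("greaterOrEqual", 1, "age", [18]),
        ToyPredicate("lessOrEqual", 1, "age", [65]),
    ],
)
```

Because `and` and `or` nodes only delegate to their children, predicates can be nested to any depth without changes to the leaf logic.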