Implementation:Apache Paimon Predicate

Knowledge Sources	Apache_Paimon
Domains	Query Optimization, Data Filtering
Last Updated	2026-02-08 00:00 GMT

Overview

`Predicate` is a framework for filtering rows during table scans. It supports row-level value testing, statistics-based pruning, and PyArrow filter expression generation, with each operator implemented as a self-registering tester in a plugin-style architecture.

Description

The `Predicate` dataclass encapsulates a filter operation: a method (the operation type), a field index, a field name, and literal values. It supports three evaluation paths. `test()` evaluates a single `InternalRow` and is used for in-memory filtering. `test_by_simple_stats()` prunes files or splits using min/max values and null counts, which enables data skipping at the file level. `to_arrow()` generates a PyArrow dataset filter expression, which enables push-down to Parquet/ORC readers. Compound predicates (AND/OR) are built by recursive composition: their literals contain sub-predicates.

Individual operations are implemented as `Tester` subclasses (Equal, LessThan, GreaterThan, In, Between, StartsWith, IsNull, etc.) that auto-register via the `RegisterMeta` metaclass into the `Predicate.testers` dictionary. This gives a plugin-like architecture in which adding an operator requires only a new `Tester` subclass. Each `Tester` provides three methods: `test_by_value()` for row-level evaluation, `test_by_stats()` for statistics pruning, and `test_by_arrow()` for PyArrow expression generation. String operations (startsWith, endsWith, contains) use PyArrow compute functions when generating Arrow expressions, with fallback behavior.

This multi-level predicate evaluation enables efficient query optimization at every stage: files are skipped via statistics, Arrow readers apply filters natively, and any remaining filtering happens at the row level.

Usage

Predicates are constructed by the query planning layer and applied throughout the read pipeline for filtering and optimization.

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/common/predicate.py

Signature

@dataclass
class Predicate:
    method: str
    index: Optional[int]
    field: Optional[str]
    literals: Optional[List[Any]] = None

    def test(self, record: InternalRow) -> bool: ...
    def test_by_simple_stats(self, stat: SimpleStats, row_count: int) -> bool: ...
    def to_arrow(self) -> Any: ...

class Tester(ABC, metaclass=RegisterMeta):
    name = None

    @abstractmethod
    def test_by_value(self, val, literals) -> bool: ...
    @abstractmethod
    def test_by_stats(self, min_v, max_v, literals) -> bool: ...
    @abstractmethod
    def test_by_arrow(self, val, literals) -> bool: ...

# Concrete testers: Equal, NotEqual, LessThan, GreaterThan, In, Between, etc.

Import

from pypaimon.common.predicate import Predicate

I/O Contract

Inputs

Name	Type	Required	Description
method	str	yes	Predicate operation (e.g., "equal", "lessThan", "and", "or")
index	int	no	Field index in the row (required for non-compound predicates)
field	str	no	Field name (required for Arrow expression generation)
literals	List[Any]	no	Literal values or sub-predicates for compound operations

Outputs

Name	Type	Description
test result	bool	True if the row matches the predicate or, for statistics, if the file may contain matching rows
arrow expression	pyarrow.Expression	Filter expression for PyArrow dataset

Usage Examples

Row-Level Filtering

from pypaimon.common.predicate import Predicate
from pypaimon.table.row.generic_row import GenericRow
from pypaimon.schema.data_types import DataField, AtomicType

# Create predicate: age >= 18
predicate = Predicate(
    method="greaterOrEqual",
    index=1,
    field="age",
    literals=[18]
)

# Test against a row
fields = [DataField(0, "name", AtomicType("STRING")),
          DataField(1, "age", AtomicType("INT"))]
row = GenericRow(["Alice", 25], fields)

result = predicate.test(row)  # True

Statistics-Based Pruning

from pypaimon.manifest.schema.simple_stats import SimpleStats

# File statistics: age min=10, max=30, nulls=0
stats = SimpleStats(
    min_values=GenericRow([10], [DataField(1, "age", AtomicType("INT"))]),
    max_values=GenericRow([30], [DataField(1, "age", AtomicType("INT"))]),
    null_counts=[0]
)

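# Check if file contains rows matching age >= 18.
# The stats rows hold only the "age" column, so the predicate addresses it at index 0.
predicate_stats = Predicate(method="greaterOrEqual", index=0, field="age", literals=[18])
can_contain = predicate_stats.test_by_simple_stats(stats, row_count=1000)  # True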

# Check age >= 50 (will skip file)
predicate2 = Predicate(method="greaterOrEqual", index=0, field="age", literals=[50])
can_contain = predicate2.test_by_simple_stats(stats, row_count=1000)  # False
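
The reasoning behind these two results can be sketched without pypaimon. The snippet below is an assumption-laden illustration of min/max pruning, not the library's implementation: `STATS_TESTS` and `may_contain` are hypothetical names, and the handling of all-null files and missing statistics is one plausible choice.

```python
from typing import Any, Callable, Dict, List, Optional

# Each entry answers: could any value in [min_v, max_v] satisfy the operator?
STATS_TESTS: Dict[str, Callable[[Any, Any, List[Any]], bool]] = {
    "equal": lambda min_v, max_v, lits: min_v <= lits[0] <= max_v,
    "lessThan": lambda min_v, max_v, lits: min_v < lits[0],
    "lessOrEqual": lambda min_v, max_v, lits: min_v <= lits[0],
    "greaterThan": lambda min_v, max_v, lits: max_v > lits[0],
    "greaterOrEqual": lambda min_v, max_v, lits: max_v >= lits[0],
}


def may_contain(
    method: str,
    min_v: Optional[Any],
    max_v: Optional[Any],
    null_count: int,
    row_count: int,
    literals: List[Any],
) -> bool:
    """Return False only when the file provably holds no matching row."""
    # Assumption: a column that is entirely null cannot satisfy a comparison.
    if null_count == row_count:
        return False
    # Assumption: without min/max statistics the file must be kept.
    if min_v is None or max_v is None:
        return True
    return STATS_TESTS[method](min_v, max_v, literals)


# Same numbers as the example above: age min=10, max=30, nulls=0.
keep_18 = may_contain("greaterOrEqual", 10, 30, 0, 1000, [18])
keep_50 = may_contain("greaterOrEqual", 10, 30, 0, 1000, [50])
```

A True result only means the file cannot be ruled out; rows inside it still have to be filtered by a later stage.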

PyArrow Filter Generation

import pyarrow.dataset as ds

# Generate PyArrow filter expression
arrow_filter = predicate.to_arrow()

# Use in PyArrow dataset scan
dataset = ds.dataset("/path/to/parquet", format="parquet")
filtered_data = dataset.to_table(filter=arrow_filter)

Compound Predicates

# Create AND predicate: age >= 18 AND age <= 65
predicate_and = Predicate(
    method="and",
    index=None,
    field=None,
    literals=[
        Predicate(method="greaterOrEqual", index=1, field="age", literals=[18]),
        Predicate(method="lessOrEqual", index=1, field="age", literals=[65])
    ]
)

result = predicate_and.test(row)  # True for age=25
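
How recursive composition evaluates such a tree can be shown with a small standalone sketch. `SketchPredicate` and `LEAF_OPS` are hypothetical stand-ins that mirror the field layout of `Predicate`; rows are plain lists here rather than `InternalRow` objects.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Leaf operators for the sketch; the real class dispatches to Tester objects.
LEAF_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "equal": operator.eq,
    "lessOrEqual": operator.le,
    "greaterOrEqual": operator.ge,
}


@dataclass
class SketchPredicate:
    method: str
    index: Optional[int]
    field: Optional[str]
    literals: Optional[List[Any]] = None

    def test(self, row: List[Any]) -> bool:
        # Compound nodes hold sub-predicates in `literals` and recurse.
        if self.method == "and":
            return all(child.test(row) for child in self.literals)
        if self.method == "or":
            return any(child.test(row) for child in self.literals)
        # Leaf nodes compare one field against the first literal.
        return LEAF_OPS[self.method](row[self.index], self.literals[0])


working_age = SketchPredicate(
    method="and",
    index=None,
    field=None,
    literals=[
        SketchPredicate("greaterOrEqual", 1, "age", [18]),
        SketchPredicate("lessOrEqual", 1, "age", [65]),
    ],
)
alice_or_working_age = SketchPredicate(
    method="or",
    index=None,
    field=None,
    literals=[SketchPredicate("equal", 0, "name", ["Alice"]), working_age],
)
```

Because compound nodes are themselves predicates, AND and OR can be nested to any depth without special handling.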

Related Pages

Principle:Apache_Paimon_Utility_Infrastructure
