Implementation:Apache Paimon Predicate

From Leeroopedia


Knowledge Sources
Domains: Query Optimization, Data Filtering
Last Updated: 2026-02-08 00:00 GMT

Overview

`Predicate` is the row-filtering framework applied during table scans. It supports value-level testing, statistics-based pruning, and PyArrow filter expression generation, with each operator implemented by a self-registering tester class.

Description

The `Predicate` dataclass encapsulates a filter operation: a method (the operation type), a field index, a field name, and literal values. It supports three evaluation paths. `test()` evaluates a single `InternalRow` and is used for in-memory filtering. `test_by_simple_stats()` uses min/max values and null counts to prune files or splits, which enables data skipping at the file level. `to_arrow()` generates a PyArrow dataset filter expression that can be pushed down to Parquet/ORC readers.

Compound predicates (AND/OR) are built by recursive composition: their `literals` hold sub-predicates rather than plain values.

Individual operations are implemented as `Tester` subclasses (Equal, LessThan, GreaterThan, In, Between, StartsWith, IsNull, etc.) that auto-register via the `RegisterMeta` metaclass into the `Predicate.testers` dictionary. This gives a plugin-like architecture in which adding an operator only requires implementing a new `Tester` subclass. Each `Tester` provides three methods: `test_by_value()` for row-level evaluation, `test_by_stats()` for statistics pruning, and `test_by_arrow()` for PyArrow expression generation. String operations (startsWith, endsWith, contains) use PyArrow compute functions when generating Arrow expressions, with fallback behavior.

This multi-level predicate evaluation enables efficient query optimization at every stage: files are skipped via statistics, Arrow readers apply filters natively, and any remaining filtering happens at the row level.
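To make the staged evaluation concrete, here is a small standalone sketch of the first and last stages for a filter of the form `age >= threshold`. `FileMeta`, `may_contain_age_at_least`, and `scan` are hypothetical names invented for illustration, and the Arrow push-down stage is left out so the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileMeta:
    """Hypothetical stand-in for a data file with min/max stats on `age`."""
    name: str
    min_age: int
    max_age: int
    rows: List[Dict[str, int]]


def may_contain_age_at_least(f: FileMeta, threshold: int) -> bool:
    # Stage 1: a file can hold a match only if its maximum reaches the threshold.
    return f.max_age >= threshold


def scan(files: List[FileMeta], threshold: int) -> List[Dict[str, int]]:
    result = []
    for f in files:
        if not may_contain_age_at_least(f, threshold):
            continue  # the whole file is skipped without touching its rows
        # Final stage: row-level check on files that survived pruning.
        result.extend(r for r in f.rows if r["age"] >= threshold)
    return result


files = [
    FileMeta("f1", 10, 30, [{"age": 10}, {"age": 30}]),
    FileMeta("f2", 40, 70, [{"age": 40}, {"age": 70}]),
]
matches = scan(files, 50)
```

Statistics can only prove that a file has no matches, never that every row matches, which is why the row-level check still runs on the surviving file `f2`.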

Usage

Predicates are constructed by the query planning layer and applied throughout the read pipeline for filtering and optimization.

Code Reference

Source Location

Signature

@dataclass
class Predicate:
    method: str
    index: Optional[int]
    field: Optional[str]
    literals: Optional[List[Any]] = None

    def test(self, record: InternalRow) -> bool: ...
    def test_by_simple_stats(self, stat: SimpleStats, row_count: int) -> bool: ...
    def to_arrow(self) -> Any: ...

class Tester(ABC, metaclass=RegisterMeta):
    name = None

    @abstractmethod
    def test_by_value(self, val, literals) -> bool: ...
    @abstractmethod
    def test_by_stats(self, min_v, max_v, literals) -> bool: ...
    @abstractmethod
    def test_by_arrow(self, val, literals) -> bool: ...

# Concrete testers: Equal, NotEqual, LessThan, GreaterThan, In, Between, etc.

Import

from pypaimon.common.predicate import Predicate

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| method | str | yes | Predicate operation (e.g., "equal", "lessThan", "and", "or") |
| index | int | no | Field index in the row (required for non-compound predicates) |
| field | str | no | Field name (required for Arrow expression generation) |
| literals | List[Any] | no | Literal values or sub-predicates for compound operations |

Outputs

| Name | Type | Description |
|---|---|---|
| test result | bool | True if row/stats match the predicate |
| arrow expression | pyarrow.Expression | Filter expression for PyArrow dataset |

Usage Examples

Row-Level Filtering

from pypaimon.common.predicate import Predicate
from pypaimon.table.row.generic_row import GenericRow
from pypaimon.schema.data_types import DataField, AtomicType

# Create predicate: age >= 18
predicate = Predicate(
    method="greaterOrEqual",
    index=1,
    field="age",
    literals=[18]
)

# Test against a row
fields = [DataField(0, "name", AtomicType("STRING")),
          DataField(1, "age", AtomicType("INT"))]
row = GenericRow(["Alice", 25], fields)

result = predicate.test(row)  # True

Statistics-Based Pruning

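from pypaimon.common.predicate import Predicate
from pypaimon.manifest.schema.simple_stats import SimpleStats
from pypaimon.schema.data_types import DataField, AtomicType
from pypaimon.table.row.generic_row import GenericRow

# File statistics: age min=10, max=30, nulls=0
stats = SimpleStats(
    min_values=GenericRow([10], [DataField(1, "age", AtomicType("INT"))]),
    max_values=GenericRow([30], [DataField(1, "age", AtomicType("INT"))]),
    null_counts=[0]
)

# Check if file contains rows matching age >= 18.
# The stats rows above hold a single column, so the predicate uses index 0.
predicate1 = Predicate(method="greaterOrEqual", index=0, field="age", literals=[18])
can_contain = predicate1.test_by_simple_stats(stats, row_count=1000)  # True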

# Check age >= 50 (will skip file)
predicate2 = Predicate(method="greaterOrEqual", index=0, field="age", literals=[50])
can_contain = predicate2.test_by_simple_stats(stats, row_count=1000)  # False
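Null counts matter for predicates that min/max values cannot decide, which is one plausible reason `test_by_simple_stats()` also takes `row_count`. The sketch below shows that reasoning in isolation; both helper functions are hypothetical, and treating a missing null count as "cannot prune" is an assumption rather than confirmed pypaimon behavior.

```python
from typing import Optional


def may_contain_null(null_count: Optional[int]) -> bool:
    # Assumption: unknown statistics must never cause a file to be skipped.
    return null_count is None or null_count > 0


def may_contain_non_null(null_count: Optional[int], row_count: int) -> bool:
    # If every row is null, no non-null value can satisfy the predicate.
    return null_count is None or null_count < row_count
```

With the statistics used above (`null_counts=[0]`, `row_count=1000`), `may_contain_null(0)` is `False`, so an is-null filter could skip the file, while `may_contain_non_null(0, 1000)` is `True`.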

PyArrow Filter Generation

import pyarrow.dataset as ds

# Generate PyArrow filter expression
arrow_filter = predicate.to_arrow()

# Use in PyArrow dataset scan
dataset = ds.dataset("/path/to/parquet", format="parquet")
filtered_data = dataset.to_table(filter=arrow_filter)

Compound Predicates

# Create AND predicate: age >= 18 AND age <= 65
predicate_and = Predicate(
    method="and",
    index=None,
    field=None,
    literals=[
        Predicate(method="greaterOrEqual", index=1, field="age", literals=[18]),
        Predicate(method="lessOrEqual", index=1, field="age", literals=[65])
    ]
)

result = predicate_and.test(row)  # True for age=25
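The recursive composition used by compound predicates can be sketched without pypaimon. `MiniPredicate` is a hypothetical, simplified class that supports only the methods needed here; rows are plain lists, and leaf operators are hard-coded instead of being looked up in a tester registry.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class MiniPredicate:
    method: str
    index: Optional[int]
    literals: List[Any]

    def test(self, row: Sequence[Any]) -> bool:
        # Compound nodes: literals are sub-predicates, evaluated recursively.
        if self.method == "and":
            return all(p.test(row) for p in self.literals)
        if self.method == "or":
            return any(p.test(row) for p in self.literals)
        # Leaf nodes: compare the indexed field with the first literal.
        val = row[self.index]
        if self.method == "greaterOrEqual":
            return val >= self.literals[0]
        if self.method == "lessOrEqual":
            return val <= self.literals[0]
        raise ValueError(f"unsupported method: {self.method}")


working_age = MiniPredicate("and", None, [
    MiniPredicate("greaterOrEqual", 1, [18]),
    MiniPredicate("lessOrEqual", 1, [65]),
])
```

Because `all()` and `any()` short-circuit, an AND node stops at the first failing child and an OR node stops at the first passing child.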
