Implementation:Apache Paimon IndexedSplit Result Retrieval


Knowledge Sources
Domains Data_Lake, Vector_Search
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for wrapping data splits with index results, namely the matched row ranges and optional similarity scores.

Description

IndexedSplit extends Split and wraps a data split with row_ranges (List[Range]) and optional scores (List[float]). It provides efficient row-level access to index results:

  • contains_row_id(): Checks if a given row ID falls within any of the matched row ranges.
  • get_score(): Retrieves the similarity score for a specific row ID, returning None if the row is not in the matched ranges or if scores are not available.
  • row_count: Property that returns the total count of matched rows across all row ranges.
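
One plausible way to implement the score lookup and row count described above is sketched below. It is an illustration, not the pypaimon implementation: ranges are plain (start, end) tuples with an exclusive end, following the convention this page states, and the helper names score_for_row and total_row_count are hypothetical. It also assumes scores holds one entry per matched row, laid out in range order.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a matched range: (start, end) with an exclusive end.
RowRange = Tuple[int, int]


def total_row_count(row_ranges: List[RowRange]) -> int:
    """Sum the sizes of all matched ranges."""
    return sum(end - start for start, end in row_ranges)


def score_for_row(
    row_ranges: List[RowRange],
    scores: Optional[List[float]],
    row_id: int,
) -> Optional[float]:
    """Return the score for row_id, or None if it is unmatched or unscored."""
    if scores is None:
        return None
    offset = 0  # index in `scores` of the first row of the current range
    for start, end in row_ranges:
        if start <= row_id < end:
            return scores[offset + (row_id - start)]
        offset += end - start
    return None


ranges = [(0, 10), (50, 55)]
scores = [1.0 - i / 100 for i in range(15)]  # one score per matched row

hit = score_for_row(ranges, scores, 52)    # third row of the second range
miss = score_for_row(ranges, scores, 20)   # not in any range
```

Here row 52 maps to position 12 in scores (10 rows from the first range plus an offset of 2 into the second), which is why the running offset is needed.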

IndexedSplit delegates all standard Split properties (files, partition, bucket) to the underlying data_split, maintaining full compatibility with the read pipeline.
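
The delegation described above can be pictured with a small wrapper. This is a sketch only: StubSplit and DelegatingSplit are hypothetical stand-ins written for illustration and are not the pypaimon Split or IndexedSplit classes.

```python
class StubSplit:
    """Hypothetical stand-in for a data split (not the pypaimon Split)."""

    def __init__(self, files, partition, bucket):
        self.files = files
        self.partition = partition
        self.bucket = bucket


class DelegatingSplit:
    """Wraps a split and forwards the standard properties to it."""

    def __init__(self, data_split, row_ranges):
        self._data_split = data_split
        self._row_ranges = row_ranges

    @property
    def files(self):
        return self._data_split.files

    @property
    def partition(self):
        return self._data_split.partition

    @property
    def bucket(self):
        return self._data_split.bucket


inner = StubSplit(files=["data-0.parquet"], partition=("2026-02-07",), bucket=3)
wrapped = DelegatingSplit(inner, row_ranges=[(0, 10)])
```

Because the wrapper answers the same property names as the split it holds, code that only reads files, partition, or bucket does not need to know whether it received a plain or a wrapped split.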

The Range class provides foundational range operations:

  • contains(): Tests whether a value falls within the range bounds (inclusive start, exclusive end).
  • count(): Returns the number of elements in the range.
  • merge(): Combines overlapping or adjacent ranges into a single range.
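
The three operations can be sketched with a small stand-in class. SketchRange and merge_ranges are hypothetical names used only for illustration; the real Range class in pypaimon may differ in method signatures and bound handling. The sketch follows the inclusive-start, exclusive-end convention stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchRange:
    """Hypothetical half-open range [start, end); not the pypaimon Range."""

    start: int
    end: int  # exclusive

    def contains(self, value: int) -> bool:
        return self.start <= value < self.end

    def count(self) -> int:
        return self.end - self.start


def merge_ranges(ranges: List[SketchRange]) -> List[SketchRange]:
    """Combine overlapping or adjacent ranges into a sorted, disjoint list."""
    merged: List[SketchRange] = []
    for current in sorted(ranges, key=lambda r: r.start):
        if merged and current.start <= merged[-1].end:
            # Overlapping or touching: extend the previous range.
            last = merged[-1]
            merged[-1] = SketchRange(last.start, max(last.end, current.end))
        else:
            merged.append(current)
    return merged


merged = merge_ranges(
    [SketchRange(50, 150), SketchRange(0, 100), SketchRange(200, 210)]
)
```

Sorting by start first means each range only has to be compared against the most recently merged one, so the merge is a single pass after the sort.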

Usage

Use IndexedSplit after index evaluation to wrap data splits with row-level filtering information. Pass the resulting IndexedSplit instances to the read pipeline for efficient skip-scan retrieval.
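
To make the idea of skip-scan retrieval concrete, the sketch below selects only the matched rows from one batch by intersecting each row range with the window of row IDs the batch covers. It is an assumption about how such filtering could work, not a description of the pypaimon read pipeline; select_matched_rows is a hypothetical helper, and ranges are (start, end) tuples with an exclusive end.

```python
from typing import List, Sequence, Tuple


def select_matched_rows(
    batch: Sequence,
    batch_start: int,
    row_ranges: List[Tuple[int, int]],
) -> list:
    """Return the rows of `batch` whose row IDs fall inside any range.

    `batch_start` is the row ID of the first row in the batch.
    """
    batch_end = batch_start + len(batch)
    selected = []
    for start, end in row_ranges:
        # Clip the range to the rows this batch actually holds.
        lo = max(start, batch_start)
        hi = min(end, batch_end)
        if lo < hi:
            selected.extend(batch[lo - batch_start:hi - batch_start])
    return selected


# A batch holding rows 40..59; here each row's value equals its row ID.
batch = list(range(40, 60))
matched = select_matched_rows(batch, 40, [(0, 10), (50, 55)])
```

Slicing whole sub-ranges avoids testing every row individually, which is the main benefit of carrying ranges instead of a flat list of row IDs.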

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/globalindex/indexed_split.py:L28-142
  • File: paimon-python/pypaimon/globalindex/range.py:L25-191

Signature

class IndexedSplit(Split):
    def __init__(
        self,
        data_split: Split,
        row_ranges: List[Range],
        scores: Optional[List[float]] = None,
    ):

    def data_split(self) -> Split:
    def row_ranges(self) -> List[Range]:
    def scores(self) -> Optional[List[float]]:
    def contains_row_id(self, row_id: int) -> bool:
    def get_score(self, row_id: int) -> Optional[float]:
    @property
    def row_count(self) -> int:

class Range:
    def __init__(self, from_: int, to: int):
    def contains(self, value: int) -> bool:
    def count(self) -> int:

Import

from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range

I/O Contract

Inputs

Name | Type | Required | Description
data_split | Split | Yes | The underlying data split containing the physical file references
row_ranges | List[Range] | Yes | List of contiguous row ranges that matched the index query
scores | Optional[List[float]] | No | Per-row similarity scores from vector search (aligned with row ranges)
row_id | int | Yes (for contains_row_id/get_score) | Row ID to check membership or retrieve score for
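
The inputs above describe scores as aligned with row ranges. One reading, consistent with the basic usage example on this page (15 scores for 15 matched rows), is that scores holds exactly one entry per matched row. Under that assumption, a caller could check the inputs before constructing a split. The helper scores_aligned is hypothetical, and ranges are written as (start, end) tuples with an exclusive end.

```python
from typing import List, Optional, Tuple


def scores_aligned(
    row_ranges: List[Tuple[int, int]],
    scores: Optional[List[float]],
) -> bool:
    """True if scores is absent or has one entry per matched row."""
    if scores is None:
        return True
    expected = sum(end - start for start, end in row_ranges)
    return len(scores) == expected


ok = scores_aligned([(0, 10), (50, 55)], [0.5] * 15)
too_short = scores_aligned([(0, 10), (50, 55)], [0.5] * 14)
```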

Outputs

Name | Type | Description
IndexedSplit | Split subclass | Wrapped split with row-level filtering and scoring
data_split() | Split | The underlying unwrapped data split
row_ranges() | List[Range] | The matched row ranges
scores() | Optional[List[float]] | The similarity scores (None if not a vector query)
contains_row_id() | bool | Whether the given row ID is in the matched ranges
get_score() | Optional[float] | Similarity score for the given row ID, or None
row_count | int | Total count of matched rows across all ranges

Usage Examples

Basic Usage

from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range

# After index evaluation returns matching row ranges.
# original_split is assumed to be a Split obtained earlier from scan planning.
indexed_split = IndexedSplit(
    data_split=original_split,
    row_ranges=[Range(0, 10), Range(50, 55)],
    scores=[0.95, 0.93, 0.91, 0.90, 0.88,
            0.87, 0.85, 0.84, 0.82, 0.80,
            0.78, 0.76, 0.74, 0.72, 0.70],
)

# Check membership
print(indexed_split.contains_row_id(5))   # True  (in range 0-10)
print(indexed_split.contains_row_id(20))  # False (not in any range)
print(indexed_split.contains_row_id(52))  # True  (in range 50-55)

# Get similarity score for a matched row
score = indexed_split.get_score(5)
print(f"Similarity: {score}")

# Total matched rows across all ranges
print(f"Matched rows: {indexed_split.row_count}")  # 15 (10 + 5)

Working with Range Objects

from pypaimon.globalindex.range import Range

# Create ranges
r1 = Range(0, 100)
r2 = Range(50, 150)

# Check containment
print(r1.contains(50))   # True
print(r1.contains(100))  # False (exclusive end)
print(r2.contains(49))   # False (before the start of r2)

# Get count
print(r1.count())  # 100

Using IndexedSplit in Read Pipeline

from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range

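# Create indexed splits from index evaluation results.
# data_splits, split_ranges, and split_scores are assumed to be parallel
# lists produced earlier (one entry per split).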
indexed_splits = []
for split, ranges, scores in zip(data_splits, split_ranges, split_scores):
    indexed_split = IndexedSplit(
        data_split=split,
        row_ranges=ranges,
        scores=scores,
    )
    indexed_splits.append(indexed_split)

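# Pass indexed splits to table read for efficient retrieval.
# pypaimon creates readers through a read builder; passing IndexedSplit
# instances to to_arrow() follows this page's description of the pipeline.
read_builder = table.new_read_builder()
table_read = read_builder.new_read()
# The read pipeline uses row_ranges to skip non-matching rows
data = table_read.to_arrow(indexed_splits)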
