Implementation:Apache Paimon IndexedSplit Result Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete class that wraps a data split with index evaluation results: the matched row ranges and, for vector search, the similarity scores of the matched rows.
Description
IndexedSplit extends Split and wraps a data split with row_ranges (List[Range]) and optional scores (List[float]). It provides efficient row-level access to index results:
- contains_row_id(): Checks if a given row ID falls within any of the matched row ranges.
- get_score(): Retrieves the similarity score for a specific row ID, returning None if the row is not in the matched ranges or if scores are not available.
- row_count: Property that returns the total count of matched rows across all row ranges.
IndexedSplit delegates all standard Split properties (files, partition, bucket) to the underlying data_split, maintaining full compatibility with the read pipeline.
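The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the pypaimon implementation: `SketchIndexedSplit` is a hypothetical stand-in, ranges are plain `(start, end)` tuples with an exclusive end (following the convention stated on this page), and the assumption that `scores` holds one value per matched row in range order is inferred from the I/O contract rather than confirmed by the source.

```python
from types import SimpleNamespace
from typing import List, Optional, Tuple


class SketchIndexedSplit:
    """Hypothetical stand-in for IndexedSplit; not the pypaimon class."""

    def __init__(self, data_split, row_ranges: List[Tuple[int, int]],
                 scores: Optional[List[float]] = None):
        self._data_split = data_split
        # Assumption: (start, end) pairs with exclusive end, non-overlapping.
        self._row_ranges = sorted(row_ranges)
        # Assumption: one score per matched row, laid out in range order.
        self._scores = scores

    def _offset_of(self, row_id: int) -> Optional[int]:
        """Position of row_id among all matched rows, or None if unmatched."""
        offset = 0
        for start, end in self._row_ranges:
            if start <= row_id < end:
                return offset + (row_id - start)
            offset += end - start
        return None

    def contains_row_id(self, row_id: int) -> bool:
        return self._offset_of(row_id) is not None

    def get_score(self, row_id: int) -> Optional[float]:
        if self._scores is None:
            return None
        offset = self._offset_of(row_id)
        return None if offset is None else self._scores[offset]

    @property
    def row_count(self) -> int:
        return sum(end - start for start, end in self._row_ranges)

    # Standard split properties are forwarded to the wrapped split.
    @property
    def files(self):
        return self._data_split.files

    @property
    def partition(self):
        return self._data_split.partition

    @property
    def bucket(self):
        return self._data_split.bucket


# A fake underlying split, used only to show delegation.
base = SimpleNamespace(files=["f1.parquet"], partition=("2024",), bucket=3)
scores = [i / 100 for i in range(100, 85, -1)]  # 15 scores: 1.0 down to 0.86
split = SketchIndexedSplit(base, [(0, 10), (50, 55)], scores)
```

In this sketch, row 52 is the third row of the second range, so its score sits at offset 10 + 2 = 12 in the flat score list.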
The Range class provides foundational range operations:
- contains(): Tests whether a value falls within the range bounds (inclusive start, exclusive end).
- count(): Returns the number of elements in the range.
- merge(): Combines overlapping or adjacent ranges into a single range.
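To make the three range operations concrete, here is a small sketch under the half-open convention described above. `SketchRange` and `merge_ranges` are hypothetical names introduced for illustration; the actual `Range.merge()` signature in pypaimon is not shown on this page and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True, order=True)
class SketchRange:
    """Hypothetical half-open range [start, end); not pypaimon's Range."""
    start: int
    end: int

    def contains(self, value: int) -> bool:
        return self.start <= value < self.end

    def count(self) -> int:
        return self.end - self.start


def merge_ranges(ranges: List[SketchRange]) -> List[SketchRange]:
    """Coalesce overlapping or adjacent ranges into a sorted, disjoint list."""
    merged: List[SketchRange] = []
    for r in sorted(ranges):
        if merged and r.start <= merged[-1].end:
            # Overlaps or touches the previous range: extend it.
            last = merged.pop()
            merged.append(SketchRange(last.start, max(last.end, r.end)))
        else:
            merged.append(r)
    return merged


merged = merge_ranges([
    SketchRange(50, 150),
    SketchRange(0, 100),
    SketchRange(150, 160),
    SketchRange(200, 210),
])
```

Sorting first means each range only needs to be compared with the most recently merged one, so the merge is a single pass after the sort.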
Usage
Use IndexedSplit after index evaluation to wrap data splits with row-level filtering information. Pass the resulting IndexedSplit instances to the read pipeline for efficient skip-scan retrieval.
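As a rough illustration of what skip-scan retrieval means here, the sketch below computes which slices of a record batch overlap the matched row ranges, so that a reader could copy only those slices. This is an assumption about how a reader might use the ranges, not a description of Paimon's reader; `slices_to_keep` and its parameters are hypothetical, and ranges are again `(start, end)` tuples with an exclusive end.

```python
from typing import Iterator, List, Tuple


def slices_to_keep(batch_start: int, batch_len: int,
                   row_ranges: List[Tuple[int, int]]) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) slices of a batch that intersect the matched ranges."""
    batch_end = batch_start + batch_len
    for start, end in row_ranges:
        lo = max(start, batch_start)
        hi = min(end, batch_end)
        if lo < hi:
            # Offsets are relative to the first row of the batch.
            yield lo - batch_start, hi - lo


# A batch covering row IDs 40..59 against three matched ranges.
kept = list(slices_to_keep(batch_start=40, batch_len=20,
                           row_ranges=[(0, 10), (50, 55), (58, 70)]))
```

Here the first range lies entirely before the batch and is skipped, the second falls inside it, and the third is clipped at the batch boundary.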
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/globalindex/indexed_split.py:L28-142
- File: paimon-python/pypaimon/globalindex/range.py:L25-191
Signature
class IndexedSplit(Split):
    def __init__(
        self,
        data_split: Split,
        row_ranges: List[Range],
        scores: Optional[List[float]] = None,
    ):
    def data_split(self) -> Split:
    def row_ranges(self) -> List[Range]:
    def scores(self) -> Optional[List[float]]:
    def contains_row_id(self, row_id: int) -> bool:
    def get_score(self, row_id: int) -> Optional[float]:
    @property
    def row_count(self) -> int:

class Range:
    def __init__(self, from_: int, to: int):
    def contains(self, value: int) -> bool:
    def count(self) -> int:
Import
from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_split | Split | Yes | The underlying data split containing the physical file references |
| row_ranges | List[Range] | Yes | List of contiguous row ranges that matched the index query |
| scores | Optional[List[float]] | No | Per-row similarity scores from vector search (aligned with row ranges) |
| row_id | int | Yes (for contains_row_id/get_score) | Row ID to check membership or retrieve score for |
Outputs
| Name | Type | Description |
|---|---|---|
| IndexedSplit | Split subclass | Wrapped split with row-level filtering and scoring |
| data_split() | Split | The underlying unwrapped data split |
| row_ranges() | List[Range] | The matched row ranges |
| scores() | Optional[List[float]] | The similarity scores (None if not a vector query) |
| contains_row_id() | bool | Whether the given row ID is in the matched ranges |
| get_score() | Optional[float] | Similarity score for the given row ID, or None |
| row_count | int | Total count of matched rows across all ranges |
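The contract above implies that, when scores are supplied, their number should match the total number of matched rows. A caller could check this before constructing the split, as in the hedged sketch below. `check_scores_alignment` is a hypothetical helper, not part of pypaimon, and it assumes half-open `(start, end)` tuples and one score per matched row.

```python
from typing import List, Optional, Tuple


def check_scores_alignment(row_ranges: List[Tuple[int, int]],
                           scores: Optional[List[float]]) -> None:
    """Raise ValueError if scores are given but do not cover every matched row."""
    expected = sum(end - start for start, end in row_ranges)
    if scores is not None and len(scores) != expected:
        raise ValueError(
            f"expected {expected} scores for the matched rows, got {len(scores)}"
        )


# 10 + 5 matched rows, 15 scores: passes silently.
check_scores_alignment([(0, 10), (50, 55)], [0.5] * 15)
# No scores at all (a non-vector query): also fine.
check_scores_alignment([(0, 10)], None)
```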
Usage Examples
Basic Usage
from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range
# After index evaluation returns matching row ranges
indexed_split = IndexedSplit(
    data_split=original_split,
    row_ranges=[Range(0, 10), Range(50, 55)],
    scores=[0.95, 0.93, 0.91, 0.90, 0.88,
            0.87, 0.85, 0.84, 0.82, 0.80,
            0.78, 0.76, 0.74, 0.72, 0.70],
)
# Check membership
print(indexed_split.contains_row_id(5)) # True (in range 0-10)
print(indexed_split.contains_row_id(20)) # False (not in any range)
print(indexed_split.contains_row_id(52)) # True (in range 50-55)
# Get similarity score for a matched row
score = indexed_split.get_score(5)
print(f"Similarity: {score}")
# Total matched rows across all ranges
print(f"Matched rows: {indexed_split.row_count}") # 15 (10 + 5)
Working with Range Objects
from pypaimon.globalindex.range import Range
# Create ranges
r1 = Range(0, 100)
r2 = Range(50, 150)
# Check containment
print(r1.contains(50)) # True
print(r1.contains(100)) # False (exclusive end)
# Get count
print(r1.count()) # 100
Using IndexedSplit in Read Pipeline
from pypaimon.globalindex.indexed_split import IndexedSplit
from pypaimon.globalindex.range import Range
# Create indexed splits from index evaluation results
indexed_splits = []
for split, ranges, scores in zip(data_splits, split_ranges, split_scores):
    indexed_split = IndexedSplit(
        data_split=split,
        row_ranges=ranges,
        scores=scores,
    )
    indexed_splits.append(indexed_split)
# Pass indexed splits to table read for efficient retrieval
table_read = table.new_read()
for isplit in indexed_splits:
    # The read pipeline uses row_ranges to skip non-matching rows
    data = table_read.read(isplit)