Principle:Apache Paimon Indexed Split Result Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A mechanism for mapping global index results back to data splits, annotated with row ranges and optional similarity scores, so that the final read retrieves only matching rows.
Description
After global index evaluation produces matching row IDs, these must be mapped back to the physical data splits that contain those rows. IndexedSplit wraps a regular Split with row ranges (indicating which rows within the split matched) and optional similarity scores.
The mapping process works as follows:
- Row ID to Split Mapping: Global row IDs are partitioned across data splits. Each split covers a contiguous range of row IDs. The index results (RoaringBitmap of matching row IDs) are intersected with each split's row range to determine which splits contain matches.
- Row Ranges: Range objects specify contiguous row ranges within a split that matched the query. Multiple ranges per split are possible when matches are non-contiguous.
- Similarity Scores: For vector search queries, per-row similarity scores are provided alongside row ranges, enabling result ranking by relevance.
- Skip-Scan Access: IndexedSplit enables the read pipeline to efficiently skip non-matching rows within each split, reading only the rows identified by the index.
IndexedSplit delegates all standard Split properties (files, partition, bucket) to the underlying data split, maintaining compatibility with the standard read pipeline.
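The mapping described above can be sketched in plain Python. This is a minimal illustration, not Paimon's implementation: `DataSplitSketch`, `IndexedSplitSketch`, `to_ranges`, and `map_to_splits` are hypothetical names, a plain set of integers stands in for the RoaringBitmap, ranges are kept as half-open global row ID pairs, and the `get_score(row_id)` signature is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DataSplitSketch:
    """Hypothetical split covering global row IDs [first_row_id, first_row_id + row_count)."""
    first_row_id: int
    row_count: int
    partition: str = "p0"
    bucket: int = 0


@dataclass
class IndexedSplitSketch:
    """Hypothetical wrapper: a split plus matching row ranges and optional scores."""
    split: DataSplitSketch
    row_ranges: List[Tuple[int, int]]          # half-open [start, end), global row IDs
    scores: Optional[Dict[int, float]] = None  # row ID -> similarity score

    @property
    def partition(self) -> str:
        return self.split.partition  # delegated to the wrapped split

    @property
    def bucket(self) -> int:
        return self.split.bucket     # delegated to the wrapped split

    def get_score(self, row_id: int) -> Optional[float]:
        # Assumed signature; returns None when no scores are attached.
        return None if self.scores is None else self.scores.get(row_id)


def to_ranges(sorted_ids: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted, distinct row IDs into half-open contiguous ranges."""
    ranges: List[Tuple[int, int]] = []
    for rid in sorted_ids:
        if ranges and ranges[-1][1] == rid:
            ranges[-1] = (ranges[-1][0], rid + 1)  # extend the current run
        else:
            ranges.append((rid, rid + 1))          # start a new run
    return ranges


def map_to_splits(
    matched: Iterable[int],
    splits: List[DataSplitSketch],
    scores: Optional[Dict[int, float]] = None,
) -> List[IndexedSplitSketch]:
    """Intersect matched row IDs with each split's row range."""
    matched_ids = set(matched)
    indexed: List[IndexedSplitSketch] = []
    for split in splits:
        lo, hi = split.first_row_id, split.first_row_id + split.row_count
        ids = sorted(r for r in matched_ids if lo <= r < hi)
        if not ids:
            continue  # no matches: this split is dropped and never read
        split_scores = None if scores is None else {r: scores[r] for r in ids}
        indexed.append(IndexedSplitSketch(split, to_ranges(ids), split_scores))
    return indexed


splits = [DataSplitSketch(0, 100), DataSplitSketch(100, 100), DataSplitSketch(200, 100)]
scores = {3: 0.91, 4: 0.88, 5: 0.80, 10: 0.75, 150: 0.95, 151: 0.60}
indexed = map_to_splits(scores.keys(), splits, scores)
```

In this sketch the third split has no matches, so it produces no indexed split at all, while the first split carries two separate ranges because its matches are non-contiguous.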
Usage
Use IndexedSplit after index evaluation to create targeted splits that read only matching rows, then feed these splits to TableRead for final data retrieval.
Typical workflow:
- Execute index evaluation to obtain GlobalIndexResult with matching row IDs.
- Map row IDs to data splits, creating IndexedSplit instances with row ranges and optional scores.
- Pass IndexedSplit instances to the read pipeline for efficient row retrieval.
- Use get_score() to rank results by similarity when performing vector search.
Theoretical Basis
Index-Assisted Retrieval: Index-assisted retrieval maps from logical row IDs (from the index) to physical data locations (file splits with offsets). The IndexedSplit acts as a filtered view of the underlying split, enabling skip-scan access patterns that read only the rows identified by the index.
Skip-Scan Optimization: A traditional full scan reads every row in a split. With IndexedSplit, the reader can seek to the specific row ranges that contain matches and skip the rest of the data. This is most effective when the index matches a small fraction of the total rows: the number of rows read drops from O(n), where n is the number of rows in the split, to O(k), where k is the number of matched rows.
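The difference between a full scan and a skip-scan can be illustrated with an in-memory list standing in for a split. This is only a sketch under that assumption: `skip_scan` is a hypothetical helper, ranges are half-open positions within the split, and in a real file format the amount of data actually skipped depends on the format's seek granularity.

```python
from typing import Iterator, List, Sequence, Tuple


def skip_scan(
    rows: Sequence[str], ranges: List[Tuple[int, int]]
) -> Iterator[Tuple[int, str]]:
    """Yield (position, row) only for positions inside the half-open ranges."""
    for start, end in ranges:
        # Indexing jumps straight to `start`; rows between ranges are never visited.
        for pos in range(start, min(end, len(rows))):
            yield pos, rows[pos]


rows = [f"row-{i}" for i in range(1_000)]  # n = 1000 rows in the split
ranges = [(3, 6), (500, 502)]              # k = 5 matched rows
hits = list(skip_scan(rows, ranges))
```

Here the loop body runs five times rather than a thousand, which is the O(k) versus O(n) behaviour described above.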
Score-Based Ranking: For vector similarity queries, the scores associated with each matched row enable top-K ranking across multiple splits. After collecting IndexedSplit instances from all shards, the scores can be used to select the globally top-K results by merging sorted score lists.
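A global top-K selection over per-split score lists might look like the following sketch. It assumes that a higher score means a closer match (for a distance metric the ordering would be reversed) and represents each hit as a hypothetical `(score, row_id)` tuple; `global_top_k` is an illustrative name, not part of Paimon. The merge uses `heapq.merge` from the Python standard library.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Hit = Tuple[float, int]  # (similarity score, global row ID)


def global_top_k(per_split_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-split hit lists and keep the k highest-scoring hits."""
    # Sort each split's hits by descending score so they can be merged lazily.
    sorted_lists = [
        sorted(hits, key=lambda h: h[0], reverse=True) for hits in per_split_hits
    ]
    # heapq.merge with reverse=True expects inputs sorted from largest to smallest.
    merged = heapq.merge(*sorted_lists, key=lambda h: h[0], reverse=True)
    # Only the first k merged items are ever materialized.
    return list(islice(merged, k))


per_split_hits = [
    [(0.90, 3), (0.40, 4)],     # hits from the first split
    [(0.95, 150), (0.10, 151)], # hits from the second split
    [],                         # a split with no scored hits
]
best = global_top_k(per_split_hits, 3)
```

Because the merge is lazy, the cost is driven by K and the number of splits rather than by the total number of matched rows, once each per-split list is sorted.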
Range Representation: The Range class represents contiguous row ranges with a start (inclusive) and end (exclusive) boundary. This compact representation is efficient for both storage and membership testing (O(1) per range via bounds comparison). Multiple ranges per split handle non-contiguous matches without materializing individual row IDs.
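The half-open convention and the constant-time membership test can be shown with a small stand-in class. `RowRange` and `contains_any` are hypothetical names chosen to avoid implying that this is Paimon's own Range class; the sketch only follows the inclusive-start, exclusive-end convention stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowRange:
    """Hypothetical contiguous row range [start, end)."""
    start: int  # inclusive
    end: int    # exclusive

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not exceed end")

    def __contains__(self, row_id: int) -> bool:
        # Two comparisons, independent of the range length: O(1).
        return self.start <= row_id < self.end

    def __len__(self) -> int:
        return self.end - self.start


def contains_any(ranges: List[RowRange], row_id: int) -> bool:
    """Membership across several ranges: one O(1) check per range."""
    return any(row_id in r for r in ranges)


# Two ranges describe four matched rows without listing each row ID.
ranges = [RowRange(3, 6), RowRange(10, 11)]
total_rows = sum(len(r) for r in ranges)
```

With an exclusive end, the length of a range is simply `end - start`, and two ranges are adjacent exactly when one's `end` equals the other's `start`.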