Principle:Apache Paimon Indexed Split Result Retrieval

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Vector_Search
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for mapping global index results back to data splits with row ranges and similarity scores for final data retrieval.

Description

After global index evaluation produces matching row IDs, these must be mapped back to the physical data splits that contain those rows. IndexedSplit wraps a regular Split with row ranges (indicating which rows within the split matched) and optional similarity scores.

The mapping process works as follows:

  • Row ID to Split Mapping: Global row IDs are partitioned across data splits, and each split covers a contiguous range of row IDs. The index result (a RoaringBitmap of matching row IDs) is intersected with each split's row range to determine which splits contain matches.
  • Row Ranges: Range objects specify contiguous row ranges within a split that matched the query. Multiple ranges per split are possible when matches are non-contiguous.
  • Similarity Scores: For vector search queries, per-row similarity scores are provided alongside row ranges, enabling result ranking by relevance.
  • Skip-Scan Access: IndexedSplit enables the read pipeline to efficiently skip non-matching rows within each split, reading only the rows identified by the index.
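
The mapping described in the list above can be sketched in a few lines. This is an illustration only, not Paimon's implementation: `SplitSpan`, `coalesce`, and `map_to_splits` are hypothetical names, a plain list of integers stands in for the RoaringBitmap, and the ranges are assumed to be half-open and expressed in global row IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SplitSpan:
    """Hypothetical stand-in: a split covering global row IDs [first_row, end_row)."""
    name: str
    first_row: int
    end_row: int


def coalesce(sorted_ids: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted, distinct row IDs into half-open (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for rid in sorted_ids:
        if ranges and ranges[-1][1] == rid:
            # rid directly follows the previous run, so extend that run by one.
            ranges[-1] = (ranges[-1][0], rid + 1)
        else:
            ranges.append((rid, rid + 1))
    return ranges


def map_to_splits(
    matches: List[int], splits: List[SplitSpan]
) -> Dict[str, List[Tuple[int, int]]]:
    """Intersect matching row IDs with each split's span; omit splits with no match."""
    unique = set(matches)
    result: Dict[str, List[Tuple[int, int]]] = {}
    for split in splits:
        hits = sorted(r for r in unique if split.first_row <= r < split.end_row)
        if hits:
            result[split.name] = coalesce(hits)
    return result


splits = [
    SplitSpan("split-0", 0, 100),
    SplitSpan("split-1", 100, 200),
    SplitSpan("split-2", 200, 300),
]
mapped = map_to_splits([3, 4, 5, 42, 150, 151], splits)
print(mapped)
```

In this toy input, `split-2` contains no matching row and is therefore dropped, while `split-0` receives two ranges because its matches are non-contiguous.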

IndexedSplit delegates all standard Split properties (files, partition, bucket) to the underlying data split, maintaining compatibility with the standard read pipeline.
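
A minimal sketch of this delegation pattern follows. The classes `ToySplit` and `ToyIndexedSplit` and their field names are hypothetical and chosen for illustration; they are not the library's actual types or signatures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToySplit:
    """Hypothetical stand-in for a regular data split."""
    files: Tuple[str, ...]
    partition: str
    bucket: int


@dataclass(frozen=True)
class ToyIndexedSplit:
    """Wraps a split with matched row ranges and optional per-row scores."""
    split: ToySplit
    row_ranges: Tuple[Tuple[int, int], ...]
    scores: Optional[Tuple[float, ...]] = None

    # Standard properties are forwarded to the wrapped split unchanged,
    # so code that only knows about plain splits keeps working.
    @property
    def files(self) -> Tuple[str, ...]:
        return self.split.files

    @property
    def partition(self) -> str:
        return self.split.partition

    @property
    def bucket(self) -> int:
        return self.split.bucket


base = ToySplit(files=("data-0.parquet",), partition="dt=2026-02-07", bucket=3)
indexed = ToyIndexedSplit(split=base, row_ranges=((3, 6), (42, 43)))
print(indexed.files, indexed.partition, indexed.bucket)
```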

Usage

Use IndexedSplit after index evaluation to create targeted splits that read only the matching rows, then pass these splits to TableRead for final data retrieval.

Typical workflow:

  1. Execute index evaluation to obtain GlobalIndexResult with matching row IDs.
  2. Map row IDs to data splits, creating IndexedSplit instances with row ranges and optional scores.
  3. Pass IndexedSplit instances to the read pipeline for efficient row retrieval.
  4. Use get_score() to rank results by similarity when performing vector search.

Theoretical Basis

Index-Assisted Retrieval: The index produces logical row IDs, which must be mapped to physical data locations (file splits with offsets). IndexedSplit acts as a filtered view of the underlying split, enabling skip-scan access patterns that read only the rows identified by the index.

Skip-Scan Optimization: A traditional full scan reads every row in a split. With IndexedSplit, the reader can seek to the specific row ranges that contain matches and skip the rest of the data. This is especially effective when the index matches a small fraction of the total rows: the number of rows read drops from O(n), where n is the number of rows in the split, toward O(k), where k is the number of matched rows.
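
The difference in rows touched can be illustrated with a toy model. Here an in-memory list stands in for a data file and the function names are hypothetical; a real file reader may only be able to skip at a coarser granularity than single rows, so this sketch shows the idea rather than actual I/O behavior.

```python
from typing import Iterator, Sequence, Tuple


def full_scan(rows: Sequence[str]) -> Iterator[Tuple[int, str]]:
    """Visit every row in the split."""
    for i, row in enumerate(rows):
        yield i, row


def skip_scan(
    rows: Sequence[str], ranges: Sequence[Tuple[int, int]]
) -> Iterator[Tuple[int, str]]:
    """Visit only rows inside the half-open (start, end) ranges."""
    for start, end in ranges:
        for i in range(start, end):
            yield i, rows[i]


rows = [f"row-{i}" for i in range(1000)]
ranges = [(10, 13), (500, 501)]

touched_full = sum(1 for _ in full_scan(rows))
selected = list(skip_scan(rows, ranges))
print(touched_full, len(selected))
```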

Score-Based Ranking: For vector similarity queries, the scores associated with each matched row enable top-K ranking across multiple splits. After collecting IndexedSplit instances from all shards, the scores can be used to select the globally top-K results by merging sorted score lists.
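
One way to merge per-split score lists into a global top-K is sketched below using Python's standard `heapq.merge`. The function name and the input layout (a dict of `(row_id, score)` pairs per split) are assumptions made for illustration, and the sketch assumes that a higher score means a closer match; for distance-style metrics the ordering would be reversed.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple


def global_top_k(
    per_split: Dict[str, List[Tuple[int, float]]], k: int
) -> List[Tuple[str, int, float]]:
    """Merge per-split (row_id, score) lists and keep the k highest scores."""
    # Sort each split's candidates by descending score so they can be merged lazily.
    streams = [
        sorted(((score, split_id, row_id) for row_id, score in rows), reverse=True)
        for split_id, rows in per_split.items()
    ]
    # heapq.merge with reverse=True expects inputs sorted from largest to smallest.
    merged = heapq.merge(*streams, reverse=True)
    return [(split_id, row_id, score) for score, split_id, row_id in islice(merged, k)]


per_split = {
    "split-0": [(3, 0.91), (42, 0.40)],
    "split-1": [(150, 0.87), (151, 0.95)],
}
best = global_top_k(per_split, 3)
print(best)
```

Because the merge is lazy, only the first K elements are pulled from the merged stream, so the tail of each sorted list is never visited.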

Range Representation: The Range class represents contiguous row ranges with a start (inclusive) and end (exclusive) boundary. This compact representation is efficient for both storage and membership testing (O(1) per range via bounds comparison). Multiple ranges per split handle non-contiguous matches without materializing individual row IDs.
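
A half-open range with constant-time membership testing can be sketched as follows. `RowRange` and `contains_any` are hypothetical names used for illustration and are not the library's Range class; the sketch simply follows the inclusive-start, exclusive-end convention described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowRange:
    start: int  # inclusive
    end: int    # exclusive

    def __contains__(self, row_id: int) -> bool:
        # Two comparisons, independent of how many rows the range covers.
        return self.start <= row_id < self.end

    def __len__(self) -> int:
        return self.end - self.start


def contains_any(ranges: List[RowRange], row_id: int) -> bool:
    """Membership across several ranges: linear in the number of ranges, not rows."""
    return any(row_id in r for r in ranges)


ranges = [RowRange(3, 6), RowRange(42, 43)]
total_rows = sum(len(r) for r in ranges)
print(contains_any(ranges, 5), contains_any(ranges, 6), total_rows)
```

Two ranges describe four matching rows here without listing the row IDs individually, and row 6 is excluded because the end boundary is exclusive.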
