Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Apache Paimon Global Index Scan Building

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Vector_Search
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for configuring and building scanners that evaluate global indexes across table shards for efficient data retrieval.

Description

The Global Index Scan Builder orchestrates the scanning of global indexes across table data. It provides a builder pattern for configuring the scan scope: snapshot version, partition predicates, and row ranges. The builder produces RowRangeGlobalIndexScanner instances that combine index evaluation with data split mapping.

The scan building process involves several key steps:

  • Snapshot Selection: The builder binds to a specific table snapshot via with_snapshot(), ensuring consistent reads against a point-in-time view of the data.
  • Partition Scoping: Optional partition predicates via with_partition_predicate() restrict scanning to relevant partitions, avoiding full-table index evaluation.
  • Row Range Scoping: The with_row_range() method restricts the scan to a specific contiguous range of rows, enabling shard-level parallelism.
  • Shard Discovery: The shard_list() method returns sorted, non-overlapping Range objects that partition the entire index into independently scannable shards.
  • Parallel Execution: The parallel_scan() static method executes sharded scans concurrently using ThreadPoolExecutor and combines results with GlobalIndexResult.or_() (set union).

Usage

Use when performing indexed lookups (both vector search and predicate-based) on Paimon tables with global indexes. Required as the orchestration layer between index configuration and query execution.

Typical workflow:

  1. Obtain a scan builder from the table via new_global_index_scan_builder().
  2. Configure the builder with the desired snapshot and optional partition predicate.
  3. Call shard_list() to discover available shards for parallel execution.
  4. Call parallel_scan() with the shard list, builder, and query parameters (predicate and/or vector search).
  5. Use the returned GlobalIndexResult to retrieve matching row IDs.

Theoretical Basis

Shard-based scanning partitions the index evaluation workload across independent ranges, enabling linear speedup with parallelism. The builder pattern ensures correct configuration before expensive scan operations.

Key theoretical concepts:

Sharded Index Evaluation: By partitioning the row space into non-overlapping ranges, each shard can be evaluated independently without coordination. This enables embarrassingly parallel execution where each thread handles a disjoint subset of the index.

Result Combination: The scanner combines index results from multiple shards using set union (OR) operations on RoaringBitmap result sets. Since shards are non-overlapping, the union operation is equivalent to concatenation and produces no duplicates.

Snapshot Isolation: Binding to a specific snapshot ensures that the index evaluation sees a consistent view of the data, even as concurrent writes modify the table. This is critical for correctness in the presence of concurrent compactions and writes.

Builder Pattern: The builder pattern provides a fluent API for configuring scan parameters before executing the scan. This ensures all required parameters are set before the expensive operation begins, and enables reuse of the builder configuration across multiple scans with different row ranges.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment