Implementation:Apache Paimon GlobalIndexScanBuilder Build

Knowledge Sources
Domains Data_Lake, Vector_Search
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for building global index scanners with configurable snapshot, partition, and shard scope.

Description

GlobalIndexScanBuilder is an abstract class with methods for configuring scans. GlobalIndexScanBuilderImpl provides the FileStoreTable-backed implementation. The key methods are:

  • with_snapshot(): Binds the builder to a specific snapshot for consistent reads.
  • with_partition_predicate(): Restricts scanning to matching partitions.
  • with_row_range(): Scopes the scan to a specific row range for shard-level parallelism.
  • build(): Creates a RowRangeGlobalIndexScanner scoped to the row range set with with_row_range().
  • shard_list(): Returns sorted, non-overlapping Range objects for parallel scanning.
  • parallel_scan(): Static method that executes sharded scans concurrently via ThreadPoolExecutor and merges results with GlobalIndexResult.or_().
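The page does not say how shard_list() derives its ranges, only that the result is sorted and non-overlapping. As a rough illustration of that property, the sketch below merges overlapping intervals and checks the outcome. The Range class here is a hypothetical stand-in with inclusive bounds, and merge_ranges and is_sorted_non_overlapping are illustrative helpers, not pypaimon functions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True, order=True)
class Range:
    """Hypothetical stand-in for Paimon's Range; inclusive bounds are assumed."""
    start: int
    end: int


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping ranges into a sorted, non-overlapping list."""
    merged: List[Range] = []
    for r in sorted(ranges):
        if merged and r.start <= merged[-1].end:
            # Overlaps the previous range: extend it instead of appending.
            last = merged[-1]
            merged[-1] = Range(last.start, max(last.end, r.end))
        else:
            merged.append(r)
    return merged


def is_sorted_non_overlapping(ranges: List[Range]) -> bool:
    """Check the property the page states for shard_list() output."""
    return all(a.end < b.start for a, b in zip(ranges, ranges[1:]))


shards = merge_ranges([Range(500, 999), Range(0, 499), Range(900, 1499)])
```

Ranges with this property can be scanned independently, since no row ID belongs to two shards.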

The RowRangeGlobalIndexScanner produced by build() combines a GlobalIndexEvaluator with the row range scope. Its scan() method evaluates predicates and vector queries within the configured range, returning an Optional[GlobalIndexResult] containing matching row IDs.
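The fan-out and merge pattern described for parallel_scan() can be sketched with the standard library alone. This is a toy under stated assumptions, not the pypaimon implementation: a row range is an inclusive (start, end) tuple, a result is a frozenset of row IDs standing in for GlobalIndexResult, set union stands in for or_(), and None simply means "this shard produced no result" (the page does not define what None means in the real API).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, FrozenSet, List, Optional, Tuple

RowRange = Tuple[int, int]   # hypothetical stand-in for Range
Result = FrozenSet[int]      # hypothetical stand-in for GlobalIndexResult


def scan_shard(row_range: RowRange, matches: Callable[[int], bool]) -> Optional[Result]:
    """Scan one shard; in this toy, None means the shard had no matches."""
    start, end = row_range
    hits = frozenset(i for i in range(start, end + 1) if matches(i))
    return hits or None


def parallel_scan(
    ranges: List[RowRange],
    matches: Callable[[int], bool],
    thread_num: Optional[int] = None,
) -> Optional[Result]:
    """Scan every shard on a thread pool and OR the per-shard results."""
    with ThreadPoolExecutor(max_workers=thread_num) as pool:
        # Each shard is scanned independently; map keeps the input order.
        shard_results = list(pool.map(lambda r: scan_shard(r, matches), ranges))

    combined: Optional[Result] = None
    for res in shard_results:
        if res is None:
            continue
        # Set union plays the role of GlobalIndexResult.or_().
        combined = res if combined is None else combined | res
    return combined


result = parallel_scan([(0, 9), (10, 19), (20, 29)], lambda i: i % 7 == 0, thread_num=2)
```

Because the shards do not overlap, the union never has to reconcile conflicting answers for the same row ID.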

Usage

Use GlobalIndexScanBuilder as the primary entry point for executing indexed queries against Paimon tables. The builder is obtained from the table and configured with scan parameters before execution.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/globalindex/global_index_scan_builder.py:L31-217
  • File: paimon-python/pypaimon/globalindex/global_index_scan_builder_impl.py:L32-166

Signature

class GlobalIndexScanBuilder(ABC):
    @abstractmethod
    def with_snapshot(self, snapshot_or_id) -> 'GlobalIndexScanBuilder': ...
    @abstractmethod
    def with_partition_predicate(self, partition_predicate) -> 'GlobalIndexScanBuilder': ...
    @abstractmethod
    def with_row_range(self, row_range: Range) -> 'GlobalIndexScanBuilder': ...
    @abstractmethod
    def build(self) -> 'RowRangeGlobalIndexScanner': ...
    @abstractmethod
    def shard_list(self) -> List[Range]: ...

    @staticmethod
    def parallel_scan(
        ranges: List[Range],
        builder: 'GlobalIndexScanBuilder',
        filter_predicate: Optional[Predicate],
        vector_search: Optional[VectorSearch],
        thread_num: Optional[int] = None,
    ) -> Optional[GlobalIndexResult]: ...

class RowRangeGlobalIndexScanner:
    def scan(self, predicate, vector_search) -> Optional[GlobalIndexResult]: ...
    def close(self) -> None: ...
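To make the shape of this interface concrete, here is a minimal in-memory sketch that mirrors the method names. ToyScanBuilder and ToyScanner are hypothetical classes written for illustration; fixed-size sharding, inclusive (start, end) tuples in place of Range, and returning self from the with_* methods are all assumptions rather than documented pypaimon behavior.

```python
from typing import List, Optional, Tuple

RowRange = Tuple[int, int]  # hypothetical stand-in for Range, inclusive bounds


class ToyScanner:
    """Stand-in for RowRangeGlobalIndexScanner: only remembers its scope."""

    def __init__(self, snapshot_id: Optional[int], row_range: RowRange) -> None:
        self.snapshot_id = snapshot_id
        self.row_range = row_range


class ToyScanBuilder:
    """Mirrors the builder's method names over a fixed number of rows."""

    def __init__(self, total_rows: int, shard_size: int) -> None:
        self._total_rows = total_rows
        self._shard_size = shard_size
        self._snapshot_id: Optional[int] = None
        self._row_range: Optional[RowRange] = None

    def with_snapshot(self, snapshot_id: int) -> "ToyScanBuilder":
        self._snapshot_id = snapshot_id
        return self  # returning the builder is what allows chaining

    def with_row_range(self, row_range: RowRange) -> "ToyScanBuilder":
        self._row_range = row_range
        return self

    def shard_list(self) -> List[RowRange]:
        # Fixed-size shards; the last one is clipped to the table size.
        return [
            (start, min(start + self._shard_size, self._total_rows) - 1)
            for start in range(0, self._total_rows, self._shard_size)
        ]

    def build(self) -> ToyScanner:
        if self._row_range is None:
            raise ValueError("with_row_range() must be called before build()")
        return ToyScanner(self._snapshot_id, self._row_range)


builder = ToyScanBuilder(total_rows=25, shard_size=10).with_snapshot(3)
shards = builder.shard_list()
scanner = builder.with_row_range(shards[0]).build()
```

The toy raises ValueError when build() is called without a row range, which matches the "Yes (for build)" requirement listed for row_range in the I/O contract; how the real implementation reacts is not stated on this page.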

Import

from pypaimon.globalindex.global_index_scan_builder import GlobalIndexScanBuilder

I/O Contract

Inputs

Name | Type | Required | Description
snapshot_or_id | Snapshot or int | Yes | Snapshot object or snapshot ID to bind the scan to a consistent point-in-time view
partition_predicate | Predicate | No | Partition filter predicate to restrict scanning to relevant partitions
row_range | Range | Yes (for build) | Row range scope for shard-level scanning
ranges | List[Range] | Yes (for parallel_scan) | List of non-overlapping row ranges from shard_list()
builder | GlobalIndexScanBuilder | Yes (for parallel_scan) | Configured scan builder instance
filter_predicate | Optional[Predicate] | No | Predicate filter for the parallel scan
vector_search | Optional[VectorSearch] | No | Vector similarity query for the parallel scan
thread_num | Optional[int] | No | Number of threads for parallel execution (defaults to system default)

Outputs

Name | Type | Description
build() | RowRangeGlobalIndexScanner | Scanner instance scoped to the configured row range
shard_list() | List[Range] | Sorted, non-overlapping row ranges for parallel scanning
parallel_scan() | Optional[GlobalIndexResult] | Combined result with RoaringBitmap64 of matching row IDs from all shards
scan() | Optional[GlobalIndexResult] | Result from a single shard scan with matching row IDs

Usage Examples

Basic Usage

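from pypaimon.globalindex.global_index_scan_builder import GlobalIndexScanBuilder

# Assumes `table`, `latest_snapshot`, and `vector_query` are already defined.
# Get scan builder from table and bind it to a snapshot
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)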

# Get shard list for parallel execution
shards = scan_builder.shard_list()

# Execute parallel scan with vector search
result = GlobalIndexScanBuilder.parallel_scan(
    ranges=shards,
    builder=scan_builder,
    filter_predicate=None,
    vector_search=vector_query,
    thread_num=4,
)

if result is not None:
    matching_row_ids = result.results()
    print(f"Found {matching_row_ids.get_count()} matching rows")

Single Shard Scan

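# Assumes `table`, `latest_snapshot`, `my_filter`, `my_query`, and the
# `Range` class are already in scope.
# Build scanner for a specific row range
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)
scan_builder = scan_builder.with_row_range(Range(0, 1000000))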

scanner = scan_builder.build()
try:
    result = scanner.scan(predicate=my_filter, vector_search=my_query)
    if result is not None:
        print(f"Shard matched {result.results().get_count()} rows")
finally:
    scanner.close()
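Since the scanner exposes a close() method, the try/finally pattern above could also be written with contextlib.closing from the standard library, which calls close() on exit. The sketch below uses a hypothetical FakeScanner so that it runs without a Paimon table; whether RowRangeGlobalIndexScanner also supports the with statement directly is not stated on this page.

```python
from contextlib import closing


class FakeScanner:
    """Hypothetical stand-in exposing the scan()/close() pair listed above."""

    def __init__(self) -> None:
        self.closed = False

    def scan(self, predicate, vector_search):
        # A real scanner would evaluate the index; this one returns no result.
        return None

    def close(self) -> None:
        self.closed = True


scanner = FakeScanner()
with closing(scanner) as s:
    # close() runs when the block exits, even if scan() raises.
    result = s.scan(predicate=None, vector_search=None)
```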
