Implementation:Apache Paimon GlobalIndexScanBuilder Build

Knowledge Sources	Apache Paimon
Domains	Data_Lake, Vector_Search
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tool for building global index scanners with configurable snapshot, partition, and shard scope.

Description

GlobalIndexScanBuilder is an abstract class with methods for configuring scans. GlobalIndexScanBuilderImpl provides the FileStoreTable-backed implementation. The key methods are:

with_snapshot(): Binds the builder to a specific snapshot for consistent reads.
with_partition_predicate(): Restricts scanning to matching partitions.
with_row_range(): Scopes the scan to a specific row range for shard-level parallelism.
build(): Creates a RowRangeGlobalIndexScanner for a specific row range.
shard_list(): Returns sorted, non-overlapping Range objects for parallel scanning.
parallel_scan(): Static method that executes sharded scans concurrently via ThreadPoolExecutor and merges results with GlobalIndexResult.or_().

The RowRangeGlobalIndexScanner produced by build() combines a GlobalIndexEvaluator with the row range scope. Its scan() method evaluates predicates and vector queries within the configured range, returning an Optional[GlobalIndexResult] containing matching row IDs.

Usage

Use GlobalIndexScanBuilder as the primary entry point for executing indexed queries against Paimon tables. The builder is obtained from the table and configured with scan parameters before execution.

Code Reference

Source Location

Repository: Apache Paimon
File: paimon-python/pypaimon/globalindex/global_index_scan_builder.py:L31-217
File: paimon-python/pypaimon/globalindex/global_index_scan_builder_impl.py:L32-166

Signature

class GlobalIndexScanBuilder(ABC):
    @abstractmethod
    def with_snapshot(self, snapshot_or_id) -> 'GlobalIndexScanBuilder':
    @abstractmethod
    def with_partition_predicate(self, partition_predicate) -> 'GlobalIndexScanBuilder':
    @abstractmethod
    def with_row_range(self, row_range: Range) -> 'GlobalIndexScanBuilder':
    @abstractmethod
    def build(self) -> 'RowRangeGlobalIndexScanner':
    @abstractmethod
    def shard_list(self) -> List[Range]:

    @staticmethod
    def parallel_scan(
        ranges: List[Range],
        builder: 'GlobalIndexScanBuilder',
        filter_predicate: Optional[Predicate],
        vector_search: Optional[VectorSearch],
        thread_num: Optional[int] = None,
    ) -> Optional[GlobalIndexResult]:

class RowRangeGlobalIndexScanner:
    def scan(self, predicate, vector_search) -> Optional[GlobalIndexResult]:
    def close(self) -> None:

Import

from pypaimon.globalindex.global_index_scan_builder import GlobalIndexScanBuilder

I/O Contract

Inputs

Name	Type	Required	Description
snapshot_or_id	Snapshot or int	Yes	Snapshot object or snapshot ID to bind the scan to a consistent point-in-time view
partition_predicate	Predicate	No	Partition filter predicate to restrict scanning to relevant partitions
row_range	Range	Yes (for build)	Row range scope for shard-level scanning
ranges	List[Range]	Yes (for parallel_scan)	List of non-overlapping row ranges from shard_list()
builder	GlobalIndexScanBuilder	Yes (for parallel_scan)	Configured scan builder instance
filter_predicate	Optional[Predicate]	No	Predicate filter for the parallel scan
vector_search	Optional[VectorSearch]	No	Vector similarity query for the parallel scan
thread_num	Optional[int]	No	Number of threads for parallel execution (defaults to system default)

Outputs

Name	Type	Description
build()	RowRangeGlobalIndexScanner	Scanner instance scoped to the configured row range
shard_list()	List[Range]	Sorted, non-overlapping row ranges for parallel scanning
parallel_scan()	Optional[GlobalIndexResult]	Combined result with RoaringBitmap64 of matching row IDs from all shards
scan()	Optional[GlobalIndexResult]	Result from a single shard scan with matching row IDs

Usage Examples

Basic Usage

# Get scan builder from table
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)

# Get shard list for parallel execution
shards = scan_builder.shard_list()

# Execute parallel scan with vector search
result = GlobalIndexScanBuilder.parallel_scan(
    ranges=shards,
    builder=scan_builder,
    filter_predicate=None,
    vector_search=vector_query,
    thread_num=4,
)

if result is not None:
    matching_row_ids = result.results()
    print(f"Found {matching_row_ids.get_count()} matching rows")

Single Shard Scan

# Build scanner for a specific row range
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)
scan_builder = scan_builder.with_row_range(Range(0, 1000000))

scanner = scan_builder.build()
try:
    result = scanner.scan(predicate=my_filter, vector_search=my_query)
    if result is not None:
        print(f"Shard matched {result.results().get_count()} rows")
finally:
    scanner.close()

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment