Implementation:Apache Paimon GlobalIndexScanBuilder Build
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for building global index scanners with configurable snapshot, partition, and shard scope.
Description
GlobalIndexScanBuilder is an abstract class with methods for configuring scans. GlobalIndexScanBuilderImpl provides the FileStoreTable-backed implementation. The key methods are:
- with_snapshot(): Binds the builder to a specific snapshot for consistent reads.
- with_partition_predicate(): Restricts scanning to matching partitions.
- with_row_range(): Scopes the scan to a specific row range for shard-level parallelism.
- build(): Creates a RowRangeGlobalIndexScanner for a specific row range.
- shard_list(): Returns sorted, non-overlapping Range objects for parallel scanning.
- parallel_scan(): Static method that executes sharded scans concurrently via ThreadPoolExecutor and merges results with GlobalIndexResult.or_().
The RowRangeGlobalIndexScanner produced by build() combines a GlobalIndexEvaluator with the row range scope. Its scan() method evaluates predicates and vector queries within the configured range, returning an Optional[GlobalIndexResult] containing matching row IDs.
Usage
Use GlobalIndexScanBuilder as the primary entry point for executing indexed queries against Paimon tables. The builder is obtained from the table and configured with scan parameters before execution.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/globalindex/global_index_scan_builder.py:L31-217
- File: paimon-python/pypaimon/globalindex/global_index_scan_builder_impl.py:L32-166
Signature
class GlobalIndexScanBuilder(ABC):
@abstractmethod
def with_snapshot(self, snapshot_or_id) -> 'GlobalIndexScanBuilder':
@abstractmethod
def with_partition_predicate(self, partition_predicate) -> 'GlobalIndexScanBuilder':
@abstractmethod
def with_row_range(self, row_range: Range) -> 'GlobalIndexScanBuilder':
@abstractmethod
def build(self) -> 'RowRangeGlobalIndexScanner':
@abstractmethod
def shard_list(self) -> List[Range]:
@staticmethod
def parallel_scan(
ranges: List[Range],
builder: 'GlobalIndexScanBuilder',
filter_predicate: Optional[Predicate],
vector_search: Optional[VectorSearch],
thread_num: Optional[int] = None,
) -> Optional[GlobalIndexResult]:
class RowRangeGlobalIndexScanner:
def scan(self, predicate, vector_search) -> Optional[GlobalIndexResult]:
def close(self) -> None:
Import
from pypaimon.globalindex.global_index_scan_builder import GlobalIndexScanBuilder
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| snapshot_or_id | Snapshot or int | Yes | Snapshot object or snapshot ID to bind the scan to a consistent point-in-time view |
| partition_predicate | Predicate | No | Partition filter predicate to restrict scanning to relevant partitions |
| row_range | Range | Yes (for build) | Row range scope for shard-level scanning |
| ranges | List[Range] | Yes (for parallel_scan) | List of non-overlapping row ranges from shard_list() |
| builder | GlobalIndexScanBuilder | Yes (for parallel_scan) | Configured scan builder instance |
| filter_predicate | Optional[Predicate] | No | Predicate filter for the parallel scan |
| vector_search | Optional[VectorSearch] | No | Vector similarity query for the parallel scan |
| thread_num | Optional[int] | No | Number of threads for parallel execution (defaults to system default) |
Outputs
| Name | Type | Description |
|---|---|---|
| build() | RowRangeGlobalIndexScanner | Scanner instance scoped to the configured row range |
| shard_list() | List[Range] | Sorted, non-overlapping row ranges for parallel scanning |
| parallel_scan() | Optional[GlobalIndexResult] | Combined result with RoaringBitmap64 of matching row IDs from all shards |
| scan() | Optional[GlobalIndexResult] | Result from a single shard scan with matching row IDs |
Usage Examples
Basic Usage
# Get scan builder from table
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)
# Get shard list for parallel execution
shards = scan_builder.shard_list()
# Execute parallel scan with vector search
result = GlobalIndexScanBuilder.parallel_scan(
ranges=shards,
builder=scan_builder,
filter_predicate=None,
vector_search=vector_query,
thread_num=4,
)
if result is not None:
matching_row_ids = result.results()
print(f"Found {matching_row_ids.get_count()} matching rows")
Single Shard Scan
# Build scanner for a specific row range
scan_builder = table.new_global_index_scan_builder()
scan_builder = scan_builder.with_snapshot(latest_snapshot)
scan_builder = scan_builder.with_row_range(Range(0, 1000000))
scanner = scan_builder.build()
try:
result = scanner.scan(predicate=my_filter, vector_search=my_query)
if result is not None:
print(f"Shard matched {result.results().get_count()} rows")
finally:
scanner.close()