Principle:Apache Paimon Vector Search Query Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for constructing vector similarity search queries that specify the query vector, result limit, and target field.
Description
A vector search query encapsulates the parameters needed for approximate nearest neighbor search: the query vector (as a list of floats or numpy array), the number of results to return (top-K), and the name of the vector column to search. The query supports optional pre-filtering via include_row_ids (a RoaringBitmap64 of candidate row IDs) and range scoping via offset_range().
The query construction involves several key elements:
- Query Vector: The embedding vector to search for. It can be provided as a Python list of floats or a numpy array, and it is automatically converted to numpy float32 format for compatibility with FAISS.
- Top-K Limit: The number of nearest neighbors to return. Must be a positive integer. Larger values return more results but require more computation.
- Field Name: The name of the vector column in the Paimon table to search against. This identifies which global index to use for the search.
- Pre-filter Bitmap: An optional RoaringBitmap64 of candidate row IDs. When provided, the vector search only considers rows in the bitmap, enabling hybrid search (predicate filtering followed by vector search).
- Range Scoping: The offset_range() method adjusts the query to operate within a specific row range, used for sharded parallel execution.
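The elements listed above can be sketched as a small immutable query object. This is a minimal illustration, not the Paimon class: `VectorSearchSketch` and `create` are hypothetical names, the stdlib `array("f", ...)` stands in for the numpy float32 conversion so the example has no dependencies, and a `frozenset` of integers stands in for the RoaringBitmap64.

```python
from array import array
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, Optional, Sequence


@dataclass(frozen=True)
class VectorSearchSketch:
    """Hypothetical stand-in for the query object (not the Paimon API)."""

    vector: array                 # 32-bit floats (typecode "f")
    limit: int                    # top-K
    field_name: str               # vector column to search
    include_row_ids: Optional[FrozenSet[int]] = None  # stand-in for the bitmap

    @staticmethod
    def create(vector: Sequence[float], limit: int, field_name: str) -> "VectorSearchSketch":
        # Validate up front so a bad query fails before it reaches an index.
        if not isinstance(limit, int) or limit <= 0:
            raise ValueError(f"limit must be a positive integer, got {limit!r}")
        if not field_name:
            raise ValueError("field_name must not be empty")
        if len(vector) == 0:
            raise ValueError("query vector must not be empty")
        # Normalize the input to single-precision floats.
        return VectorSearchSketch(array("f", vector), limit, field_name)

    def with_include_row_ids(self, row_ids: Iterable[int]) -> "VectorSearchSketch":
        # Return a new query; the original stays unchanged.
        return replace(self, include_row_ids=frozenset(row_ids))


query = VectorSearchSketch.create([0.1, 0.2, 0.3], limit=5, field_name="embedding")
filtered = query.with_include_row_ids([3, 7, 42])
```

Keeping the object immutable means a pre-filtered or range-scoped variant can be derived per shard without affecting the query other shards see.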
The visitor pattern (visit method) delegates the actual search to the appropriate GlobalIndexReader implementation, decoupling query specification from search execution.
Usage
Use when performing similarity search on vector-indexed Paimon tables. Construct the query with the embedding vector to search for and the desired number of nearest neighbors.
Typical construction steps:
- Create a VectorSearch with the query vector, limit, and field name.
- Optionally add pre-filtering via with_include_row_ids() for hybrid search.
- Pass the query to a GlobalIndexEvaluator or GlobalIndexScanBuilder for execution.
Theoretical Basis
K-Nearest Neighbor (KNN) Search: KNN search finds the K vectors most similar to a query vector according to a distance metric. For L2 distance, similarity decreases with distance; for inner product, similarity increases with the dot product value. The VectorSearch dataclass captures the query specification, while the actual search algorithm is delegated to the index reader via the visitor pattern.
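To make the two metrics concrete, the following brute-force sketch ranks the same vectors by L2 distance (smallest first) and by inner product (largest first). It is an exact search over a toy list, not the approximate algorithm an index such as FAISS would run; the function names and data are illustrative only.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def knn(query: Sequence[float], vectors: List[Sequence[float]], k: int,
        metric: str = "l2") -> List[Tuple[int, float]]:
    """Return up to k (row_id, score) pairs, best match first."""
    if metric == "l2":
        scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
        # Smaller distance means more similar.
        return heapq.nsmallest(k, scored, key=lambda pair: pair[1])
    if metric == "ip":
        scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
        # Larger dot product means more similar.
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])
    raise ValueError(f"unknown metric: {metric}")


vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.9, 0.1]]
query = [1.0, 0.0]

l2_ids = [row_id for row_id, _ in knn(query, vectors, k=2, metric="l2")]
ip_ids = [row_id for row_id, _ in knn(query, vectors, k=2, metric="ip")]
```

The two metrics pick different neighbors here: L2 favors the rows closest to the query point, while inner product favors row 2 because of its larger magnitude.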
Pre-filtering and Hybrid Search: Pre-filtering via include_row_ids enables hybrid search patterns where structured predicates (e.g., category = 'electronics') narrow the candidate set before vector similarity ranking. This is typically more efficient than post-filtering because the ANN index can skip vectors not in the candidate set. It also avoids the case where post-filtering discards most of the top-K neighbors and returns fewer than K results. The effectiveness depends on the selectivity of the predicate and the index type's support for filtered search.
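A small brute-force sketch shows one practical difference between the two strategies. It illustrates result completeness rather than speed: a plain Python set stands in for the row-ID bitmap, and the `search` helper is hypothetical, not a Paimon or FAISS call.

```python
import heapq
from typing import List, Optional, Sequence, Set


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query: Sequence[float], vectors: List[Sequence[float]], k: int,
           include_row_ids: Optional[Set[int]] = None) -> List[int]:
    """Exact top-k by L2 distance, restricted to include_row_ids when given."""
    candidates = range(len(vectors)) if include_row_ids is None else sorted(include_row_ids)
    scored = [(squared_l2(query, vectors[i]), i) for i in candidates]
    return [i for _, i in heapq.nsmallest(k, scored)]


vectors = [[0.0], [0.1], [0.2], [5.0], [6.0]]
category = ["books", "books", "books", "electronics", "electronics"]
query, k = [0.0], 2

# Structured predicate: category = 'electronics'
allowed = {i for i, c in enumerate(category) if c == "electronics"}

# Pre-filter: rank only the rows that satisfy the predicate.
pre_filtered = search(query, vectors, k, include_row_ids=allowed)

# Post-filter: rank everything, then drop rows that fail the predicate.
post_filtered = [i for i in search(query, vectors, k) if i in allowed]
```

With this data the pre-filtered search returns two matching rows, while the post-filtered search returns none, because the two nearest rows overall are both in the other category.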
Visitor Pattern: The visit() method implements the visitor pattern, allowing different GlobalIndexReader implementations to handle the search differently. This decouples the query representation from the search algorithm, enabling the same query to be executed against FAISS indexes, or potentially other vector index backends, without modification.
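The double dispatch can be sketched as follows. The class names (`ReaderSketch`, `QuerySketch`, `L2Reader`, `InnerProductReader`) and the `visit_vector_search` method are assumptions made for illustration; the actual GlobalIndexReader interface may differ.

```python
import heapq
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence


class ReaderSketch(ABC):
    """Hypothetical reader interface: each backend decides how to search."""

    @abstractmethod
    def visit_vector_search(self, query: "QuerySketch") -> List[int]:
        ...


@dataclass(frozen=True)
class QuerySketch:
    vector: Sequence[float]
    limit: int
    field_name: str

    def visit(self, reader: ReaderSketch) -> List[int]:
        # The query only describes what to search; the reader does the work.
        return reader.visit_vector_search(self)


class L2Reader(ReaderSketch):
    def __init__(self, vectors: List[Sequence[float]]) -> None:
        self.vectors = vectors

    def visit_vector_search(self, query: QuerySketch) -> List[int]:
        scored = [(sum((x - y) ** 2 for x, y in zip(query.vector, v)), i)
                  for i, v in enumerate(self.vectors)]
        return [i for _, i in heapq.nsmallest(query.limit, scored)]


class InnerProductReader(ReaderSketch):
    def __init__(self, vectors: List[Sequence[float]]) -> None:
        self.vectors = vectors

    def visit_vector_search(self, query: QuerySketch) -> List[int]:
        scored = [(sum(x * y for x, y in zip(query.vector, v)), i)
                  for i, v in enumerate(self.vectors)]
        return [i for _, i in heapq.nlargest(query.limit, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.9, 0.1]]
query = QuerySketch(vector=[1.0, 0.0], limit=2, field_name="embedding")

l2_result = query.visit(L2Reader(vectors))
ip_result = query.visit(InnerProductReader(vectors))
```

The same `query` object is passed unchanged to both readers; only the reader implementation determines how the neighbors are ranked.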
Range-Scoped Search: The offset_range() method creates a new query scoped to a specific row range. This enables parallel execution where each shard independently processes its portion of the index and returns results within its range.
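One way to picture range scoping is sketched below. It rests on several assumptions that the text above does not confirm: row ranges are treated as inclusive, candidate row IDs are intersected with the shard's range and shifted to shard-local offsets, and a coordinator merges the per-shard results into a global top-K. All function names here are hypothetical.

```python
import heapq
from typing import List, Sequence, Set, Tuple


def offset_range(include_row_ids: Set[int], start: int, end: int) -> Set[int]:
    """Keep global IDs in [start, end] and shift them to shard-local offsets."""
    return {rid - start for rid in include_row_ids if start <= rid <= end}


def search_shard(query: Sequence[float], shard_vectors: List[Sequence[float]],
                 k: int, local_ids: Set[int]) -> List[Tuple[float, int]]:
    """Exact top-k (distance, local_id) within one shard."""
    scored = [(sum((x - y) ** 2 for x, y in zip(query, shard_vectors[i])), i)
              for i in sorted(local_ids)]
    return heapq.nsmallest(k, scored)


def sharded_search(query: Sequence[float],
                   shards: List[Tuple[int, List[Sequence[float]]]],
                   k: int, include_row_ids: Set[int]) -> List[int]:
    merged: List[Tuple[float, int]] = []
    for start, shard_vectors in shards:
        end = start + len(shard_vectors) - 1
        local_ids = offset_range(include_row_ids, start, end)
        for dist, local_id in search_shard(query, shard_vectors, k, local_ids):
            # Translate the shard-local offset back to a global row ID.
            merged.append((dist, local_id + start))
    return [rid for _, rid in heapq.nsmallest(k, merged)]


# Two shards covering global rows 0..2 and 3..5.
shards = [(0, [[0.0], [1.0], [2.0]]), (3, [[0.5], [9.0], [0.1]])]
candidates = {1, 2, 3, 5}

local_for_second_shard = offset_range(candidates, 3, 5)
top2 = sharded_search([0.0], shards, k=2, include_row_ids=candidates)
```

Each shard must return up to K local results, because any shard could hold all of the global top-K; the merge step then keeps the best K overall.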