Workflow:Apache Paimon Vector Similarity Search
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search, AI_ML |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for performing approximate nearest neighbor (ANN) vector similarity search on Paimon tables using the FAISS global index infrastructure, enabling AI/ML retrieval workloads directly on data lake tables.
Description
This workflow integrates Facebook's FAISS library with Paimon's global index system to support vector similarity search on table data. The global index stores precomputed FAISS indexes alongside the table's data files, enabling efficient top-k nearest neighbor queries without full table scans. The system supports multiple FAISS index types (Flat, IVF, HNSW), distance metrics (L2, inner product), and optional predicate-based pre-filtering using bloom filters. Queries return IndexedSplits that combine matching row ranges with similarity scores, which can then be read to retrieve the corresponding table rows.
Usage
Execute this workflow when you need to find similar items in a Paimon table based on vector embeddings, such as text embeddings, image features, or recommendation vectors. This is the recommended approach for retrieval-augmented generation (RAG), recommendation systems, and semantic search applications built on Paimon data lakes.
Execution Steps
Step 1: Table and Index Configuration
Create or configure a Paimon table with a vector column and enable the global index feature. Define the FAISS index type (Flat for exact search, IVF for large-scale approximate search, HNSW for graph-based search), the distance metric (L2 Euclidean or inner product), and index parameters (nlist for IVF, M and ef_construction for HNSW). The vector column stores fixed-dimension float arrays.
Key considerations:
- Choose index type based on dataset size and latency requirements
- A Flat index is exact, but it compares the query against every stored vector, so latency grows linearly with dataset size
- IVF provides a good accuracy-speed tradeoff for millions of vectors
- HNSW offers fast search, with recall tunable via the ef_search parameter
- Vector dimensions must be consistent across all rows
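
The configuration choices in this step can be sketched as a small validation helper. This is a minimal illustration, not Paimon's configuration API: `VectorIndexConfig`, `check_dimensions`, and the lowercase type and metric names are hypothetical, and real table options may be named differently. It only shows the checks the step describes: each index type needs its own parameters, and every vector must have the configured dimension.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

# Hypothetical names throughout: these are NOT Paimon option keys.
VALID_METRICS = {"l2", "inner_product"}
REQUIRED_PARAMS = {
    "flat": set(),                      # exact search, no build parameters
    "ivf": {"nlist"},                   # number of coarse clusters
    "hnsw": {"M", "ef_construction"},   # graph degree and build-time beam width
}


@dataclass(frozen=True)
class VectorIndexConfig:
    dimension: int
    index_type: str = "flat"
    metric: str = "l2"
    params: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.dimension <= 0:
            raise ValueError("dimension must be positive")
        if self.index_type not in REQUIRED_PARAMS:
            raise ValueError(f"unknown index type: {self.index_type}")
        if self.metric not in VALID_METRICS:
            raise ValueError(f"unknown metric: {self.metric}")
        missing = REQUIRED_PARAMS[self.index_type] - set(self.params)
        if missing:
            raise ValueError(
                f"missing parameters for {self.index_type}: {sorted(missing)}"
            )


def check_dimensions(
    rows: Sequence[Sequence[float]], config: VectorIndexConfig
) -> List[int]:
    """Return positions of rows whose vector length differs from the config."""
    return [i for i, vec in enumerate(rows) if len(vec) != config.dimension]
```

Running `check_dimensions` before writing a batch surfaces rows that would break the fixed-dimension requirement instead of failing later at index build time.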
Step 2: Global Index Scan Builder Configuration
Create a GlobalIndexScanBuilder from the table, specifying the target snapshot, optional partition predicates, and row range constraints. The builder configures how the index files will be scanned. For parallel execution, multiple row ranges can be evaluated concurrently using thread pools.
Key considerations:
- Snapshot selection determines which version of the index to query
- Partition predicates narrow the search to specific data partitions
- Row range constraints enable parallel index evaluation across shards
- The builder produces RowRangeGlobalIndexScanner instances
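
The parallel evaluation described in this step can be sketched with a standard thread pool. This is an illustration under stated assumptions: `split_row_range` and `evaluate_in_parallel` are hypothetical helpers, row ranges are modeled as half-open `(start, end)` tuples, and `scan_shard` stands in for whatever a per-range scanner would do. None of these names come from the Paimon API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple, TypeVar

RowRange = Tuple[int, int]  # assumed convention: inclusive start, exclusive end
T = TypeVar("T")


def split_row_range(start: int, end: int, shards: int) -> List[RowRange]:
    """Split [start, end) into at most `shards` contiguous, non-overlapping ranges."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    total = max(end - start, 0)
    base, extra = divmod(total, shards)
    ranges: List[RowRange] = []
    cursor = start
    for i in range(shards):
        size = base + (1 if i < extra else 0)  # spread the remainder over early shards
        if size == 0:
            break
        ranges.append((cursor, cursor + size))
        cursor += size
    return ranges


def evaluate_in_parallel(
    ranges: List[RowRange],
    scan_shard: Callable[[RowRange], T],
    max_workers: int = 4,
) -> List[T]:
    """Run `scan_shard` once per range; results keep the order of `ranges`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scan_shard, ranges))
```

Because `pool.map` preserves input order, per-shard results can be merged deterministically even though the shards finish at different times.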
Step 3: Vector Search Query Construction
Construct a VectorSearch query specifying the query vector, the number of results to return (top-k), and optional search parameters. The query vector must match the dimension of the indexed vectors. Search parameters tune the accuracy-speed tradeoff at query time (ef_search for HNSW, nprobe for IVF).
Key considerations:
- Query vector dimension must match the index dimension
- Top-k parameter controls how many nearest neighbors are returned
- Higher ef_search or nprobe values improve recall at the cost of latency
- L2 normalization can be applied to query vectors automatically; when the indexed vectors are also unit length, inner product is equivalent to cosine similarity
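
A minimal sketch of query construction follows, covering the dimension check, the top-k bound, and optional L2 normalization. `VectorQuery` and `build_query` are hypothetical stand-ins written for illustration; they do not reproduce the signature of Paimon's VectorSearch class, and the search parameter names are simply passed through as given.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


@dataclass(frozen=True)
class VectorQuery:
    # Hypothetical container, not the Paimon VectorSearch class.
    vector: List[float]
    top_k: int
    search_params: Dict[str, int] = field(default_factory=dict)


def build_query(
    vector: Sequence[float],
    index_dimension: int,
    top_k: int,
    normalize: bool = False,
    **search_params: int,
) -> VectorQuery:
    """Validate the inputs and package them as a query."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"query has {len(vector)} dimensions, index expects {index_dimension}"
        )
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    values = [float(x) for x in vector]
    if normalize:
        values = l2_normalize(values)
    return VectorQuery(values, top_k, dict(search_params))
```

For example, `build_query([3.0, 4.0], index_dimension=2, top_k=5, normalize=True, nprobe=8)` yields a unit-length query vector and carries `nprobe` along as a query-time tuning parameter.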
Step 4: Index Evaluation and Scoring
The GlobalIndexEvaluator loads relevant FAISS index files based on bloom filter pre-checks and row range overlap. For each loaded index, it performs the ANN search, converts raw distances to scores (1/(1+distance) for L2, raw value for inner product), and filters results against any predicate constraints. Results are merged across index files and ranked by score.
Key considerations:
- Index files are loaded lazily and cached for repeated queries
- Bloom filters enable skipping irrelevant index files
- Distance-to-score conversion puts results on a higher-is-better scale, so hits from different index files can be ranked together
- Row ID filtering integrates with predicate-based constraints
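
The scoring and merge logic of this step can be sketched as follows. The conversion formulas are the ones stated above (1/(1+distance) for L2, the raw value for inner product); everything else is an assumption made for illustration, including the function names, the `(row_id, raw_value)` hit representation, and the use of a plain set of allowed row IDs to model predicate filtering.

```python
import heapq
from typing import Iterable, List, Optional, Set, Tuple

Hit = Tuple[int, float]  # (row_id, raw distance or raw inner product)


def to_score(raw: float, metric: str) -> float:
    """Map a raw search result to a higher-is-better score."""
    if metric == "l2":
        return 1.0 / (1.0 + raw)   # distance 0 -> score 1, larger distance -> lower score
    if metric == "inner_product":
        return raw                 # already higher-is-better
    raise ValueError(f"unknown metric: {metric}")


def merge_top_k(
    per_index_hits: Iterable[Iterable[Hit]],
    metric: str,
    k: int,
    allowed_row_ids: Optional[Set[int]] = None,
) -> List[Hit]:
    """Score hits from every index file, drop filtered rows, keep the k best."""
    scored = (
        (row_id, to_score(raw, metric))
        for hits in per_index_hits
        for row_id, raw in hits
        if allowed_row_ids is None or row_id in allowed_row_ids
    )
    # nlargest returns results sorted by score, highest first.
    return heapq.nlargest(k, scored, key=lambda hit: hit[1])
```

Note that filtering here happens after the per-index search, so an index file may contribute fewer than k surviving hits; a real evaluator may handle this differently, for example by passing the filter into the search itself.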
Step 5: Result Retrieval via Indexed Splits
The search produces IndexedSplit objects that wrap standard Paimon splits with additional row range and score information. Use the table's read pipeline to read data from these indexed splits, which automatically applies row-range filtering to return only the matching rows along with their similarity scores.
Key considerations:
- IndexedSplits contain both location (split) and relevance (score) information
- The reader applies row-range filters to avoid reading unnecessary rows
- Results are ordered by similarity score (highest first)
- Standard column projection and predicate pushdown can be applied on top
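
Row-range filtering during the read can be illustrated with an in-memory stand-in. `IndexedSplitSketch` below is a hypothetical model, not the Paimon IndexedSplit class: it assumes a split exposes its first row ID, half-open row ranges, and a row-ID-to-score mapping, and it keeps rows in a Python list instead of reading data files.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Match = Tuple[int, dict, Optional[float]]  # (row_id, row, score)


@dataclass
class IndexedSplitSketch:
    # Hypothetical in-memory model of a split plus index results.
    first_row_id: int
    rows: List[dict]
    row_ranges: List[Tuple[int, int]]  # assumed: inclusive start, exclusive end
    scores: Dict[int, float]


def read_matching(split: IndexedSplitSketch) -> Iterator[Match]:
    """Yield only rows whose ID falls inside one of the split's row ranges."""
    for offset, row in enumerate(split.rows):
        row_id = split.first_row_id + offset
        if any(start <= row_id < end for start, end in split.row_ranges):
            yield row_id, row, split.scores.get(row_id)


def read_ranked(splits: Iterable[IndexedSplitSketch]) -> List[Match]:
    """Collect matches from all splits, highest score first."""
    matches = [match for split in splits for match in read_matching(split)]
    return sorted(
        matches,
        key=lambda m: m[2] if m[2] is not None else float("-inf"),
        reverse=True,
    )
```

This sketch scans every row and tests it against the ranges for clarity; a real reader would seek directly to the ranges so that unneeded rows are never read.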