Workflow:Apache Paimon Vector Similarity Search

Knowledge Sources	Apache Paimon, Paimon Documentation, FAISS Documentation
Domains	Data_Lake, Vector_Search, AI_ML
Last Updated	2026-02-07 23:00 GMT

Overview

End-to-end process for performing approximate nearest neighbor (ANN) vector similarity search on Paimon tables using the FAISS global index infrastructure, enabling AI/ML retrieval workloads directly on data lake tables.

Description

This workflow integrates Facebook's FAISS library with Paimon's global index system to support vector similarity search on table data. The global index stores precomputed FAISS indexes alongside the table's data files, so top-k nearest neighbor queries do not require a full table scan. The system supports multiple FAISS index types (Flat, IVF, HNSW), two distance metrics (L2 and inner product), and optional predicate-based pre-filtering using bloom filters. A query returns IndexedSplit objects that combine matching row ranges with similarity scores; reading those splits retrieves the corresponding table rows.

Usage

Execute this workflow when you need to find similar items in a Paimon table based on vector embeddings such as text embeddings, image features, or recommendation vectors. This is the recommended approach for retrieval-augmented generation (RAG), recommendation systems, and semantic search applications built on Paimon data lakes.

Execution Steps

Step 1: Table and Index Configuration

Create or configure a Paimon table with a vector column and enable the global index feature. Define the FAISS index type (Flat for exact search, IVF for large-scale approximate search, HNSW for graph-based search), the distance metric (L2 Euclidean or inner product), and index parameters (nlist for IVF, M and ef_construction for HNSW). The vector column stores fixed-dimension float arrays.

Key considerations:

Choose index type based on dataset size and latency requirements
Flat index returns exact results but compares the query against every stored vector, so search time grows linearly with dataset size
IVF provides a good accuracy-speed tradeoff for millions of vectors
HNSW offers fast search with tunable recall via the ef_search parameter
Vector dimensions must be consistent across all rows
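To make the "Flat is exact" tradeoff concrete, the following is a minimal pure-Python sketch of what a Flat index does conceptually: validate that every vector has the same dimension, then compare the query against all of them. The function names (flat_search, l2_distance, inner_product) are hypothetical and are not part of the Paimon or FAISS APIs.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(vectors, query, k, metric="l2"):
    """Exhaustive top-k search (hypothetical helper, not a library call).

    Returns a list of (row_id, raw_value) pairs, best match first.
    For "l2" smaller is better; for "ip" larger is better.
    """
    dim = len(query)
    # Vector dimensions must be consistent across all rows.
    for row_id, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(
                f"row {row_id} has dimension {len(vec)}, expected {dim}"
            )

    if metric == "l2":
        scored = [(l2_distance(v, query), rid) for rid, v in enumerate(vectors)]
        scored.sort()  # ascending distance
    elif metric == "ip":
        scored = [(inner_product(v, query), rid) for rid, v in enumerate(vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # descending product
    else:
        raise ValueError(f"unsupported metric: {metric}")

    return [(rid, value) for value, rid in scored[:k]]
```

Every query touches every row, which is why approximate structures such as IVF and HNSW become attractive as the table grows.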

Step 2: Global Index Scan Builder Configuration

Create a GlobalIndexScanBuilder from the table, specifying the target snapshot, optional partition predicates, and row range constraints. The builder configures how the index files will be scanned. For parallel execution, multiple row ranges can be evaluated concurrently using thread pools.

Key considerations:

Snapshot selection determines which version of the index to query
Partition predicates narrow the search to specific data partitions
Row range constraints enable parallel index evaluation across shards
The builder produces RowRangeGlobalIndexScanner instances
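The parallel evaluation of row ranges can be sketched as follows. This is an illustration of the sharding idea only, assuming half-open row ranges [start, end); split_row_range, evaluate_ranges, and the scan_fn callback are hypothetical names, not Paimon's GlobalIndexScanBuilder API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_row_range(start, end, num_shards):
    """Split the half-open range [start, end) into near-equal shards."""
    total = end - start
    base, extra = divmod(total, num_shards)
    ranges = []
    cursor = start
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more shards requested than rows available
        ranges.append((cursor, cursor + size))
        cursor += size
    return ranges


def evaluate_ranges(ranges, scan_fn, max_workers=4):
    """Run scan_fn on each row range concurrently.

    scan_fn stands in for whatever evaluates the index over one range.
    Results come back in the same order as the input ranges.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scan_fn, ranges))
```

For example, split_row_range(0, 10, 3) yields three shards covering rows 0 to 9, and evaluate_ranges applies the per-range scan to each shard on a thread pool.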

Step 3: Vector Search Query Construction

Construct a VectorSearch query specifying the query vector, the number of results to return (top-k), and optional search parameters. The query vector must match the dimension of the indexed vectors. Search parameters tune the accuracy-speed tradeoff at query time (ef_search for HNSW, nprobe for IVF).

Key considerations:

Query vector dimension must match the index dimension
Top-k parameter controls how many nearest neighbors are returned
Higher ef_search or nprobe values improve recall at the cost of latency
L2 normalization can be applied to query vectors automatically; when both the indexed and query vectors are L2-normalized, inner product equals cosine similarity
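The query-side checks above can be sketched in a few lines. This is a hypothetical stand-in for query construction, not the actual VectorSearch class: build_query and its dictionary layout are assumptions made for illustration, and the search parameters are passed through as opaque keyword arguments.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_query(vector, index_dim, k, normalize=False, **search_params):
    """Validate and package a top-k query (hypothetical structure).

    search_params carries query-time tuning knobs such as nprobe (IVF)
    or ef_search (HNSW); larger values trade latency for recall.
    """
    if len(vector) != index_dim:
        raise ValueError(
            f"query has dimension {len(vector)}, index expects {index_dim}"
        )
    if k <= 0:
        raise ValueError("k must be positive")
    if normalize:
        vector = l2_normalize(vector)
    return {"vector": list(vector), "k": k, "params": dict(search_params)}
```

A call such as build_query([3.0, 4.0], index_dim=2, k=5, normalize=True, nprobe=8) rejects mismatched dimensions up front, before any index file is touched.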

Step 4: Index Evaluation and Scoring

The GlobalIndexEvaluator loads relevant FAISS index files based on bloom filter pre-checks and row range overlap. For each loaded index, it performs the ANN search, converts raw distances to scores (1/(1+distance) for L2, raw value for inner product), and filters results against any predicate constraints. Results are merged across index files and ranked by score.

Key considerations:

Index files are loaded lazily and cached for repeated queries
Bloom filters enable skipping irrelevant index files
Distance-to-score conversion ensures that a higher score always means a closer match, regardless of the distance metric
Row ID filtering integrates with predicate-based constraints
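The scoring and merge logic described in this step can be sketched as below. The score formulas (1/(1+distance) for L2, raw value for inner product) come from the text above; everything else, including the names to_score and merge_results and the representation of hits as (row_id, raw_value) pairs, is an assumption for illustration.

```python
import heapq


def to_score(raw, metric):
    """Convert a raw FAISS-style result value into a score (higher is better)."""
    if metric == "l2":
        return 1.0 / (1.0 + raw)  # distance 0 maps to score 1.0
    if metric == "ip":
        return raw  # inner product is already "higher is better"
    raise ValueError(f"unsupported metric: {metric}")


def merge_results(per_file_hits, metric, k, allowed_row_ids=None):
    """Merge hits from several index files into one global top-k.

    per_file_hits: iterable of lists of (row_id, raw_value) pairs.
    allowed_row_ids: optional set of row IDs that pass the predicate.
    Returns (row_id, score) pairs ordered by score, highest first.
    """
    scored = []
    for hits in per_file_hits:
        for row_id, raw in hits:
            if allowed_row_ids is not None and row_id not in allowed_row_ids:
                continue  # row fails the predicate constraint
            scored.append((to_score(raw, metric), row_id))
    top = heapq.nlargest(k, scored)
    return [(row_id, score) for score, row_id in top]
```

Because 1/(1+distance) falls in (0, 1] while inner products are unbounded, scores produced under different metrics are not directly comparable with each other.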

Step 5: Result Retrieval via Indexed Splits

The search produces IndexedSplit objects that wrap standard Paimon splits with additional row range and score information. Use the table's read pipeline to read data from these indexed splits, which automatically applies row-range filtering to return only the matching rows along with their similarity scores.

Key considerations:

IndexedSplits contain both location (split) and relevance (score) information
The reader applies row-range filters to avoid reading unnecessary rows
Results are ordered by similarity score (highest first)
Standard column projection and predicate pushdown can be applied on top
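As a rough model of how row-range filtering limits what the reader touches, the sketch below collapses matching row IDs into contiguous ranges and reads only those rows, attaching each row's score. It uses an in-memory list as a stand-in for a data file; to_row_ranges and read_indexed are hypothetical helpers and do not reflect the real IndexedSplit or reader classes.

```python
def to_row_ranges(row_ids):
    """Collapse row IDs into sorted, inclusive (start, end) ranges."""
    ranges = []
    for rid in sorted(set(row_ids)):
        if ranges and rid == ranges[-1][1] + 1:
            # Extend the current range by one row.
            ranges[-1] = (ranges[-1][0], rid)
        else:
            ranges.append((rid, rid))
    return ranges


def read_indexed(rows, scores):
    """Read only the rows named in `scores` (a dict of row_id -> score).

    rows is a list indexed by row ID, standing in for a data file.
    Returns (row_id, row, score) tuples, highest score first.
    """
    matched = []
    for start, end in to_row_ranges(scores):
        for rid in range(start, end + 1):
            matched.append((rid, rows[rid], scores[rid]))
    matched.sort(key=lambda item: item[2], reverse=True)
    return matched
```

Contiguous ranges let a reader skip whole stretches of a file instead of testing each row individually.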
