Implementation:Apache Paimon FaissVectorIndexOptions Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Vector_Search |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for configuring FAISS vector index parameters in Paimon tables.
Description
FaissVectorIndexOptions is a dataclass that encapsulates all FAISS index configuration. It supports five index types (FLAT, HNSW, IVF, IVF_PQ, IVF_SQ8), two distance metrics (L2, INNER_PRODUCT), and tuning parameters specific to each index type. The from_options() class method builds the configuration from a dictionary of table options, which are stored as table properties under the 'vector.' prefix.
Supporting enums include:
- FaissVectorMetric: Defines the distance metric used for similarity computation (L2 for Euclidean distance, INNER_PRODUCT for dot product, which equals cosine similarity when the vectors are L2-normalized).
- FaissIndexType: Defines the ANN algorithm used for indexing (FLAT, HNSW, IVF, IVF_PQ, IVF_SQ8).
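The link between INNER_PRODUCT and cosine similarity can be checked with a few lines of plain Python. The sketch below uses neither Paimon nor FAISS, and the helper names (l2_normalize, dot, cosine) are illustrative only. It shows that the raw inner product depends on vector length, while the inner product of L2-normalized vectors equals the cosine similarity.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_ip = dot(a, b)                               # 24.0, grows with vector length
unit_ip = dot(l2_normalize(a), l2_normalize(b))  # inner product of unit vectors
cos_ab = cosine(a, b)                            # same value as unit_ip
```

This is why an inner-product index is usually paired with normalized vectors when cosine ranking is wanted.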
The dataclass provides defaults for all parameters, so a configuration can be created without any tuning, while every parameter remains overridable for production workloads. The to_dict() method serializes the configuration back to a dictionary with 'vector.'-prefixed keys for storage as table properties.
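The from_options() / to_dict() round trip can be pictured with a small stand-in. The sketch below is not the Paimon implementation: MiniVectorOptions and the KEYS mapping are hypothetical, only 'vector.dim' and 'vector.metric' are key names that appear on this page ('vector.nprobe' is a guess), and the coercion of string values is an assumption made because table properties are often stored as strings.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict

# Assumed field-name -> option-key mapping (hypothetical, see lead-in).
KEYS = {
    "dimension": "vector.dim",
    "metric": "vector.metric",
    "nprobe": "vector.nprobe",
}


@dataclass
class MiniVectorOptions:
    """Hypothetical stand-in; NOT the real FaissVectorIndexOptions."""
    dimension: int = 128
    metric: str = "L2"
    nprobe: int = 64

    @classmethod
    def from_options(cls, options: Dict[str, Any]) -> "MiniVectorOptions":
        kwargs = {}
        for f in fields(cls):
            key = KEYS[f.name]
            if key in options:
                # Coerce to the type of the field's default, e.g. "768" -> 768.
                kwargs[f.name] = type(f.default)(options[key])
        # Keys that are absent fall back to the dataclass defaults.
        return cls(**kwargs)

    def to_dict(self) -> Dict[str, Any]:
        return {KEYS[f.name]: getattr(self, f.name) for f in fields(self)}


opts = MiniVectorOptions.from_options(
    {"vector.dim": "768", "vector.metric": "INNER_PRODUCT", "unrelated.key": 1}
)
props = opts.to_dict()
```

Unrelated keys are ignored and missing keys keep their defaults, so feeding props back into from_options() yields an equal instance.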
Usage
Use FaissVectorIndexOptions when configuring vector indexes on Paimon tables. The configuration can be created either directly via the constructor or from a dictionary of table options using the from_options() class method.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/globalindex/faiss/faiss_options.py:L26-120
Signature
class FaissVectorMetric(Enum):
    L2 = "L2"
    INNER_PRODUCT = "INNER_PRODUCT"

class FaissIndexType(Enum):
    FLAT = "FLAT"
    HNSW = "HNSW"
    IVF = "IVF"
    IVF_PQ = "IVF_PQ"
    IVF_SQ8 = "IVF_SQ8"

@dataclass
class FaissVectorIndexOptions:
    dimension: int = 128
    metric: FaissVectorMetric = FaissVectorMetric.L2
    index_type: FaissIndexType = FaissIndexType.IVF_SQ8
    m: int = 32  # HNSW connections per layer
    ef_construction: int = 40  # HNSW construction parameter
    ef_search: int = 16  # HNSW search parameter
    nlist: int = 100  # IVF cluster count
    nprobe: int = 64  # IVF search breadth
    pq_m: int = 8  # PQ sub-quantizers
    pq_nbits: int = 8  # PQ bits per sub-quantizer
    size_per_index: int = 2000000  # Vectors per index shard
    training_size: int = 500000  # Vectors for IVF training
    search_factor: int = 10  # Search multiplier for filtering
    normalize: bool = False  # L2 normalize vectors

    @classmethod
    def from_options(cls, options: Dict[str, Any]) -> 'FaissVectorIndexOptions':

    def to_dict(self) -> Dict[str, Any]:
Import
from pypaimon.globalindex.faiss.faiss_options import (
    FaissVectorIndexOptions, FaissVectorMetric, FaissIndexType
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| options | Dict[str, Any] | Yes (for from_options) | Dictionary with 'vector.' prefixed keys (e.g., 'vector.dim', 'vector.metric') |
| dimension | int | No (default 128) | Dimensionality of the embedding vectors |
| metric | FaissVectorMetric | No (default L2) | Distance metric for similarity computation (L2 or INNER_PRODUCT) |
| index_type | FaissIndexType | No (default IVF_SQ8) | ANN index algorithm to use |
| m | int | No (default 32) | HNSW: number of connections per layer |
| ef_construction | int | No (default 40) | HNSW: construction-time search depth |
| ef_search | int | No (default 16) | HNSW: query-time search depth |
| nlist | int | No (default 100) | IVF: number of Voronoi cells (clusters) |
| nprobe | int | No (default 64) | IVF: number of cells to search at query time |
| pq_m | int | No (default 8) | IVF_PQ: number of sub-quantizers |
| pq_nbits | int | No (default 8) | IVF_PQ: bits per sub-quantizer |
| size_per_index | int | No (default 2000000) | Maximum vectors per index shard |
| training_size | int | No (default 500000) | Number of vectors used for IVF training |
| search_factor | int | No (default 10) | Multiplier applied to the search limit in pre-filtering scenarios |
| normalize | bool | No (default False) | Whether to L2-normalize vectors before indexing |
Outputs
| Name | Type | Description |
|---|---|---|
| FaissVectorIndexOptions | dataclass | Fully configured FAISS index options instance |
| to_dict() | Dict[str, Any] | Serialized options with 'vector.' prefixed keys for table property storage |
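The role of search_factor in the Inputs table can be illustrated with a generic over-fetch pattern. How Paimon applies the factor internally is not shown on this page, so the sketch below is an assumption: it fetches limit * search_factor candidates, applies a row filter, and truncates to limit. The function filtered_top_k and the Candidate alias are hypothetical names, and a pre-sorted list stands in for the ANN search.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[int, float]  # (row id, distance)


def filtered_top_k(
    ranked: List[Candidate],
    allowed: Callable[[int], bool],
    limit: int,
    search_factor: int = 10,
) -> List[Candidate]:
    """Over-fetch limit * search_factor candidates, filter, then truncate."""
    fetch = limit * search_factor
    candidates = ranked[:fetch]  # stands in for an ANN search with k=fetch
    kept = [c for c in candidates if allowed(c[0])]
    return kept[:limit]


def is_even(row_id: int) -> bool:
    """Example filter: only even row ids are visible to the query."""
    return row_id % 2 == 0


# 100 candidates already sorted by distance.
ranked = [(i, float(i)) for i in range(100)]

with_factor = filtered_top_k(ranked, is_even, limit=3, search_factor=2)
without_factor = filtered_top_k(ranked, is_even, limit=3, search_factor=1)
```

With a factor of 1 the filter removes one of the three fetched rows and the query comes back short; with a factor of 2 enough candidates survive to fill the limit.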
Usage Examples
Basic Usage
from pypaimon.globalindex.faiss.faiss_options import (
    FaissVectorIndexOptions, FaissVectorMetric, FaissIndexType
)

# Configure from table options dictionary
options = {
    'vector.dim': 768,
    'vector.metric': 'INNER_PRODUCT',
    'vector.index-type': 'HNSW',
    'vector.ef-search': 64,
    'vector.m': 32,
}
faiss_options = FaissVectorIndexOptions.from_options(options)

# Or create directly with constructor
faiss_options = FaissVectorIndexOptions(
    dimension=768,
    metric=FaissVectorMetric.INNER_PRODUCT,
    index_type=FaissIndexType.HNSW,
    ef_search=64,
)

# Serialize back to table properties
props = faiss_options.to_dict()
# {'vector.dim': 768, 'vector.metric': 'INNER_PRODUCT', ...}
IVF_PQ Configuration for Large Datasets
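from pypaimon.globalindex.faiss.faiss_options import (
    FaissVectorIndexOptions, FaissVectorMetric, FaissIndexType
)

# Configure IVF_PQ for a billion-scale dataset with memory constraints
faiss_options = FaissVectorIndexOptions(
    dimension=256,
    metric=FaissVectorMetric.L2,
    index_type=FaissIndexType.IVF_PQ,
    nlist=1024,
    nprobe=32,
    pq_m=16,  # must divide dimension: 256 / 16 = 16 dimensions per sub-quantizer
    pq_nbits=8,
    size_per_index=5000000,
    training_size=1000000,
)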
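A back-of-the-envelope calculation shows what these settings imply. The sketch below is arithmetic only and makes several assumptions not stated on this page: the dataset holds exactly one billion float32 vectors, every shard is filled to size_per_index, and only the PQ codes are counted (vector ids and IVF bookkeeping are ignored).

```python
# Illustrative sizing for the IVF_PQ configuration above (assumptions in lead-in).
num_vectors = 1_000_000_000
dimension = 256
pq_m = 16
pq_nbits = 8
size_per_index = 5_000_000

# Product quantization splits each vector into pq_m equal sub-vectors.
assert dimension % pq_m == 0
sub_dim = dimension // pq_m                      # 16 dimensions per sub-quantizer

# Number of index shards if each holds at most size_per_index vectors.
shards = (num_vectors + size_per_index - 1) // size_per_index   # 200

# Storage per vector: raw float32 versus PQ code.
raw_bytes = dimension * 4                        # 1024 bytes
pq_bytes = (pq_m * pq_nbits + 7) // 8            # 16 bytes
compression = raw_bytes / pq_bytes               # 64.0

total_raw_gib = num_vectors * raw_bytes / 2**30
total_pq_gib = num_vectors * pq_bytes / 2**30
```

Under these assumptions the PQ codes take roughly 15 GiB instead of about 954 GiB for the raw vectors, at the cost of approximate distances.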