

Heuristic:Apache Paimon Vector Index Configuration Tips

From Leeroopedia




Knowledge Sources
Domains: Vector_Search, Optimization
Last Updated: 2026-02-08 00:00 GMT

Overview

The default FAISS index configuration in PyPaimon is IVF_SQ8 with 100 clusters (nlist), 64 probes (nprobe), and 128-dimensional vectors. It trades some search speed for higher recall.

Description

PyPaimon's vector similarity search relies on FAISS indexes with configurable parameters for index type, dimensionality, metric, and search behavior. The defaults are a reasonable starting point for medium-scale datasets. The IVF_SQ8 index type combines scalar quantization, which stores each vector component in 8 bits instead of a 32-bit float (roughly a 4x reduction in vector storage), with inverted file indexing, which restricts each query to a subset of clusters. The default nprobe/nlist ratio of 64/100 means each query visits 64% of the clusters, an aggressive setting that favors recall over speed. For larger datasets, tuning nlist and nprobe is critical for achieving acceptable query latency.
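The 4x figure follows directly from the component width. The sketch below is a back-of-the-envelope estimate only: it counts raw vector storage and ignores ids, cluster centroids, and any other index overhead. The helper name `raw_vector_bytes` is illustrative and not part of PyPaimon or FAISS.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Estimate raw vector storage, ignoring ids and index overhead."""
    return num_vectors * dim * bytes_per_component


# Defaults from this page: 128 dimensions, 2,000,000 vectors per index file.
num_vectors = 2_000_000
dim = 128

float32_bytes = raw_vector_bytes(num_vectors, dim, 4)  # 4 bytes per float32 component
sq8_bytes = raw_vector_bytes(num_vectors, dim, 1)      # 1 byte per 8-bit quantized component

print(f"float32: {float32_bytes / 1e6:.0f} MB")
print(f"SQ8:     {sq8_bytes / 1e6:.0f} MB")
print(f"ratio:   {float32_bytes / sq8_bytes:.0f}x")
```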

Usage

Apply this heuristic when configuring vector similarity search tables or when tuning FAISS index parameters for the Vector_Similarity_Search workflow. It matters most for large embedding collections (millions of vectors), where the default parameters may not give acceptable performance.

The Insight (Rule of Thumb)

  • Index Type: Default `IVF_SQ8` (Inverted File with Scalar Quantization). Options: FLAT, HNSW, IVF, IVF_PQ, IVF_SQ8.
    • FLAT: Exact search, no approximation. Slow for large datasets.
    • HNSW: Graph-based. Good recall, higher memory.
    • IVF_SQ8: Quantized inverted index. Good balance of speed, recall, and memory.
  • Dimension: Default 128. Match to your embedding model output dimension.
  • Metric: Default L2 (Euclidean). Use INNER_PRODUCT for cosine similarity, after normalizing every vector to unit length.
  • IVF Parameters:
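    • nlist=100 (clusters). Increase for larger datasets (rule of thumb: sqrt(N) for N vectors). At the default of 2,000,000 vectors per index, sqrt(N) is about 1,414, well above the default of 100.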
    • nprobe=64 (clusters probed). Higher = better recall, slower search. Default probes 64% of clusters.
  • HNSW Parameters:
    • M=32 (max connections per node). Higher = better recall, more memory.
    • ef_construction=40 (build-time candidate list). Higher = better index quality, slower build.
    • ef_search=16 (search-time candidate list). Higher = better recall, slower search.
  • Vectors Per Index: Default 2,000,000. Controls index file granularity.
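The sqrt(N) rule above can be turned into a starting configuration. This is a minimal sketch under stated assumptions: the option keys are the ones listed under Code Evidence, but the helper `suggest_vector_options`, the 10% `probe_fraction` starting point, and the choice to express values as strings are illustrative assumptions, not PyPaimon behavior. How the resulting dictionary is passed to a table is not shown here.

```python
import math


def suggest_vector_options(num_vectors: int, dim: int, probe_fraction: float = 0.1) -> dict:
    """Build a starting set of vector options for an IVF_SQ8 index.

    nlist follows the sqrt(N) rule of thumb; nprobe is a fraction of nlist
    (hypothetical starting point, to be tuned against measured recall).
    """
    nlist = max(1, round(math.sqrt(num_vectors)))
    nprobe = max(1, min(nlist, round(nlist * probe_fraction)))
    return {
        "vector.index-type": "IVF_SQ8",
        "vector.dim": str(dim),       # must match the embedding model output
        "vector.metric": "L2",
        "vector.nlist": str(nlist),
        "vector.nprobe": str(nprobe),
    }


options = suggest_vector_options(num_vectors=1_000_000, dim=768)
for key, value in options.items():
    print(f"{key} = {value}")
```

Treat the output as a first guess: measure recall and latency on a sample of real queries, then raise nprobe if recall is too low or lower it if queries are too slow.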

Reasoning

IVF_SQ8 is a sensible default because 8-bit scalar quantization cuts vector storage roughly 4x relative to float32 while usually preserving good recall, which makes it practical for production use. The high nprobe/nlist ratio (64/100) suggests the project prioritizes recall over search speed. That is appropriate for analytical workloads, where missing relevant results costs more than slightly slower queries.
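The cost of a high nprobe/nlist ratio can be estimated with simple arithmetic. The sketch below assumes vectors are spread evenly across clusters, which real data rarely satisfies exactly, so the result is an approximation. The helper name `expected_vectors_scanned` is illustrative.

```python
def expected_vectors_scanned(num_vectors: int, nlist: int, nprobe: int) -> int:
    """Approximate vectors compared per query, assuming evenly sized clusters."""
    probed = min(nprobe, nlist)  # probing more clusters than exist has no effect
    return num_vectors * probed // nlist


# Defaults from this page: nlist=100, nprobe=64, 2,000,000 vectors per index.
default_scan = expected_vectors_scanned(2_000_000, nlist=100, nprobe=64)

# Same index size with nlist near sqrt(N) and the same nprobe.
tuned_scan = expected_vectors_scanned(2_000_000, nlist=1414, nprobe=64)

print(default_scan)
print(tuned_scan)
```

With the defaults, a query over a full index compares against about 64% of the stored vectors, so the inverted file prunes relatively little work. Raising nlist while holding nprobe fixed reduces the scanned share, at some cost in recall.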

The HNSW parameters (M=32, ef_construction=40, ef_search=16) match the defaults of FAISS's own HNSW implementation, so they are most likely inherited from the library rather than tuned specifically for Paimon. M=32 is higher than the M=16 default used by some other HNSW libraries such as hnswlib, which spends more memory to gain recall. ef_search=16 is a low, latency-oriented value; raise it if HNSW recall is insufficient.

The 2 million vectors per index file balances index load time against search granularity. Larger indexes amortize overhead but increase memory and load time.
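To see how this setting affects file counts, the sketch below assumes that `vector.size-per-index` acts as a cap on the number of vectors per index file and that files are filled one after another. That reading is an interpretation of the option description, not confirmed behavior, and the helper name `index_files_needed` is illustrative.

```python
def index_files_needed(num_vectors: int, size_per_index: int = 2_000_000) -> int:
    """Number of index files, assuming each file holds at most size_per_index vectors."""
    if num_vectors <= 0:
        return 0
    # Integer ceiling division without floating-point rounding.
    return -(-num_vectors // size_per_index)


print(index_files_needed(10_000_000))
print(index_files_needed(2_000_001))
print(index_files_needed(10_000_000, size_per_index=500_000))
```

Under this assumption, a smaller `vector.size-per-index` produces more, smaller index files, each cheaper to load but each adding per-file overhead at query time.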

Code Evidence

Vector configuration options from `pypaimon/common/options/core_options.py:337-398`:

VECTOR_DIM: ConfigOption[int] = (
    ConfigOptions.key("vector.dim")
    .int_type()
    .default_value(128)
    .with_description("The dimension of the vector.")
)

VECTOR_METRIC: ConfigOption[str] = (
    ConfigOptions.key("vector.metric")
    .string_type()
    .default_value("L2")
    .with_description("The similarity metric for vector search (L2, INNER_PRODUCT).")
)

VECTOR_INDEX_TYPE: ConfigOption[str] = (
    ConfigOptions.key("vector.index-type")
    .string_type()
    .default_value("IVF_SQ8")
    .with_description("The type of FAISS index (FLAT, HNSW, IVF, IVF_PQ, IVF_SQ8).")
)

VECTOR_M: ConfigOption[int] = (
    ConfigOptions.key("vector.m")
    .int_type()
    .default_value(32)
    .with_description("Maximum connections per element in HNSW index.")
)

VECTOR_EF_CONSTRUCTION: ConfigOption[int] = (
    ConfigOptions.key("vector.ef-construction")
    .int_type()
    .default_value(40)
    .with_description("Size of dynamic candidate list during HNSW construction.")
)

VECTOR_EF_SEARCH: ConfigOption[int] = (
    ConfigOptions.key("vector.ef-search")
    .int_type()
    .default_value(16)
    .with_description("Size of dynamic candidate list during HNSW search.")
)

VECTOR_NLIST: ConfigOption[int] = (
    ConfigOptions.key("vector.nlist")
    .int_type()
    .default_value(100)
    .with_description("Number of inverted lists (clusters) for IVF index.")
)

VECTOR_NPROBE: ConfigOption[int] = (
    ConfigOptions.key("vector.nprobe")
    .int_type()
    .default_value(64)
    .with_description("Number of clusters to visit during IVF search.")
)

VECTOR_SIZE_PER_INDEX: ConfigOption[int] = (
    ConfigOptions.key("vector.size-per-index")
    .int_type()
    .default_value(2000000)
    .with_description("Size of vectors stored in each vector index file.")
)
