
Heuristic:Neuml Txtai Faiss Index Sizing Tip

From Leeroopedia



Knowledge Sources
Domains Optimization, Semantic_Search, Vector_Indexing
Last Updated 2026-02-09 17:00 GMT

Overview

Automatic Faiss index structure selection based on dataset size: flat storage for small indexes (<=5000 documents) and IVF clustering for larger datasets, with the IVF cell count computed so that each cluster retains at least 39 training points.

Description

txtai automatically selects the optimal Faiss index structure based on the number of documents being indexed. For small datasets (5000 documents or fewer), it uses a flat `IDMap` index with optional scalar quantization, which provides exact nearest neighbor search without the overhead of clustering. For larger datasets, it switches to an Inverted File (IVF) index with automatically computed cell counts. The number of IVF cells follows the formula `min(4 * sqrt(N), N / 39)`, which ensures each cluster has enough training points (Faiss requires at least 39 points per cluster). The search probe count (nprobe) is also automatically tuned: 6 for small indexes, or `cells / 16` for larger ones.
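The sizing rules above can be sketched as a pair of standalone functions. This is a minimal re-implementation for illustration, not the library's own API surface:

```python
import math

def ivf_cells(count):
    # txtai's heuristic: min(4 * sqrt(N), N / 39), with at least 1 cell,
    # so every cluster keeps >= 39 training points
    return max(min(round(4 * math.sqrt(count)), int(count / 39)), 1)

def default_nprobe(count):
    # Small (flat) indexes default to 6; IVF indexes probe cells / 16
    return 6 if count <= 5000 else round(ivf_cells(count) / 16)

for n in (1_000, 50_000, 1_000_000):
    print(n, ivf_cells(n), default_nprobe(n))
```

For 1,000 documents the `N / 39` cap dominates (25 cells); for 1,000,000 documents the `4 * sqrt(N)` term dominates (4,000 cells, 250 probes).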

Usage

This heuristic applies automatically when building a Faiss-backed embeddings index. Understanding it helps when tuning search accuracy vs speed: increasing nprobe improves recall at the cost of latency. For datasets under 5000 documents, no tuning is needed as the flat index provides exact search.
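One way to override the auto-selection is through the Embeddings configuration. This sketch assumes txtai's Faiss-specific settings live under a `faiss` key, and the model path is illustrative; verify both against your txtai version's documentation:

```python
# Hypothetical override config; keys under "faiss" are assumed to be
# passed through to the Faiss ANN backend
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # illustrative model
    "backend": "faiss",
    "faiss": {
        "components": "IVF256,SQ8",  # force 256 cells + 8-bit quantization
        "nprobe": 32,                # probe more cells for higher recall
    },
}
# from txtai import Embeddings
# embeddings = Embeddings(config)  # would apply the override at index time
```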

The Insight (Rule of Thumb)

  • Action: Let txtai auto-select the Faiss index structure, or override with custom `components` configuration.
  • Threshold: Datasets with <= 5000 documents use flat storage (exact search). Larger datasets use IVF clustering.
  • IVF Cell Formula: `cells = min(4 * sqrt(count), count / 39)`, minimum 1 cell.
  • nprobe Defaults: 6 for small indexes (<=5000), `cells / 16` for larger indexes.
  • Trade-off: IVF is faster but approximate; flat is exact but slower at scale. Higher nprobe increases accuracy but reduces speed.
  • Quantization: Default scalar quantization is SQ8 (8-bit). Boolean `quantize=True` maps to SQ8.
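A worked pass over the rules above, folding the flat/IVF decision into a single component string. This is a self-contained sketch; `SQ8` storage is assumed here to mirror the quantized default:

```python
import math

def faiss_components(count, storage="SQ8"):
    if count <= 5000:
        return f"IDMap,{storage}"  # flat index: exact nearest-neighbor search
    # IVF path: cells = min(4 * sqrt(N), N / 39), minimum 1
    cells = max(min(round(4 * math.sqrt(count)), int(count / 39)), 1)
    return f"IVF{cells},{storage}"

print(faiss_components(3_000))    # IDMap,SQ8
print(faiss_components(100_000))  # IVF1265,SQ8
```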

Reasoning

Faiss IVF indexes partition vectors into Voronoi cells and only search a subset (nprobe) at query time, making search sublinear. However, clustering requires sufficient training data: Faiss warns when it has fewer than 39 training points per centroid, since clusters trained on fewer points are unstable. The `4 * sqrt(N)` formula follows Faiss documentation guidelines for balancing cluster count against dataset size. For small datasets, the overhead of building and searching IVF clusters exceeds the benefit, so flat storage is preferred. The nprobe default of `cells / 16` searches 6.25% of cells, a reasonable accuracy-speed balance.
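The two terms of the cell formula trade off against each other: the `N / 39` cap binds for small datasets, and `4 * sqrt(N)` takes over once `4 * sqrt(N) <= N / 39`, i.e. at `N = (4 * 39)^2 = 24336`. A quick check of that crossover:

```python
import math

# Crossover where 4 * sqrt(N) equals N / 39: N = (4 * 39) ** 2
crossover = (4 * 39) ** 2
print(crossover)  # 24336

# Both terms agree exactly at the crossover point: 624 cells
print(round(4 * math.sqrt(crossover)), int(crossover / 39))
```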

# From src/python/txtai/ann/dense/faiss.py:138-145
# Small index, use storage directly with IDMap
if count <= 5000:
    return "BFlat" if self.qbits else f"IDMap,{storage}"

x = self.cells(train)
components = f"BIVF{x}" if self.qbits else f"IVF{x},{storage}"

# From src/python/txtai/ann/dense/faiss.py:183-185
# Calculate number of IVF cells where x = min(4 * sqrt(embeddings count), embeddings count / 39)
# Faiss requires at least 39 points per cluster
return max(min(round(4 * math.sqrt(count)), int(count / 39)), 1)

# From src/python/txtai/ann/dense/faiss.py:219
default = 6 if count <= 5000 else round(self.cells(count) / 16)
