Heuristic: AnswerDotAI RAGatouille FAISS vs PyTorch KMeans Indexing
| Knowledge Sources | |
|---|---|
| Domains | Indexing, Optimization, Information_Retrieval |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Decision framework for choosing between PyTorch KMeans (default, more compatible) and FAISS KMeans (faster, requires faiss-gpu) when building PLAID indexes.
Description
RAGatouille v0.8.0+ introduced a PyTorch-based KMeans replacement for FAISS during index building. This replacement is activated by default for collections under 75,000 documents when `use_faiss=False` (the default). The PyTorch implementation monkey-patches `CollectionIndexer._train_kmeans` at runtime to avoid requiring FAISS entirely, improving cross-platform compatibility. However, for large collections, FAISS (especially faiss-gpu) is substantially faster and produces better centroid quality.
Usage
Use this heuristic when deciding how to configure the `use_faiss` parameter in RAGPretrainedModel.index() or ColBERT.index(). It applies whenever you are building a new PLAID index and need to balance compatibility against performance.
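A minimal sketch of wiring this heuristic into an indexing call. The helper `index_kwargs`, the model name, and the index name are illustrative (not part of RAGatouille); `use_faiss` is the parameter discussed here, chosen from the collection size.

```python
# Sketch: pick use_faiss from collection size (the 75,000-document
# threshold comes from PLAIDModelIndex.build), then forward it to
# RAGatouille's index() call.
FAISS_THRESHOLD = 75_000

def index_kwargs(collection, index_name):
    """Build keyword arguments for an index() call (illustrative helper)."""
    return {
        "collection": collection,
        "index_name": index_name,
        "use_faiss": len(collection) >= FAISS_THRESHOLD,
    }

# Usage (assumes ragatouille is installed):
#   from ragatouille import RAGPretrainedModel
#   RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
#   RAG.index(**index_kwargs(docs, "my_index"))
```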
The Insight (Rule of Thumb)
- Action: Leave `use_faiss=False` (default) for collections < 75,000 documents. Set `use_faiss=True` for collections >= 75,000 documents or when FAISS is reliably available.
- Value: The 75,000 document threshold is hardcoded in `PLAIDModelIndex.build()`.
- Trade-off: PyTorch KMeans is more compatible (no FAISS dependency) but can be considerably slower and, per the in-code warning, "could cause worse results in some situations". FAISS is faster but requires a correct installation (faiss-cpu or faiss-gpu).
- Fallback: If PyTorch KMeans fails, RAGatouille automatically falls back to FAISS and retries.
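The activation condition can be mirrored as a one-line predicate (a sketch of the condition quoted in the code evidence in the Reasoning section; the function name is illustrative):

```python
def uses_pytorch_kmeans(n_docs: int, use_faiss: bool = False) -> bool:
    """Mirror the gate in PLAIDModelIndex.build(): PyTorch KMeans is
    monkey-patched in only for collections under 75,000 documents when
    use_faiss is left at its default of False."""
    return n_docs < 75_000 and use_faiss is False
```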
Reasoning
The PyTorch KMeans implementation using `fast-pytorch-kmeans` works well for small to medium collections but has not been extensively validated at scale. The warning in the code states this is "experimental" and a "behaviour change from RAGatouille 0.8.0 onwards". The automatic fallback to FAISS on failure provides a safety net. For production deployments with large collections, FAISS with GPU support is recommended.
Code evidence from `ragatouille/models/index.py:186-222`:
```python
monkey_patching = (
    len(collection) < 75000 and kwargs.get("use_faiss", False) is False
)
if monkey_patching:
    print(
        "---- WARNING! You are using PLAID with an experimental replacement for FAISS for greater compatibility ----"
    )
    print("This is a behaviour change from RAGatouille 0.8.0 onwards.")
    print(
        "This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations."
    )
    # ...
    try:
        ...  # attempt the PyTorch-based KMeans indexing path
    except Exception as err:
        print(
            f"PyTorch-based indexing did not succeed with error: {err}",
            "! Reverting to using FAISS and attempting again...",
        )
        monkey_patching = False
```