Heuristic:AnswerDotAI RAGatouille FAISS Vs PyTorch KMeans Indexing

From Leeroopedia
Knowledge Sources
Domains: Indexing, Optimization, Information_Retrieval
Last Updated: 2026-02-12 12:00 GMT

Overview

Decision framework for choosing between PyTorch KMeans (the default, with no FAISS dependency) and FAISS KMeans (faster, but requires a working faiss-cpu or faiss-gpu installation) when building PLAID indexes.

Description

RAGatouille v0.8.0+ introduced a PyTorch-based KMeans replacement for FAISS during index building. The replacement is used for collections under 75,000 documents when `use_faiss=False` (the default). It works by monkey-patching `CollectionIndexer._train_kmeans` at runtime, so indexing does not require FAISS at all, which improves cross-platform compatibility. According to the in-code warning, however, the PyTorch path can be considerably slower than FAISS and could produce worse results in some situations, so FAISS (especially faiss-gpu) remains the better choice for large collections.
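
The runtime replacement described above can be illustrated with a minimal, self-contained sketch. `ToyIndexer`, `torch_train_kmeans`, and `patched_build` are hypothetical names invented for this example; they are not part of RAGatouille or ColBERT, and the method bodies only return labels instead of clustering anything. Restoring the original method afterwards is a choice made in this sketch, not something the source excerpt shows.

```python
class ToyIndexer:
    """Hypothetical stand-in for an indexer class; not the ColBERT API."""

    def _train_kmeans(self, sample):
        # Placeholder for the FAISS-backed implementation.
        return f"faiss-kmeans({len(sample)} vectors)"

    def build(self, sample):
        # The method is looked up at call time, so a patched class
        # attribute is picked up without changing build() itself.
        return self._train_kmeans(sample)


def torch_train_kmeans(self, sample):
    """Hypothetical replacement that would not need FAISS."""
    return f"torch-kmeans({len(sample)} vectors)"


def patched_build(sample, use_replacement):
    """Run ToyIndexer.build(), optionally with the replacement patched in."""
    original = ToyIndexer._train_kmeans
    if use_replacement:
        # Monkey-patch: rebind the class attribute at runtime.
        ToyIndexer._train_kmeans = torch_train_kmeans
    try:
        return ToyIndexer().build(sample)
    finally:
        # Restore the original so the patch does not leak to later callers.
        ToyIndexer._train_kmeans = original
```

Because the patch is applied to the class rather than an instance, every indexer created while it is active uses the replacement, which is why the sketch undoes it in a `finally` block.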

Usage

Use this heuristic when deciding how to configure the `use_faiss` parameter in `RAGPretrainedModel.index()` or `ColBERT.index()`. It applies whenever you are building a new PLAID index and need to balance compatibility against performance.

The Insight (Rule of Thumb)

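  • Action: Leave `use_faiss=False` (default) for collections < 75,000 documents when FAISS may not be installed or working. Set `use_faiss=True` for such collections when FAISS is reliably available. Collections of 75,000 documents or more do not satisfy the PyTorch condition, so they take the FAISS path regardless of the flag and need a working FAISS installation.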
  • Value: The 75,000 document threshold is hardcoded in `PLAIDModelIndex.build()`.
  • Trade-off: PyTorch KMeans is more compatible (no FAISS dependency) but, per the in-code warning, can be considerably slower and could cause worse results in some situations. FAISS is faster but requires a correct installation (faiss-cpu or faiss-gpu).
  • Fallback: If PyTorch KMeans fails, RAGatouille automatically falls back to FAISS and retries.
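
The rule of thumb above can be sketched as a small helper. This is an illustration under stated assumptions, not RAGatouille code: `kmeans_backend` and `faiss_is_importable` are hypothetical names, and the only logic taken from the source is the quoted condition (fewer than 75,000 documents and `use_faiss` left at `False`).

```python
import importlib.util

# Threshold quoted in this article; hardcoded in the library, not configurable here.
PYTORCH_KMEANS_MAX_DOCS = 75_000


def faiss_is_importable():
    """Return True if a module named `faiss` can be found (faiss-cpu or faiss-gpu)."""
    return importlib.util.find_spec("faiss") is not None


def kmeans_backend(n_docs, use_faiss=False):
    """Mirror the quoted condition and report which KMeans path is tried first."""
    if n_docs < PYTORCH_KMEANS_MAX_DOCS and use_faiss is False:
        return "pytorch"
    return "faiss"
```

One possible use is to compute `use_faiss = faiss_is_importable()` before indexing and pass that value as the `use_faiss` argument, so that FAISS is requested only on machines where it can be imported. Note that an importable module is not proof that FAISS works correctly on a given machine.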

Reasoning

The PyTorch KMeans implementation using `fast-pytorch-kmeans` works well for small to medium collections but has not been extensively validated at scale. The warning in the code states this is "experimental" and a "behaviour change from RAGatouille 0.8.0 onwards". The automatic fallback to FAISS on failure provides a safety net. For production deployments with large collections, FAISS with GPU support is recommended.

Code evidence from `ragatouille/models/index.py:186-222`:

monkey_patching = (
    len(collection) < 75000 and kwargs.get("use_faiss", False) is False
)
if monkey_patching:
    print(
        "---- WARNING! You are using PLAID with an experimental replacement for FAISS for greater compatibility ----"
    )
    print("This is a behaviour change from RAGatouille 0.8.0 onwards.")
    print(
        "This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations."
    )
    # ...
    try:
        ...  # attempt PyTorch KMeans indexing (body elided in this excerpt)
    except Exception as err:
        print(
            f"PyTorch-based indexing did not succeed with error: {err}",
            "! Reverting to using FAISS and attempting again...",
        )
        monkey_patching = False
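
The try-then-fall-back control flow in the excerpt can be reproduced in isolation. The sketch below is a simplified illustration, assuming the two indexing routines are passed in as plain callables; `index_with_fallback`, `torch_index`, and `faiss_index` are hypothetical names and do not exist in RAGatouille.

```python
def index_with_fallback(collection, torch_index, faiss_index, use_faiss=False):
    """Try the PyTorch path first when eligible, then fall back to FAISS.

    torch_index and faiss_index are hypothetical callables supplied by the
    caller; each takes the collection and returns some index object.
    """
    # Same eligibility condition as in the quoted source.
    monkey_patching = len(collection) < 75000 and use_faiss is False
    if monkey_patching:
        try:
            return "pytorch", torch_index(collection)
        except Exception as err:
            # Any failure on the PyTorch path triggers a retry with FAISS.
            print(f"PyTorch-based indexing did not succeed with error: {err}")
            monkey_patching = False
    return "faiss", faiss_index(collection)
```

Catching the broad `Exception` class mirrors the excerpt. It keeps indexing going after an unexpected failure, at the cost of hiding the original error unless the printed message is read.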
