Heuristic:AnswerDotAI RAGatouille Searcher Configuration By Collection Size

From Leeroopedia
Knowledge Sources
Domains: Search, Optimization, Information_Retrieval
Last Updated: 2026-02-12 12:00 GMT

Overview

RAGatouille adapts the PLAID search parameters (ncells, centroid_score_threshold, ndocs) to the collection size and the requested k, balancing search speed against recall.

Description

When the PLAID searcher is initialized, RAGatouille automatically configures its ncells, centroid_score_threshold, and ndocs parameters based on the collection size. These parameters control how many centroid clusters are probed during search, what score threshold centroids must exceed to be considered, and how many candidate documents are evaluated. Collections under 100,000 documents probe fewer cells than the base of 16 (8 below 10,000 documents, 4 from 10,000 to 99,999) and use thresholds of 0.4 and 0.45 respectively, both more lenient than the 0.5 used in force_fast mode. Collections of 100,000 documents or more keep the base values. Separately, force_fast mode cuts ncells to 1 and ndocs to 256 and sets the threshold to 0.5, trading recall for speed.

Usage

Use this heuristic when tuning search latency vs recall, or when encountering issues where search returns fewer results than expected for small collections.

The Insight (Rule of Thumb)

  • Normal mode by collection size:
    • All collections: `ndocs=1024` (base)
    • < 10,000 docs: `ncells=8`, `centroid_score_threshold=0.4`
    • 10,000 to 99,999 docs: `ncells=4`, `centroid_score_threshold=0.45`
    • >= 100,000 docs: `ncells=16` (base value; `centroid_score_threshold` is not set in this branch)
  • Force-fast mode:
    • `ncells=1`, `centroid_score_threshold=0.5`, `ndocs=256`
  • Dynamic k adjustment:
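    • If `k > 32 * ncells`, ncells is set to `min(k // 32 + 2, base_ncells)`, so it is capped at `base_ncells` (a value defined outside the excerpt quoted under Reasoning)
    • ndocs is set to `max(k * 4, base_ndocs)`, so at least `4 * k` candidate documents are evaluated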
  • Trade-off: Lower ncells/ndocs = faster search but lower recall. Higher centroid_score_threshold = fewer candidates evaluated.
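
The tiering rule above can be restated as a small pure function, which is convenient for checking which settings a given collection would receive. This is only a sketch: `SearcherParams` and `params_for_collection` are hypothetical names introduced here for illustration, not part of RAGatouille or ColBERT, and a threshold of `None` is used to mean "not set by this heuristic".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearcherParams:
    """Illustrative container; not a RAGatouille or ColBERT type."""
    ncells: int
    ndocs: int
    # None means the heuristic leaves the searcher's existing value alone.
    centroid_score_threshold: Optional[float]


def params_for_collection(num_docs: int, force_fast: bool = False) -> SearcherParams:
    """Return the values the rule of thumb assigns for a collection size."""
    if force_fast:
        # Force-fast mode ignores collection size entirely.
        return SearcherParams(ncells=1, ndocs=256, centroid_score_threshold=0.5)
    if num_docs < 10_000:
        return SearcherParams(ncells=8, ndocs=1024, centroid_score_threshold=0.4)
    if num_docs < 100_000:
        return SearcherParams(ncells=4, ndocs=1024, centroid_score_threshold=0.45)
    # Large collections keep the base ncells and ndocs.
    return SearcherParams(ncells=16, ndocs=1024, centroid_score_threshold=None)


if __name__ == "__main__":
    for n in (500, 50_000, 2_000_000):
        print(n, params_for_collection(n))
```

Note that the tiers are not monotonic in ncells: the middle tier (4) probes fewer cells than the smallest tier (8).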

Reasoning

Small collections have fewer centroids, so probing fewer cells can still cover most of the search space. The centroid_score_threshold is set lower for small collections (0.4 and 0.45, against 0.5 in force-fast mode) so that enough candidates pass the initial filter. Large k values need a larger candidate pool: raising ndocs to at least 4 * k avoids the pathological case where k exceeds the number of candidates gathered from the configured cells. The ncells adjustment is bounded by base_ncells, so it cannot probe more cells than that value allows.

Code evidence from `ragatouille/models/index.py:266-280`:

if not force_fast:
    # Base values applied to every collection in normal mode.
    self.searcher.configure(ndocs=1024)
    self.searcher.configure(ncells=16)
    if len(self.searcher.collection) < 10000:
        # Fewer than 10,000 documents.
        self.searcher.configure(ncells=8)
        self.searcher.configure(centroid_score_threshold=0.4)
    elif len(self.searcher.collection) < 100000:
        # 10,000 to 99,999 documents.
        self.searcher.configure(ncells=4)
        self.searcher.configure(centroid_score_threshold=0.45)
else:
    # force_fast: minimal probing, trading recall for speed.
    self.searcher.configure(ncells=1)
    self.searcher.configure(centroid_score_threshold=0.5)
    self.searcher.configure(ndocs=256)

Dynamic k adjustment from `ragatouille/models/index.py:343-346`:

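# base_ncells and base_ndocs are defined outside this excerpt.
if k > (32 * self.searcher.config.ncells):
    # The min() caps ncells at base_ncells.
    self.searcher.configure(ncells=min((k // 32 + 2), base_ncells))

# Evaluate at least 4 * k candidate documents.
self.searcher.configure(ndocs=max(k * 4, base_ndocs))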
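
To see what the k adjustment does for concrete values, the arithmetic can be isolated in a standalone function. This is a sketch under assumptions: `adjust_for_k` is a hypothetical helper, and `base_ncells` and `base_ndocs` are passed in as plain arguments because their definitions are not part of the excerpt above.

```python
from typing import Tuple


def adjust_for_k(k: int, ncells: int, base_ncells: int, base_ndocs: int) -> Tuple[int, int]:
    """Apply the same arithmetic as the excerpt; returns (ncells, ndocs)."""
    if k > 32 * ncells:
        # min() means the result can never exceed base_ncells.
        ncells = min(k // 32 + 2, base_ncells)
    # The candidate pool is at least 4 * k documents.
    ndocs = max(k * 4, base_ndocs)
    return ncells, ndocs


if __name__ == "__main__":
    # Small k: nothing changes.
    print(adjust_for_k(k=10, ncells=8, base_ncells=8, base_ndocs=1024))
    # Large k with base_ncells equal to the current ncells: only ndocs grows.
    print(adjust_for_k(k=500, ncells=8, base_ncells=8, base_ndocs=1024))
    # Large k with a hypothetical larger base_ncells: ncells rises to the cap.
    print(adjust_for_k(k=500, ncells=4, base_ncells=16, base_ndocs=1024))
```

One consequence of the `min`: if `base_ncells` happens to equal the currently configured ncells, the condition `k > 32 * ncells` implies `k // 32 + 2 > ncells`, so the adjustment leaves ncells unchanged and only ndocs responds to a large k.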
