Heuristic:AnswerDotAI RAGatouille Searcher Configuration By Collection Size
| Knowledge Sources | |
|---|---|
| Domains | Search, Optimization, Information_Retrieval |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Adaptive search parameters (ncells, centroid_score_threshold, ndocs), tuned by collection size and k value to balance search speed against recall.
Description
When the PLAID searcher is initialized, RAGatouille automatically configures the searcher's ncells, centroid_score_threshold, and ndocs parameters based on the collection size. These parameters control how many centroid clusters are probed during search, what score threshold centroids must exceed to be considered, and how many candidate documents are evaluated. Collections under 100,000 documents probe fewer cells than the base value of 16 (8 below 10,000 documents, 4 below 100,000) and use a more lenient threshold (0.4 and 0.45 respectively), while larger collections keep the base ncells and do not override the threshold. Additionally, the force_fast mode drastically reduces ncells and ndocs and sets the strictest threshold (0.5), trading recall for speed.
Usage
Use this heuristic when tuning search latency vs recall, or when encountering issues where search returns fewer results than expected for small collections.
The Insight (Rule of Thumb)
- Normal mode by collection size:
  - All collections: `ndocs=1024`, `ncells=16` (base values, set first)
  - < 10,000 docs: `ncells=8`, `centroid_score_threshold=0.4`
  - 10,000 to 99,999 docs: `ncells=4`, `centroid_score_threshold=0.45`
  - >= 100,000 docs: `ncells=16` (base value kept; `centroid_score_threshold` is not overridden)
- Force-fast mode:
  - `ncells=1`, `centroid_score_threshold=0.5`, `ndocs=256`
- Dynamic k adjustment:
  - If `k > 32 * ncells`, ncells is reconfigured to `min(k // 32 + 2, base_ncells)`, so it never exceeds `base_ncells`
  - ndocs is set to `max(k * 4, base_ndocs)`
- Trade-off: Lower ncells/ndocs = faster search but lower recall. Higher centroid_score_threshold = fewer candidates evaluated.
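The size-based rules above can be restated as a small pure function. This is a minimal sketch for reasoning about the thresholds, not RAGatouille's API: the function name `select_search_params` and the use of `None` to mean "left at whatever the searcher already uses" are assumptions made for illustration.

```python
from typing import Dict, Optional, Union

ParamValue = Optional[Union[int, float]]


def select_search_params(collection_size: int, force_fast: bool = False) -> Dict[str, ParamValue]:
    """Return the search parameters implied by the rule of thumb.

    Hypothetical helper: a centroid_score_threshold of None means the
    rule of thumb does not override the value for that case.
    """
    if force_fast:
        # Force-fast mode: smallest values, strictest threshold.
        return {"ndocs": 256, "ncells": 1, "centroid_score_threshold": 0.5}

    # Base values applied to every collection in normal mode.
    params: Dict[str, ParamValue] = {
        "ndocs": 1024,
        "ncells": 16,
        "centroid_score_threshold": None,
    }
    if collection_size < 10_000:
        params.update(ncells=8, centroid_score_threshold=0.4)
    elif collection_size < 100_000:
        params.update(ncells=4, centroid_score_threshold=0.45)
    return params


if __name__ == "__main__":
    for size in (500, 50_000, 2_000_000):
        print(size, select_search_params(size))
    print("force_fast", select_search_params(500, force_fast=True))
```

Note that the boundaries are strict: a collection of exactly 10,000 documents falls into the middle tier, and one of exactly 100,000 documents keeps the base values.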
Reasoning
Small collections have fewer centroids, so probing fewer cells can still cover most of the search space. The centroid_score_threshold is lowered for small collections to ensure enough candidates pass the initial filter. For large k values, more cells must be probed to gather enough candidate passages. The dynamic adjustment prevents the pathological case where k exceeds the number of candidates from the configured number of cells.
Code evidence from `ragatouille/models/index.py:266-280`:
    if not force_fast:
        self.searcher.configure(ndocs=1024)
        self.searcher.configure(ncells=16)
        if len(self.searcher.collection) < 10000:
            self.searcher.configure(ncells=8)
            self.searcher.configure(centroid_score_threshold=0.4)
        elif len(self.searcher.collection) < 100000:
            self.searcher.configure(ncells=4)
            self.searcher.configure(centroid_score_threshold=0.45)
    else:
        self.searcher.configure(ncells=1)
        self.searcher.configure(centroid_score_threshold=0.5)
        self.searcher.configure(ndocs=256)
Dynamic k adjustment from `ragatouille/models/index.py:343-346`:
    if k > (32 * self.searcher.config.ncells):
        self.searcher.configure(ncells=min((k // 32 + 2), base_ncells))
    self.searcher.configure(ndocs=max(k * 4, base_ndocs))
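To make the arithmetic of the k adjustment concrete, here is a minimal standalone sketch. The function `adjust_for_k` is hypothetical, and `base_ncells` and `base_ndocs` are passed in as plain integers because the excerpt does not show where they are defined; the sketch also assumes that ndocs is updated regardless of whether the ncells condition fires.

```python
from typing import Tuple


def adjust_for_k(k: int, ncells: int, base_ncells: int, base_ndocs: int) -> Tuple[int, int]:
    """Apply the dynamic k rule and return (ncells, ndocs).

    Hypothetical helper mirroring the two configure() calls in the excerpt.
    """
    if k > 32 * ncells:
        # Grow with k, but never beyond the base_ncells cap.
        ncells = min(k // 32 + 2, base_ncells)
    # Evaluate at least four candidates per requested result.
    ndocs = max(k * 4, base_ndocs)
    return ncells, ndocs


if __name__ == "__main__":
    # Small k: nothing changes.
    print(adjust_for_k(k=10, ncells=8, base_ncells=16, base_ndocs=1024))   # (8, 1024)
    # k = 300 > 32 * 8: ncells becomes min(300 // 32 + 2, 16) = 11.
    print(adjust_for_k(k=300, ncells=8, base_ncells=16, base_ndocs=1024))  # (11, 1200)
    # k = 500: 500 // 32 + 2 = 17, capped at base_ncells = 16.
    print(adjust_for_k(k=500, ncells=8, base_ncells=16, base_ndocs=1024))  # (16, 2000)
```

One consequence of the `min` cap is worth checking against the actual source: if `base_ncells` happens to equal the currently configured ncells, then `k // 32 + 2` is always larger than it whenever the condition holds, so ncells stays unchanged and only ndocs grows with k.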