Heuristic:AnswerDotAI RAGatouille Index Rebuild Vs Update Decision
| Knowledge Sources | |
|---|---|
| Domains | Indexing, Optimization, Information_Retrieval |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Decision rule for determining whether to rebuild the entire PLAID index or incrementally update it when adding new documents.
Description
When documents are added to an existing index via `add_to_index()`, RAGatouille uses a heuristic in `PLAIDModelIndex._should_rebuild()` to decide whether a full index rebuild or an incremental update is more efficient. The heuristic considers the ratio of new documents to existing documents and the total resulting collection size. For small indexes, or when the new documents make up a significant proportion of the existing ones, the index is rebuilt, which also yields better index quality. For large indexes with few additions, an incremental update is faster.
Usage
Consult this heuristic when calling `RAGPretrainedModel.add_to_index()` to understand why the library sometimes triggers a full rebuild instead of an incremental update. This matters for production systems where index update latency is a concern.
The Insight (Rule of Thumb)
- Action: The system automatically decides based on two conditions (logical OR):
  - If `current_len + new_doc_len < 5000` → Rebuild (small index, rebuild is cheap)
  - If `new_doc_len > current_len * 0.05` → Rebuild (adding >5% of existing size, centroids should be recomputed)
  - Otherwise → Incremental update via `IndexUpdater`
- Trade-off: Full rebuild produces optimal centroid quality but takes longer. Incremental update is faster but centroids may become stale if too many documents are added without rebuilding.
- Note: Both `add_to_index` and `delete_from_index` are marked as "experimental" in the code.
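The rule above can be checked against a few concrete sizes. The following is a minimal sketch that copies the predicate into a standalone function so it runs without RAGatouille installed; the `decide` helper and the sample sizes are illustrative assumptions, not part of the library.

```python
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Standalone copy of the rule described on this page.
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05


def decide(current_len: int, new_doc_len: int) -> str:
    # Hypothetical helper: name the path the rule would select.
    if should_rebuild(current_len, new_doc_len):
        return "rebuild"
    return "incremental update"


# (existing entries, new entries); sizes chosen only for illustration
cases = [
    (1_000, 10),       # total is below 5000
    (100_000, 6_000),  # new entries exceed 5% of the existing size
    (100_000, 5_000),  # exactly 5%: the comparison is strict
    (4_900, 100),      # total is exactly 5000: the comparison is strict
    (100_000, 100),    # large index, small addition
]

for current_len, new_doc_len in cases:
    print(f"{current_len:>7} + {new_doc_len:>5} -> {decide(current_len, new_doc_len)}")
```

Both comparisons are strict, so a total of exactly 5000 entries or an addition of exactly 5% does not by itself trigger a rebuild.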
Reasoning
PLAID indexes use KMeans centroids to partition the embedding space. When many new documents are added, the centroid distribution may no longer be representative, degrading search quality. For small indexes, the cost of rebuilding is low enough that it is always worthwhile. The 5% threshold balances index freshness against rebuild cost.
The incremental update path uses ColBERT's `IndexUpdater` which adds new document embeddings to the existing inverted lists without recomputing centroids.
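When update latency matters, it can help to know in advance how large a batch may be before the rule switches to a full rebuild. The sketch below derives that bound from the two conditions; `max_incremental_batch` and the two constants are hypothetical names introduced here for illustration, and the predicate is a standalone copy rather than an import from the library.

```python
from typing import Optional

REBUILD_BELOW_TOTAL = 5000  # from the rule: a smaller total always rebuilds
REBUILD_RATIO = 0.05        # from the rule: more than 5% new entries rebuilds


def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    """Standalone copy of the rule, so the sketch runs without RAGatouille."""
    return (
        current_len + new_doc_len < REBUILD_BELOW_TOTAL
        or new_doc_len > current_len * REBUILD_RATIO
    )


def max_incremental_batch(current_len: int) -> Optional[int]:
    """Hypothetical helper: largest batch that stays on the incremental path.

    Returns None when every non-empty batch would trigger a rebuild.
    """
    largest = current_len // 20  # largest n with n <= 5% of current_len
    if largest < 1 or should_rebuild(current_len, largest):
        return None
    return largest


for size in (4_000, 5_000, 100_000):
    print(size, max_incremental_batch(size))
```

For an index that already holds at least 5000 entries, any batch from 1 up to the returned value stays on the incremental path. Below that size, smaller batches can still fall under the 5000-entry total and rebuild, so the helper only reports the upper bound.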
Decision function from `ragatouille/models/index.py:363-368`:
@staticmethod
def _should_rebuild(current_len: int, new_doc_len: int) -> bool:
    """
    Heuristic to determine if it is more efficient to rebuild the index instead of updating it.
    """
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05
Usage in add() from `ragatouille/models/index.py:395-414`:
if PLAIDModelIndex._should_rebuild(
    len(searcher.collection), len(new_collection)
):
    self.build(
        checkpoint=checkpoint,
        collection=collection + new_collection,
        index_name=index_name,
        overwrite="force_silent_overwrite",
        verbose=verbose,
        **kwargs,
    )
else:
    updater = IndexUpdater(
        config=self.config, searcher=searcher, checkpoint=checkpoint
    )
    updater.add(new_collection)
    updater.persist_to_disk()