Heuristic:AnswerDotAI RAGatouille Index Rebuild Vs Update Decision

From Leeroopedia
Knowledge Sources
Domains: Indexing, Optimization, Information_Retrieval
Last Updated: 2026-02-12 12:00 GMT

Overview

Decision rule for determining whether to rebuild the entire PLAID index or incrementally update it when adding new documents.

Description

When documents are added to an existing index via `add_to_index()`, RAGatouille uses a heuristic in `PLAIDModelIndex._should_rebuild()` to choose between a full index rebuild and an incremental update. The heuristic checks two things: the total size of the resulting collection, and the number of new documents relative to the existing collection. For small indexes, or when the new documents make up a significant share of the existing collection, it rebuilds, which yields better index quality. For large indexes with few additions, it updates incrementally, which is faster.

Usage

Consult this heuristic when calling `RAGPretrainedModel.add_to_index()` to understand why the library sometimes triggers a full rebuild instead of an incremental update. This matters most in production systems that are sensitive to index update latency.

The Insight (Rule of Thumb)

  • Action: The system automatically decides based on two conditions (logical OR):
    • If `current_len + new_doc_len < 5000` → Rebuild (small index, rebuild is cheap)
    • If `new_doc_len > current_len * 0.05` → Rebuild (adding >5% of existing size, centroids should be recomputed)
    • Otherwise → Incremental update via IndexUpdater
  • Trade-off: Full rebuild produces optimal centroid quality but takes longer. Incremental update is faster but centroids may become stale if too many documents are added without rebuilding.
  • Note: Both `add_to_index` and `delete_from_index` are marked as "experimental" in the code.
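
The rule above can be restated as a small standalone function that also reports which condition fired. This is a minimal sketch for illustration only: `explain_decision` and the constants `SMALL_INDEX_LIMIT` and `GROWTH_RATIO` are hypothetical names, not part of RAGatouille. Only the two threshold values come from the rule itself.

```python
SMALL_INDEX_LIMIT = 5000  # hypothetical name for the 5000-document threshold
GROWTH_RATIO = 0.05       # hypothetical name for the 5% threshold


def explain_decision(current_len: int, new_doc_len: int) -> str:
    """Return which branch of the rule applies, checked in the same order."""
    if current_len + new_doc_len < SMALL_INDEX_LIMIT:
        return "rebuild: resulting collection is under 5000 documents"
    if new_doc_len > current_len * GROWTH_RATIO:
        return "rebuild: new documents exceed 5% of the existing collection"
    return "incremental update"


if __name__ == "__main__":
    cases = [(1_000, 10), (20_000, 1_500), (20_000, 1_000), (20_000, 1_001)]
    for current, new in cases:
        print(f"{current:>6} existing + {new:>5} new -> {explain_decision(current, new)}")
```

The 5% comparison is strict, so adding exactly 5% of the existing size (1,000 documents to a 20,000-document index) still takes the incremental path, while one more document tips it into a rebuild.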

Reasoning

PLAID indexes use KMeans centroids to partition the embedding space. When many new documents are added, the centroid distribution may no longer be representative, degrading search quality. For small indexes, the cost of rebuilding is low enough that it is always worthwhile. The 5% threshold balances index freshness against rebuild cost.

The incremental update path uses ColBERT's `IndexUpdater` which adds new document embeddings to the existing inverted lists without recomputing centroids.
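
Because the rule of thumb looks only at the sizes involved in a single call, many small additions can grow an index substantially without ever triggering a rebuild. The sketch below simulates that effect. It assumes (this is an assumption, not something stated in the source) that the collection length seen by each call includes documents added by earlier calls. `simulate_additions` is a hypothetical helper, and no real index is touched.

```python
from typing import List, Tuple


def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Standalone copy of the rule: small result, or more than 5% growth.
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05


def simulate_additions(current_len: int, batches: List[int]) -> Tuple[int, List[str]]:
    """Apply the rule to each batch in turn and record the chosen path."""
    decisions = []
    for batch in batches:
        decisions.append("rebuild" if should_rebuild(current_len, batch) else "update")
        current_len += batch  # assumption: later calls see the grown collection
    return current_len, decisions


if __name__ == "__main__":
    # 40,000 documents added to a 100,000-document index in one call...
    print(simulate_additions(100_000, [40_000]))
    # ...versus the same 40,000 documents split into ten calls of 4,000.
    print(simulate_additions(100_000, [4_000] * 10))
```

In this simulation the single large call rebuilds, while the ten smaller calls all take the incremental path even though the index ends up 40% larger, so the centroids are never recomputed along the way.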

Decision function from `ragatouille/models/index.py:363-368`:

@staticmethod
def _should_rebuild(current_len: int, new_doc_len: int) -> bool:
    """
    Heuristic to determine if it is more efficient to rebuild the index instead of updating it.
    """
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05
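
Solving the two conditions for `new_doc_len` gives the batch sizes that stay on the incremental path: the batch must bring the total to at least 5000 documents and must not exceed 5% of the current size. The sketch below works this out for a few index sizes. `incremental_batch_range` is a hypothetical helper written for this page, and `should_rebuild` is a standalone copy of the expression above.

```python
from typing import Optional, Tuple


def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Standalone copy of the expression in _should_rebuild.
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05


def incremental_batch_range(current_len: int) -> Optional[Tuple[int, int]]:
    """Inclusive (smallest, largest) batch size that avoids a rebuild, or None."""
    smallest = max(1, 5000 - current_len)  # total must reach at least 5000
    largest = int(current_len * 0.05)      # batch must not exceed 5% of current size
    return (smallest, largest) if smallest <= largest else None


if __name__ == "__main__":
    for n in (1_000, 4_761, 4_762, 10_000, 100_000):
        print(n, incremental_batch_range(n))
```

Under these thresholds, an index with fewer than 4,762 documents can never be updated incrementally: any batch small enough to pass the 5% test leaves the total under 5000.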

Usage in add() from `ragatouille/models/index.py:395-414`:

if PLAIDModelIndex._should_rebuild(
    len(searcher.collection), len(new_collection)
):
    self.build(
        checkpoint=checkpoint,
        collection=collection + new_collection,
        index_name=index_name,
        overwrite="force_silent_overwrite",
        verbose=verbose,
        **kwargs,
    )
else:
    updater = IndexUpdater(
        config=self.config, searcher=searcher, checkpoint=checkpoint
    )
    updater.add(new_collection)
    updater.persist_to_disk()
