Principle:AnswerDotAI RAGatouille Index Update
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Index_Management |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A dynamic index management mechanism that supports adding new documents to and removing existing documents from a PLAID index without requiring a full rebuild.
Description
Index Update enables incremental modification of a pre-built PLAID index. It supports two operations: adding new documents and deleting existing documents by their IDs. For additions, a heuristic chooses between an incremental update (via IndexUpdater) and a full index rebuild, based on the size of the new documents relative to the existing collection. For deletions, the passages corresponding to the specified document IDs are removed via the IndexUpdater.
Key behaviors:
- Add: New documents are preprocessed, deduplicated against existing documents, and encoded; they are then either added incrementally or used to trigger a full rebuild
- Delete: Document IDs are mapped to passage IDs (PIDs), which are removed from the index
- Both operations update the on-disk metadata (collection.json, pid_docid_map.json, docid_metadata_map.json)
- The rebuild heuristic triggers a full rebuild when the combined size of the existing collection and the additions is under 5000, or when the additions exceed 5% of the existing collection
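The Add bookkeeping can be sketched as follows. This is a minimal illustration, not RAGatouille's implementation: the helper name plan_addition is hypothetical, deduplication is assumed to be by exact passage text, and new PIDs are assumed to continue from the end of the existing collection.

```python
def plan_addition(collection, pid_docid_map, new_passages):
    """Return updated copies of the collection and PID map, plus the new PIDs.

    collection    -- list of passage texts; list position is assumed to be the PID
    pid_docid_map -- dict mapping PID -> document ID
    new_passages  -- list of (document_id, passage_text) pairs
    """
    collection = list(collection)
    pid_docid_map = dict(pid_docid_map)
    seen = set(collection)
    added_pids = []
    for doc_id, text in new_passages:
        if text in seen:
            continue  # duplicate of an existing (or just-added) passage
        seen.add(text)
        pid = len(collection)  # next free PID (assumption of this sketch)
        collection.append(text)
        pid_docid_map[pid] = doc_id
        added_pids.append(pid)
    return collection, pid_docid_map, added_pids


collection, pid_map, added = plan_addition(
    ["alpha", "beta"],
    {0: "doc-a", 1: "doc-b"},
    [("doc-b", "beta"), ("doc-c", "gamma"), ("doc-c", "delta")],
)
```

Here the duplicate passage "beta" is skipped, and the two remaining passages receive PIDs 2 and 3, both mapped to "doc-c".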
Usage
Use this principle when you need to modify an existing index without rebuilding from scratch. Common scenarios:
- Adding newly created documents to a knowledge base
- Removing outdated or deleted documents
- Incremental updates to a production search index
This operation requires a previously built or loaded index.
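A minimal usage sketch, assuming RAGatouille is installed and an index already exists on disk. The wrapper function update_index and its arguments are hypothetical; the from_index, add_to_index, and delete_from_index calls reflect the RAGPretrainedModel API as commonly documented and should be checked against the installed version.

```python
def update_index(index_path, new_texts, new_ids, stale_ids):
    """Load an existing index, add new documents, then delete stale ones."""
    # Imported lazily so this sketch can be loaded without RAGatouille installed.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_index(index_path)
    if new_texts:
        # One document ID per new document; the library decides internally
        # whether to update incrementally or rebuild.
        rag.add_to_index(new_collection=new_texts, new_document_ids=new_ids)
    if stale_ids:
        rag.delete_from_index(document_ids=stale_ids)
    return rag
```

The function would be called with the path of a previously built index, for example update_index(path, ["new text"], ["doc-42"], ["doc-7"]), where the path and IDs are placeholders.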
Theoretical Basis
Incremental index updates in PLAID involve:
Addition:
- New documents are encoded into token embeddings
- Embeddings are assigned to existing centroids
- Residuals are quantized and appended to the inverted lists
- If additions are large relative to the corpus, a full rebuild with new centroids is more efficient
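The addition steps above can be illustrated with a toy index in pure Python. This is a sketch under simplifying assumptions: centroids are fixed two-dimensional vectors, nearest-centroid assignment uses the inner product, and residuals are quantized to one sign bit per dimension. The real PLAID compression scheme and storage layout differ; all names here are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroid(embedding, centroids):
    """Index of the centroid with the highest inner product."""
    return max(range(len(centroids)), key=lambda c: dot(embedding, centroids[c]))


def quantize_residual(embedding, centroid):
    """1-bit sign quantization of (embedding - centroid); a toy stand-in."""
    return [1 if e - c >= 0 else 0 for e, c in zip(embedding, centroid)]


def add_passage(index, pid, token_embeddings):
    """Append one passage's token embeddings to the toy index in place."""
    for emb in token_embeddings:
        cid = nearest_centroid(emb, index["centroids"])
        code = quantize_residual(emb, index["centroids"][cid])
        index["codes"].append((pid, cid, code))
        if pid not in index["ivf"][cid]:
            index["ivf"][cid].append(pid)


index = {
    "centroids": [[1.0, 0.0], [0.0, 1.0]],  # fixed; not retrained on add
    "ivf": {0: [0], 1: [0, 1]},  # centroid id -> PIDs
    "codes": [],  # (pid, centroid id, residual bits)
}
add_passage(index, pid=2, token_embeddings=[[0.9, 0.1], [0.2, 0.8]])
```

Because the centroids never move, a large batch of new embeddings that falls far from them is represented poorly, which is the motivation for the full-rebuild path.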
Deletion:
- Passage IDs (PIDs) corresponding to the target document IDs are identified
- PIDs are removed from the inverted lists
- The collection and metadata mappings are updated
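The deletion steps can be sketched the same way. In this toy version the collection is a dict keyed by PID so that surviving PIDs stay stable; the function name and data layout are hypothetical and do not mirror the IndexUpdater internals.

```python
def delete_documents(ivf, collection, pid_docid_map, doc_ids):
    """Return copies of the structures with all passages of doc_ids removed."""
    targets = set(doc_ids)
    # Step 1: identify the PIDs that belong to the target documents.
    pids = {pid for pid, doc_id in pid_docid_map.items() if doc_id in targets}
    # Step 2: drop those PIDs from every inverted list.
    new_ivf = {cid: [p for p in plist if p not in pids] for cid, plist in ivf.items()}
    # Step 3: update the collection and the PID -> document ID mapping.
    new_collection = {p: t for p, t in collection.items() if p not in pids}
    new_map = {p: d for p, d in pid_docid_map.items() if p not in pids}
    return new_ivf, new_collection, new_map


ivf, collection, pid_map = delete_documents(
    ivf={0: [0, 1, 2], 1: [1, 3]},
    collection={0: "a1", 1: "b1", 2: "b2", 3: "c1"},
    pid_docid_map={0: "a", 1: "b", 2: "b", 3: "c"},
    doc_ids=["b"],
)
```

Deleting document "b" removes PIDs 1 and 2 from both inverted lists and from the metadata, leaving PIDs 0 and 3 untouched.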
Rebuild Heuristic:
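# Pseudo-code for rebuild decision
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Rebuild when the combined collection is small or the additions exceed 5% of the existing collection.
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05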
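A few worked cases make the two conditions concrete. The helper choose_update_path below is a hypothetical restatement of the same rule that returns a label instead of a boolean; the sizes are illustrative only.

```python
def choose_update_path(current_len, new_doc_len):
    """Return 'rebuild' or 'incremental' using the thresholds stated above."""
    small_collection = current_len + new_doc_len < 5000
    large_addition = new_doc_len > current_len * 0.05
    return "rebuild" if small_collection or large_addition else "incremental"


cases = {
    "small collection": choose_update_path(1_000, 10),
    "large index, small batch": choose_update_path(100_000, 4_000),
    "large index, large batch": choose_update_path(100_000, 6_000),
}
```

With 100,000 existing entries the 5% threshold is 5,000, so a batch of 4,000 is added incrementally while a batch of 6,000 triggers a rebuild. The 1,000-entry collection is rebuilt regardless of batch size because the combined total stays under 5,000.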