Principle:AnswerDotAI RAGatouille Index Update

Knowledge Sources	ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT; PLAID: An Efficient Engine for Late Interaction Retrieval; RAGatouille
Domains	NLP, Information_Retrieval, Index_Management
Last Updated	2026-02-12 12:00 GMT

Overview

A dynamic index management mechanism that supports adding new documents to and removing existing documents from a PLAID index without requiring a full rebuild.

Description

Index Update enables incremental modification of a pre-built PLAID index. It supports two operations: adding new documents and deleting existing documents by ID. For additions, a heuristic chooses between an incremental update (via IndexUpdater) and a full index rebuild, based on the size of the new batch relative to the existing collection. For deletions, the passages belonging to the specified document IDs are removed via IndexUpdater.

Key behaviors:

Add: New documents are preprocessed, deduplicated against existing documents, and encoded; they are then either added incrementally or indexed as part of a full rebuild
Delete: Document IDs are mapped to passage IDs (PIDs), which are removed from the index
Both operations update the on-disk metadata (collection.json, pid_docid_map.json, docid_metadata_map.json)
The rebuild heuristic triggers a full rebuild when the collection is small (fewer than 5000 entries in total, existing plus new) or when additions exceed 5% of the existing collection
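
The deduplication and metadata bookkeeping described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name plan_addition, the in-memory dictionaries standing in for collection.json and pid_docid_map.json, and the rule that a new PID is one more than the largest existing PID are all assumptions made for illustration.

```python
def plan_addition(collection, pid_docid_map, new_passages):
    """Filter out already-indexed passages and extend the in-memory maps.

    collection:    dict of PID -> passage text (stand-in for collection.json)
    pid_docid_map: dict of PID -> document ID (stand-in for pid_docid_map.json)
    new_passages:  list of (document_id, passage_text) pairs
    Returns (added, updated_collection, updated_map); the inputs are not mutated.
    """
    existing_texts = set(collection.values())
    updated_collection = dict(collection)
    updated_map = dict(pid_docid_map)
    next_pid = max(collection, default=-1) + 1  # assumed PID allocation rule
    added = []
    for doc_id, text in new_passages:
        if text in existing_texts:
            continue  # skip passages whose text is already in the collection
        existing_texts.add(text)
        updated_collection[next_pid] = text
        updated_map[next_pid] = doc_id
        added.append((next_pid, doc_id, text))
        next_pid += 1
    return added, updated_collection, updated_map
```

Only the passages returned in added would need to be encoded and passed on to the incremental update or the rebuild.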

Usage

Use this principle when you need to modify an existing index without rebuilding from scratch. Common scenarios:

Adding newly created documents to a knowledge base
Removing outdated or deleted documents
Incremental updates to a production search index

This operation requires a previously built or loaded index.
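
A hedged usage sketch follows. It assumes the RAGatouille package is installed and that RAGPretrainedModel exposes from_index, add_to_index (with new_collection and new_document_ids parameters), and delete_from_index (with a document_ids parameter); these signatures should be checked against the installed version. The wrapper name update_index and its arguments are hypothetical.

```python
def update_index(index_path, new_docs, new_ids, stale_ids):
    """Load an existing index, add new documents, then delete stale ones.

    index_path: path to a previously built index (hypothetical argument)
    new_docs:   list of document texts to add
    new_ids:    list of document IDs, one per entry in new_docs
    stale_ids:  list of document IDs to remove
    """
    # Imported lazily so this module can be loaded without the dependency.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_index(index_path)
    if new_docs:
        # The library decides internally between incremental update and rebuild.
        rag.add_to_index(new_collection=new_docs, new_document_ids=new_ids)
    if stale_ids:
        rag.delete_from_index(document_ids=stale_ids)
    return rag
```

Loading with from_index first satisfies the requirement that an index already exists before either operation is called.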

Theoretical Basis

Incremental index updates in PLAID involve:

Addition:

New documents are encoded into token embeddings
Embeddings are assigned to existing centroids
Residuals are quantized and appended to the inverted lists
If additions are large relative to the corpus, a full rebuild with new centroids is more efficient
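
The first three addition steps can be sketched in plain Python. This is a simplified illustration, not PLAID's actual storage format: dot-product similarity for centroid assignment, the bucket cutoffs, the separate codes list, and all function names are assumptions of this sketch.

```python
from bisect import bisect_right


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nearest_centroid(embedding, centroids):
    # Index of the centroid with the highest dot-product similarity.
    return max(range(len(centroids)), key=lambda i: dot(embedding, centroids[i]))


def quantize_residual(residual, cutoffs):
    # Map each residual component to the index of the bucket it falls into.
    return [bisect_right(cutoffs, x) for x in residual]


def add_passage(pid, token_embeddings, centroids, cutoffs, inverted_lists, codes):
    """Assign each token embedding to an existing centroid and store its code."""
    for emb in token_embeddings:
        cid = nearest_centroid(emb, centroids)
        residual = [e - c for e, c in zip(emb, centroids[cid])]
        codes.append((pid, cid, quantize_residual(residual, cutoffs)))
        if pid not in inverted_lists[cid]:
            inverted_lists[cid].append(pid)  # centroid -> passage IDs


centroids = [[1.0, 0.0], [0.0, 1.0]]   # fixed: centroids are not retrained here
cutoffs = [-0.1, 0.0, 0.1]             # three cutoffs give four buckets (2 bits)
inverted_lists = {0: [], 1: []}
codes = []
add_passage(7, [[0.95, 0.2], [0.05, 0.95], [0.8, -0.3]],
            centroids, cutoffs, inverted_lists, codes)
```

Because the centroids stay fixed, a large batch drawn from a different distribution would be represented with larger residuals, which is one reason a full rebuild with new centroids becomes attractive.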

Deletion:

Passage IDs (PIDs) corresponding to the target document IDs are identified
PIDs are removed from the inverted lists
The collection and metadata mappings are updated
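
The deletion steps can be sketched the same way. The dictionaries below are in-memory stand-ins for the index structures and JSON files, keyed by PID so that no renumbering is needed; that layout and the function name delete_documents are assumptions of this sketch, not the library's implementation.

```python
def delete_documents(doc_ids, collection, pid_docid_map, docid_metadata_map, inverted_lists):
    """Remove every passage that belongs to the given document IDs.

    Returns the removed PIDs and updated copies of all four structures.
    """
    targets = set(doc_ids)
    # Step 1: identify the PIDs that belong to the target documents.
    pids = {pid for pid, doc_id in pid_docid_map.items() if doc_id in targets}
    # Step 2: drop those PIDs from every inverted list.
    new_lists = {
        cid: [p for p in plist if p not in pids]
        for cid, plist in inverted_lists.items()
    }
    # Step 3: update the collection and the metadata mappings.
    new_collection = {pid: text for pid, text in collection.items() if pid not in pids}
    new_pid_map = {pid: d for pid, d in pid_docid_map.items() if pid not in pids}
    new_metadata = {d: m for d, m in docid_metadata_map.items() if d not in targets}
    return sorted(pids), new_lists, new_collection, new_pid_map, new_metadata
```

Document IDs that are not present in the maps are simply ignored by this sketch; how the library treats unknown IDs is not covered here.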

Rebuild Heuristic:

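# Pseudo-code for the rebuild decision
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Rebuild if the combined collection is small, or the batch exceeds 5% of the existing collection
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05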
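
A few worked cases make the two thresholds concrete. The function is repeated so the snippet runs on its own; the sample sizes are illustrative only.

```python
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05


# (existing size, batch size) pairs chosen to sit on either side of each threshold
cases = [(1000, 10), (4990, 10), (10000, 400), (10000, 500), (10000, 501)]
decisions = {case: should_rebuild(*case) for case in cases}

for (current_len, new_doc_len), rebuild in decisions.items():
    action = "full rebuild" if rebuild else "incremental update"
    print(f"existing={current_len:>6} new={new_doc_len:>4} -> {action}")
```

Both comparisons are strict: a combined size of exactly 5000 does not count as small, and a batch of exactly 5% of the existing collection still takes the incremental path.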

Related Pages

Implemented By

Implementation:AnswerDotAI_RAGatouille_RAGPretrainedModel_Add_To_Index

Uses Heuristic

Heuristic:AnswerDotAI_RAGatouille_Index_Rebuild_Vs_Update_Decision
