
Principle:AnswerDotAI RAGatouille Index Update

From Leeroopedia
Knowledge Sources
Domains: NLP, Information_Retrieval, Index_Management
Last Updated: 2026-02-12 12:00 GMT

Overview

A dynamic index management mechanism that supports adding new documents to and removing existing documents from a PLAID index without requiring a full rebuild.

Description

Index Update enables incremental modification of a pre-built PLAID index. It supports two operations: adding new documents and deleting existing documents by their IDs. For additions, a heuristic decides between an incremental update (via IndexUpdater) and a full index rebuild, based on the size of the new documents relative to the existing collection. For deletions, the passages belonging to the specified document IDs are removed via the IndexUpdater.

Key behaviors:

  • Add: New documents are preprocessed, deduplicated against existing documents, and encoded; they are then either added incrementally or used to trigger a full rebuild
  • Delete: Document IDs are mapped to passage IDs (PIDs), which are removed from the index
  • Both operations update the on-disk metadata (collection.json, pid_docid_map.json, docid_metadata_map.json)
  • The rebuild heuristic triggers a full rebuild when the existing collection plus the additions totals fewer than 5000 entries, or when the additions exceed 5% of the existing collection
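
The Add behavior can be illustrated with a minimal sketch of deduplication and PID bookkeeping. It assumes that duplicates are detected by exact text match and that a passage's PID is its position in the collection list; the helper name `add_passages` is hypothetical and is not taken from the library.

```python
from typing import Dict, List, Tuple


def add_passages(
    collection: List[str],
    pid_docid_map: Dict[int, str],
    new_passages: List[Tuple[str, str]],
) -> List[int]:
    """Append unseen passages in place and return the PIDs assigned to them.

    new_passages holds (document_id, passage_text) pairs.
    Assumption: a passage is a duplicate if its text already exists verbatim.
    """
    seen = set(collection)
    added_pids = []
    for doc_id, text in new_passages:
        if text in seen:
            continue  # skip passages already present (or repeated in the batch)
        seen.add(text)
        pid = len(collection)  # next free PID is the next list position
        collection.append(text)
        pid_docid_map[pid] = doc_id
        added_pids.append(pid)
    return added_pids


collection = ["PLAID compresses residuals.", "ColBERT uses late interaction."]
pid_docid_map = {0: "doc-a", 1: "doc-b"}
new_passages = [
    ("doc-c", "ColBERT uses late interaction."),    # duplicate of PID 1
    ("doc-c", "Indexes can be updated in place."),  # new
    ("doc-d", "Indexes can be updated in place."),  # duplicate within the batch
]
added = add_passages(collection, pid_docid_map, new_passages)
print(added)  # PIDs of the passages that were actually added
```

In a real index the updated collection and mapping would then be written back to the on-disk metadata files listed above.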

Usage

Use this principle when you need to modify an existing index without rebuilding from scratch. Common scenarios:

  • Adding newly created documents to a knowledge base
  • Removing outdated or deleted documents
  • Incremental updates to a production search index

This operation requires a previously built or loaded index.

Theoretical Basis

Incremental index updates in PLAID involve:

Addition:

  1. New documents are encoded into token embeddings
  2. Embeddings are assigned to existing centroids
  3. Residuals are quantized and appended to the inverted lists
  4. If additions are large relative to the corpus, a full rebuild with newly trained centroids is preferred, since the existing centroids were fit only to the original corpus
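
Steps 2 and 3 can be sketched in pure Python. This is a simplified illustration, not the PLAID implementation: it assumes normalized embeddings (so the nearest centroid is the one with the largest dot product) and uses 1-bit sign quantization as a stand-in for PLAID's multi-bit residual codes. All function names here are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def assign_to_centroid(
    embedding: Sequence[float], centroids: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the nearest centroid's ID and the residual (embedding - centroid)."""
    cid = max(range(len(centroids)), key=lambda i: dot(embedding, centroids[i]))
    residual = [e - c for e, c in zip(embedding, centroids[cid])]
    return cid, residual


def quantize_1bit(residual: Sequence[float]) -> List[int]:
    """Simplified quantizer: keep only the sign of each residual dimension."""
    return [1 if r >= 0 else 0 for r in residual]


def add_passage(
    pid: int,
    token_embeddings: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    inverted_lists: Dict[int, List[int]],
    codes: List[Tuple[int, int, List[int]]],
) -> None:
    """Assign each token to an existing centroid and append to the index."""
    for emb in token_embeddings:
        cid, residual = assign_to_centroid(emb, centroids)
        pids = inverted_lists.setdefault(cid, [])
        if pid not in pids:
            pids.append(pid)  # centroid -> passages containing a token near it
        codes.append((pid, cid, quantize_1bit(residual)))


centroids = [[1.0, 0.0], [0.0, 1.0]]      # existing centroids, left unchanged
inverted_lists = {0: [0], 1: [0, 1]}      # existing inverted lists
codes: List[Tuple[int, int, List[int]]] = []
add_passage(2, [[0.9, 0.1], [0.2, 0.8]], centroids, inverted_lists, codes)
print(inverted_lists)
print(codes)
```

Note that the centroids themselves never change during an incremental add, which is why large additions can drift away from what the centroids represent.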

Deletion:

  1. Passage IDs (PIDs) corresponding to the target document IDs are identified
  2. PIDs are removed from the inverted lists
  3. The collection and metadata mappings are updated
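
The deletion steps above can be sketched as follows. This is a hypothetical illustration under stated assumptions: the collection is modeled as a dict keyed by PID so that removing passages does not renumber the remaining ones, and `delete_documents` is an illustrative helper, not a library function.

```python
from typing import Dict, Iterable, List


def delete_documents(
    doc_ids: Iterable[str],
    collection: Dict[int, str],
    pid_docid_map: Dict[int, str],
    inverted_lists: Dict[int, List[int]],
) -> List[int]:
    """Remove all passages of the given documents; return the removed PIDs."""
    targets = set(doc_ids)
    # Step 1: identify the PIDs that belong to the target document IDs.
    pids = [pid for pid, doc_id in pid_docid_map.items() if doc_id in targets]
    removed = set(pids)
    # Step 2: remove those PIDs from every inverted list.
    for cid in inverted_lists:
        inverted_lists[cid] = [p for p in inverted_lists[cid] if p not in removed]
    # Step 3: update the collection and the PID -> document ID mapping.
    for pid in pids:
        del pid_docid_map[pid]
        collection.pop(pid, None)
    return sorted(pids)


collection = {0: "first passage", 1: "second passage", 2: "third passage"}
pid_docid_map = {0: "doc-a", 1: "doc-a", 2: "doc-b"}
inverted_lists = {0: [0, 2], 1: [1, 2]}
removed_pids = delete_documents(["doc-a"], collection, pid_docid_map, inverted_lists)
print(removed_pids)
```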

Rebuild Heuristic:

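# Rebuild decision: current_len is the existing collection size, new_doc_len the number of additions
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    # Small combined collection: rebuilding is cheap.
    is_small = current_len + new_doc_len < 5000
    # Additions exceed 5% of the existing collection.
    is_large_addition = new_doc_len > current_len * 0.05
    return is_small or is_large_addition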
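
To make the thresholds concrete, the following sketch restates the heuristic and evaluates it on a few illustrative sizes. The input values are made up for demonstration; both comparisons are strict, so sizes that land exactly on a threshold do not trigger a rebuild.

```python
def should_rebuild(current_len: int, new_doc_len: int) -> bool:
    """Restatement of the rebuild heuristic for a self-contained example."""
    return current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05


cases = [
    (10_000, 400),  # 4% of a large collection
    (10_000, 500),  # exactly 5%: the comparison is strict
    (10_000, 600),  # 6% of a large collection
    (3_000, 100),   # combined size is below 5000
    (4_900, 100),   # combined size is exactly 5000, addition is about 2%
]
decisions = [should_rebuild(current, new) for current, new in cases]
for (current, new), rebuild in zip(cases, decisions):
    action = "full rebuild" if rebuild else "incremental update"
    print(f"current={current}, new={new}: {action}")
```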
