Principle: AnswerDotAI RAGatouille Document Indexing
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Indexing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A process that transforms a collection of text documents into a compressed, searchable PLAID index by encoding documents into token-level embeddings and clustering them for efficient retrieval.
Description
Document Indexing is the core offline step in a ColBERT retrieval pipeline. It takes a collection of raw text documents, optionally splits them into smaller passages using a sentence splitter, encodes each passage into contextualized token-level embeddings using the ColBERT encoder, and then builds a PLAID index structure. The PLAID index uses k-means clustering to create centroids over the token embeddings, then stores compressed residual vectors (typically 2-bit or 4-bit quantized) for each token, enabling efficient approximate nearest-neighbor search at query time.
The indexing pipeline involves:
- Corpus preprocessing: splitting documents into passages, assigning IDs
- Token-level encoding: each passage is encoded into a matrix of contextualized token embeddings
- Centroid computation: k-means clustering over all token embeddings
- Residual compression: quantizing the difference between each embedding and its nearest centroid
- Index serialization: writing the compressed index to disk
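The corpus-preprocessing step above (splitting into passages, assigning IDs) can be sketched with a naive splitter. This is a hedged toy: RAGatouille delegates to a proper sentence splitter, and `split_into_passages` is an illustrative name, not the library's API.

```python
import re

def split_into_passages(docs, max_sentences=2):
    """Naively cut each document on sentence-ending punctuation and
    group sentences into passages with globally unique IDs."""
    passages = []
    for doc_id, text in enumerate(docs):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        for i in range(0, len(sentences), max_sentences):
            passages.append({
                "passage_id": len(passages),  # globally unique passage ID
                "doc_id": doc_id,             # back-reference to the source document
                "text": " ".join(sentences[i:i + max_sentences]),
            })
    return passages

docs = [
    "ColBERT encodes tokens. PLAID compresses them. Search is fast.",
    "Indexing is offline. Querying is online.",
]
passages = split_into_passages(docs)
```

Each passage keeps a back-reference to its source document, so search results can be mapped back to full documents after retrieval.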
Usage
Use this principle when you need to build a persistent, searchable index over a document collection. This is the standard approach for:
- Building a retrieval system for RAG (Retrieval-Augmented Generation)
- Creating a semantic search engine over a corpus
- Preparing documents for repeated querying
Document indexing is a one-time offline cost that enables fast online search. For small collections or one-time searches, consider in-memory encoding instead.
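In RAGatouille itself, the whole offline pipeline is driven by a single `index` call. A minimal sketch (assumes the `ragatouille` package is installed and the ColBERTv2 checkpoint can be downloaded; index name and collection contents are placeholders):

```python
from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT checkpoint.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build a PLAID index over the collection; splitting long documents
# into passages is handled for you when split_documents=True.
index_path = RAG.index(
    collection=["Document one text...", "Document two text..."],
    index_name="my_corpus",
    split_documents=True,
    max_document_length=256,  # passage length cap, in tokens
)

# index_path points at the serialized PLAID index on disk; reload it
# later with RAGPretrainedModel.from_index(index_path) for querying.
```

Because indexing is the expensive offline step, the returned on-disk index is what you ship and query repeatedly.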
Theoretical Basis
PLAID indexing involves several stages:
1. Token-level Encoding: Each document d is encoded into a matrix of token embeddings $E_d \in \mathbb{R}^{n \times h}$, where n is the number of tokens and h is the embedding dimension.
2. Centroid Computation: K-means clustering is applied over all token embeddings across the corpus to find C centroids. The number of k-means iterations is adapted to corpus size.
3. Residual Quantization: For each token embedding $e$, the residual from its nearest centroid, $r = e - c_{\sigma(e)}$ (where $c_{\sigma(e)}$ is the centroid closest to $e$), is quantized to b bits (typically 2 or 4).
4. Inverted Index: An inverted list maps each centroid to the set of token embeddings assigned to it, enabling efficient candidate generation during search.
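The centroid, residual-quantization, and inverted-index stages above can be sketched end to end on toy data. This is a hedged illustration, not PLAID's implementation: `h = 8` stands in for ColBERT's 128-dim embeddings, and a single uniform scalar quantizer stands in for PLAID's per-dimension bucket cutoffs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for ColBERT output: 3 passages, each an (n_tokens x h)
# matrix of contextualized token embeddings.
passages = [rng.normal(size=(n, 8)) for n in (5, 7, 6)]
all_tokens = np.concatenate(passages)  # shape (18, 8)

def nearest_centroid(tokens, centroids):
    # Index of the closest centroid for every token embedding.
    d = np.linalg.norm(tokens[:, None] - centroids[None], axis=-1)
    return d.argmin(axis=1)

# --- Centroid computation: a few iterations of k-means over all tokens ---
k = 4
centroids = all_tokens[rng.choice(len(all_tokens), k, replace=False)].copy()
for _ in range(10):
    assign = nearest_centroid(all_tokens, centroids)
    for c in range(k):
        members = all_tokens[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
assign = nearest_centroid(all_tokens, centroids)  # final assignment

# --- Residual quantization: quantize (embedding - centroid) to b bits ---
b = 2
residuals = all_tokens - centroids[assign]
lo, hi = residuals.min(), residuals.max()
levels = 2 ** b
q = np.clip(((residuals - lo) / (hi - lo) * (levels - 1)).round(),
            0, levels - 1).astype(np.uint8)
dequant = lo + q / (levels - 1) * (hi - lo)
approx = centroids[assign] + dequant  # reconstructed embeddings

# --- Inverted index: centroid id -> token ids assigned to it ---
inverted = {c: np.flatnonzero(assign == c).tolist() for c in range(k)}
```

At query time, candidate generation walks the inverted lists of the centroids closest to each query token, then scores candidates from the decompressed (centroid + dequantized residual) embeddings.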