Principle: AnswerDotAI RAGatouille In-Memory Document Encoding
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Encoding |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
An index-free document encoding mechanism that computes and stores ColBERT token-level embeddings in GPU/CPU memory for immediate search without building a persistent PLAID index.
Description
In-Memory Document Encoding provides a lightweight alternative to full PLAID indexing. Instead of building a compressed on-disk index, documents are encoded into dense token-level embedding tensors that are held in memory. This enables fast prototyping, small-collection search, and reranking workflows where the overhead of building a full index is unnecessary.
The encoding process:
- Documents are tokenized and encoded through the ColBERT checkpoint to produce per-token embeddings
- Embeddings are padded to uniform length for efficient batched MaxSim computation
- Document attention masks are created to distinguish real tokens from padding
- Results are stored as tensors in memory (in_memory_embed_docs, doc_masks)
- Supports incremental encoding — calling encode multiple times appends to existing tensors
- Auto-adjusts batch size for long documents to manage memory
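The steps above can be sketched as follows. This is a minimal numpy sketch, not RAGatouille's implementation: `encode_tokens` is a hypothetical stand-in for the ColBERT checkpoint (the real library runs the transformer there), and `InMemoryEncoder` only illustrates the padding, masking, and incremental-append logic.

```python
import numpy as np

EMBED_DIM = 128  # h: embedding dimension (assumed value for the sketch)

def encode_tokens(doc: str) -> np.ndarray:
    # Hypothetical stand-in for the ColBERT checkpoint: one embedding
    # row per token. A real encoder runs the transformer here.
    tokens = doc.split()
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))
    return rng.standard_normal((len(tokens), EMBED_DIM)).astype(np.float32)

class InMemoryEncoder:
    def __init__(self):
        self.in_memory_embed_docs = None  # (N, n, h) padded embeddings
        self.doc_masks = None             # (N, n): 1 = real token, 0 = pad

    def encode(self, docs):
        per_doc = [encode_tokens(d) for d in docs]
        n = max(e.shape[0] for e in per_doc)
        # Pad to a uniform length; keep the longer of old/new lengths.
        if self.in_memory_embed_docs is not None:
            n = max(n, self.in_memory_embed_docs.shape[1])
        embeds = np.zeros((len(per_doc), n, EMBED_DIM), dtype=np.float32)
        masks = np.zeros((len(per_doc), n), dtype=np.float32)
        for i, e in enumerate(per_doc):
            embeds[i, :e.shape[0]] = e
            masks[i, :e.shape[0]] = 1.0
        if self.in_memory_embed_docs is None:
            self.in_memory_embed_docs, self.doc_masks = embeds, masks
        else:
            # Incremental encoding: re-pad stored tensors, then append.
            old, old_masks = self.in_memory_embed_docs, self.doc_masks
            if old.shape[1] < n:
                pad = n - old.shape[1]
                old = np.pad(old, ((0, 0), (0, pad), (0, 0)))
                old_masks = np.pad(old_masks, ((0, 0), (0, pad)))
            self.in_memory_embed_docs = np.concatenate([old, embeds])
            self.doc_masks = np.concatenate([old_masks, masks])
```

Padding all documents to one length is what makes the later MaxSim computation a single batched tensor operation; the masks exist so padded positions never contribute to a score.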
Usage
Use this principle when:
- Working with small document collections (search is brute-force over all stored embeddings, so latency and memory grow linearly with collection size)
- Prototyping search without the overhead of building a full index
- Documents change frequently and rebuilding an index each time is impractical
- You need to search a temporary collection that won't be persisted
For collections larger than ~1000 documents, prefer building a PLAID index instead.
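A typical index-free workflow looks like the sketch below. Method names (`encode`, `search_encoded_docs`, `clear_encoded_docs`) are taken from RAGatouille's `RAGPretrainedModel`; exact signatures may differ across library versions, and running this requires downloading the checkpoint.

```python
from ragatouille import RAGPretrainedModel

# Load a ColBERT checkpoint; no index is built at any point.
rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Encode documents into in-memory token-level tensors.
rag.encode(["ColBERT performs late interaction over token embeddings.",
            "PLAID compresses embeddings into a persistent index."])

# Incremental: a second call appends to the existing tensors.
rag.encode(["In-memory encoding skips the compression step entirely."])

# Search directly against the in-memory tensors.
results = rag.search_encoded_docs(query="how does late interaction work?", k=2)

# Drop the temporary collection when done.
rag.clear_encoded_docs()
```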
Theoretical Basis
In-memory encoding computes the same token-level representations as PLAID indexing but without the compression step. For a collection of N documents, the stored embedding tensor has shape

E ∈ R^(N × n × h)

where n is the padded token count and h is the embedding dimension. The full dense tensors are stored, enabling exact MaxSim computation without the approximation inherent in PLAID's centroid-based search.
The tradeoff is memory: storing full float tensors uses significantly more memory than quantized PLAID indexes, but provides exact scoring.
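The exact MaxSim scoring enabled by the full dense tensors can be written as one batched operation. A minimal numpy sketch, assuming the tensor layout described above (function and parameter names are illustrative, not RAGatouille's API):

```python
import numpy as np

def maxsim_scores(query_embeds, doc_embeds, doc_masks):
    """Exact ColBERT MaxSim over in-memory tensors.

    query_embeds: (q, h) query token embeddings
    doc_embeds:   (N, n, h) padded document token embeddings
    doc_masks:    (N, n) with 1 for real tokens, 0 for padding
    Returns (N,) scores: for each query token, the max similarity
    to any real document token, summed over query tokens.
    """
    # (N, n, q): similarity of every document token to every query token
    sims = doc_embeds @ query_embeds.T
    # Padding must never win the max: push masked positions to -inf.
    sims = np.where(doc_masks[:, :, None] == 1, sims, -np.inf)
    return sims.max(axis=1).sum(axis=1)
```

Because the uncompressed embeddings are used directly, these scores are exact; PLAID's centroid-pruned search trades a small approximation for far lower memory use.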