Workflow:AnswerDotAI RAGatouille In-Memory Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, RAG, ColBERT |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Process for encoding documents into in-memory ColBERT embeddings and performing index-free retrieval or reranking without building a persistent PLAID index.
Description
This workflow covers two related capabilities in RAGatouille that operate without a persistent on-disk index: in-memory document encoding with search, and document reranking. Both use ColBERT's late-interaction scoring mechanism directly on encoded embeddings held in memory. The encoding workflow allows incremental addition of documents to the in-memory store, while reranking operates on a one-shot list of candidate documents.
Key outputs:
- Ranked search results with content, scores, and ranks (no persistent index created)
- Reranked document lists scored by ColBERT relevance
Scope:
- From raw text documents to in-memory retrieval results or reranked candidates
- Suitable for small to moderate document sets where index construction overhead is unnecessary
Trade-offs:
- Scoring is exact brute force with no compression or approximate search, so performance degrades rapidly as the document set grows
- No persistence; embeddings are lost when the model object is destroyed
Usage
Execute this workflow when you need lightweight ColBERT-quality retrieval without the overhead of building and persisting a full PLAID index. This is the right workflow when:
- You have a small set of candidate documents (fewer than ~1000) to search or rerank
- You are integrating ColBERT as a reranker in a multi-stage retrieval pipeline
- You want to quickly prototype retrieval quality before committing to full index construction
- You need to incrementally encode and search documents in a session without disk I/O
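Both paths can be sketched end to end. The following is illustrative pseudocode: the method names mirror the steps described in this workflow and should be verified against the installed RAGatouille version.

```
# Illustrative pseudocode -- method names follow this workflow's steps
model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Path A: in-memory encode + search (incremental across calls)
model.encode(first_batch_of_docs)
model.encode(more_docs)                      # appended to the in-memory store
results = model.search_encoded_docs(query, k=10)

# Path B: one-shot reranking of first-stage candidates
reranked = model.rerank(query, candidate_docs, k=5)

# Free embeddings when done (force skips the safety delay)
model.clear_encoded_docs(force=True)
```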
Execution Steps
Step 1: Load Pretrained Model
Initialize a RAGPretrainedModel from a pretrained ColBERT checkpoint. This loads the encoder and inference checkpoint needed for document and query encoding. Model loading is identical to the indexing workflow.
Key considerations:
- Use `from_pretrained()` with a HuggingFace model name or local path
- The inference checkpoint must be loaded (not training mode)
Step 2: Encode Documents (for In-Memory Search)
Encode a list of documents into ColBERT token-level embeddings held in GPU or CPU memory. The encoding computes per-token vectors for each document and stores them alongside a document mask tensor. Documents can be encoded incrementally across multiple calls, and optional metadata can be attached.
What happens:
- Each document is tokenized and passed through the ColBERT encoder
- Token-level embeddings are padded to the maximum document length and concatenated
- The in-memory store tracks the document collection, embeddings, masks, and metadata
- Maximum token length is auto-calibrated based on the 90th percentile of document lengths
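The padding and auto-calibration mechanics above can be illustrated with a minimal NumPy sketch. Random vectors stand in for real ColBERT token embeddings; the shapes, variable names, and truncation behavior are assumptions for illustration, not RAGatouille's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy embedding dimension (ColBERT typically uses 128)

# Per-document token embeddings of varying length, as an encoder would emit
doc_lengths = [3, 5, 2, 8, 4]
docs = [rng.normal(size=(n, dim)) for n in doc_lengths]

# Auto-calibrate the maximum token length from the 90th percentile of lengths
max_len = int(np.percentile(doc_lengths, 90))

# Pad every document to max_len and track real tokens in a boolean mask
embeddings = np.zeros((len(docs), max_len, dim))
mask = np.zeros((len(docs), max_len), dtype=bool)
for i, d in enumerate(docs):
    n = min(len(d), max_len)      # docs longer than the calibrated max are cut
    embeddings[i, :n] = d[:n]
    mask[i, :n] = True            # True marks real tokens, False marks padding
```

The mask is what lets later MaxSim scoring ignore padding positions without storing ragged arrays.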
Step 3: Search Encoded Documents
Query the in-memory encoded documents using ColBERT's MaxSim scoring. Each query token's embedding is compared against all of a document's token embeddings; the maximum similarity over the document's tokens is taken for each query token, and these per-query-token maxima are summed to produce the document score. Results are sorted by score and the top-k are returned.
Key considerations:
- Supports single or batch queries
- Returns results as dictionaries with content, score, rank, and result index
- If metadata was attached during encoding, it is included in results
- No approximate search is used; this is exact brute-force scoring
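The MaxSim computation described above can be written directly in NumPy. This is a self-contained sketch with toy embeddings, not RAGatouille's actual code; the function name and array layout are assumptions.

```python
import numpy as np

def maxsim_scores(query_emb, doc_embs, doc_mask):
    """ColBERT late interaction: for each query token, take the maximum
    similarity over a document's real (unmasked) tokens, then sum those
    maxima across query tokens.

    query_emb: (q, dim)     -- one query as q token embeddings
    doc_embs:  (n, t, dim)  -- n documents padded to t tokens
    doc_mask:  (n, t) bool  -- True where a real token exists
    """
    # (n, q, t): similarity of every query token to every document token
    sims = np.einsum("qd,ntd->nqt", query_emb, doc_embs)
    sims = np.where(doc_mask[:, None, :], sims, -np.inf)  # ignore padding
    return sims.max(axis=2).sum(axis=1)  # (n,) one score per document

# Toy example: doc 0 matches the query exactly, doc 1 only weakly
q = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # identical token vectors
    [[0.5, 0.0], [0.0, 0.0]],   # second position is padding
])
mask = np.array([[True, True], [True, False]])
scores = maxsim_scores(q, docs, mask)
ranked = np.argsort(-scores)  # exact brute-force ranking, best first
```

Because every query token is compared against every document token, cost grows with the product of query length and total document tokens, which is why this path is only exact brute-force scoring with no approximation.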
Step 4: Rerank Candidate Documents (Alternative Path)
Instead of encoding and searching, directly rerank a provided list of candidate documents against a query. This is a one-shot operation that encodes both query and documents on the fly, computes MaxSim scores, and returns the top-k reranked results.
Key considerations:
- Suitable for reranking output from a first-stage retriever (BM25, dense embeddings, etc.)
- A performance warning applies when the 90th-percentile document length exceeds ~300 tokens
- Duplicate documents degrade performance and result quality
- The `k` parameter must not exceed the number of provided documents
Step 5: Clean Up (Optional)
Clear in-memory encoded documents to free GPU/CPU memory. By default, a 10-second safety delay is imposed before deletion to prevent accidental data loss. This can be bypassed with the `force` parameter.
Key considerations:
- Deletes all stored embeddings, masks, metadata, and the in-memory collection
- Required before re-encoding a different document set if memory is constrained