Workflow:AnswerDotAI RAGatouille In-Memory Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, RAG, ColBERT |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Process for encoding documents into in-memory ColBERT embeddings and performing index-free retrieval or reranking without building a persistent PLAID index.
Description
This workflow covers two related capabilities in RAGatouille that operate without a persistent on-disk index: in-memory document encoding with search, and document reranking. Both use ColBERT's late-interaction scoring mechanism directly on encoded embeddings held in memory. The encoding workflow allows incremental addition of documents to the in-memory store, while reranking operates on a one-shot list of candidate documents.
Key outputs:
- Ranked search results with content, scores, and ranks (no persistent index created)
- Reranked document lists scored by ColBERT relevance
Scope:
- From raw text documents to in-memory retrieval results or reranked candidates
- Suitable for small to moderate document sets where index construction overhead is unnecessary
Trade-offs:
- Scoring is exact brute force with no compression or approximate search, so performance degrades rapidly as the document set grows
- No persistence; embeddings are lost when the model object is destroyed
Usage
Execute this workflow when you need lightweight ColBERT-quality retrieval without the overhead of building and persisting a full PLAID index. This is the right workflow when:
- You have a small set of candidate documents (fewer than ~1000) to search or rerank
- You are integrating ColBERT as a reranker in a multi-stage retrieval pipeline
- You want to quickly prototype retrieval quality before committing to full index construction
- You need to incrementally encode and search documents in a session without disk I/O
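Both paths can be sketched end to end. The following is illustrative pseudocode: the method names mirror the steps described in this workflow and should be verified against the installed RAGatouille version.

```
# Illustrative pseudocode -- method names follow this workflow's steps
model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Path A: in-memory encode + search (incremental across calls)
model.encode(first_batch_of_docs)
model.encode(more_docs)                      # appended to the in-memory store
results = model.search_encoded_docs(query, k=10)

# Path B: one-shot reranking of first-stage candidates
reranked = model.rerank(query, candidate_docs, k=5)

# Free embeddings when done (force skips the safety delay)
model.clear_encoded_docs(force=True)
```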
Execution Steps
Step 1: Load Pretrained Model
Initialize a RAGPretrainedModel from a pretrained ColBERT checkpoint. This loads the encoder and inference checkpoint needed for document and query encoding. Model loading is identical to the indexing workflow.
Key considerations:
- Use `from_pretrained()` with a HuggingFace model name or local path
- The inference checkpoint must be loaded (not training mode)
Step 2: Encode Documents (for In-Memory Search)
Encode a list of documents into ColBERT token-level embeddings held in GPU or CPU memory. The encoding computes per-token vectors for each document and stores them alongside a document mask tensor. Documents can be encoded incrementally across multiple calls, and optional metadata can be attached.
What happens:
- Each document is tokenized and passed through the ColBERT encoder
- Token-level embeddings are padded to the maximum document length and concatenated
- The in-memory store tracks the document collection, embeddings, masks, and metadata
- Maximum token length is auto-calibrated based on the 90th percentile of document lengths
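The padding and auto-calibration mechanics above can be illustrated with a minimal NumPy sketch. Random vectors stand in for real ColBERT token embeddings; the shapes, variable names, and truncation behavior are assumptions for illustration, not RAGatouille's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy embedding dimension (ColBERT typically uses 128)

# Per-document token embeddings of varying length, as an encoder would emit
doc_lengths = [3, 5, 2, 8, 4]
docs = [rng.normal(size=(n, dim)) for n in doc_lengths]

# Auto-calibrate the maximum token length from the 90th percentile of lengths
max_len = int(np.percentile(doc_lengths, 90))

# Pad every document to max_len and track real tokens in a boolean mask
embeddings = np.zeros((len(docs), max_len, dim))
mask = np.zeros((len(docs), max_len), dtype=bool)
for i, d in enumerate(docs):
    n = min(len(d), max_len)      # docs longer than the calibrated max are cut
    embeddings[i, :n] = d[:n]
    mask[i, :n] = True            # True marks real tokens, False marks padding
```

The mask is what lets later MaxSim scoring ignore padding positions without storing ragged arrays.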
Step 3: Search Encoded Documents
Query the in-memory encoded documents using ColBERT's MaxSim scoring. Each query token's embedding is compared against all of a document's token embeddings; the maximum similarity over the document's tokens is taken for each query token, and these per-query-token maxima are summed to produce the document score. Results are sorted by score and the top-k are returned.
Key considerations:
- Supports single or batch queries
- Returns results as dictionaries with content, score, rank, and result index
- If metadata was attached during encoding, it is included in results
- No approximate search is used; this is exact brute-force scoring
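The MaxSim computation described above can be written directly in NumPy. This is a self-contained sketch with toy embeddings, not RAGatouille's actual code; the function name and array layout are assumptions.

```python
import numpy as np

def maxsim_scores(query_emb, doc_embs, doc_mask):
    """ColBERT late interaction: for each query token, take the maximum
    similarity over a document's real (unmasked) tokens, then sum those
    maxima across query tokens.

    query_emb: (q, dim)     -- one query as q token embeddings
    doc_embs:  (n, t, dim)  -- n documents padded to t tokens
    doc_mask:  (n, t) bool  -- True where a real token exists
    """
    # (n, q, t): similarity of every query token to every document token
    sims = np.einsum("qd,ntd->nqt", query_emb, doc_embs)
    sims = np.where(doc_mask[:, None, :], sims, -np.inf)  # ignore padding
    return sims.max(axis=2).sum(axis=1)  # (n,) one score per document

# Toy example: doc 0 matches the query exactly, doc 1 only weakly
q = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # identical token vectors
    [[0.5, 0.0], [0.0, 0.0]],   # second position is padding
])
mask = np.array([[True, True], [True, False]])
scores = maxsim_scores(q, docs, mask)
ranked = np.argsort(-scores)  # exact brute-force ranking, best first
```

Because every query token is compared against every document token, cost grows with the product of query length and total document tokens, which is why this path is only exact brute-force scoring with no approximation.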
Step 4: Rerank Candidate Documents (Alternative Path)
Instead of encoding and searching, directly rerank a provided list of candidate documents against a query. This is a one-shot operation that encodes both query and documents on the fly, computes MaxSim scores, and returns the top-k reranked results.
Key considerations:
- Suitable for reranking output from a first-stage retriever (BM25, dense embeddings, etc.)
- A performance warning applies when the 90th-percentile document length exceeds ~300 tokens
- Duplicate documents degrade performance and result quality
- The `k` parameter must not exceed the number of provided documents
Step 5: Clean Up (Optional)
Clear in-memory encoded documents to free GPU/CPU memory. By default, a 10-second safety delay is imposed before deletion to prevent accidental data loss. This can be bypassed with the `force` parameter.
Key considerations:
- Deletes all stored embeddings, masks, metadata, and the in-memory collection
- Required before re-encoding a different document set if memory is constrained