Principle:Neuml Txtai Two Stage Retrieval

From Leeroopedia


Knowledge Sources
Domains: Information_Retrieval, Reranking
Last Updated: 2026-02-09 17:00 GMT

Overview

Two-stage retrieval pipelines use a fast first-stage retriever to generate candidate documents, then apply a more expensive cross-encoder reranker to re-score and reorder candidates for higher precision.

Description

The two-stage retrieve-then-rerank architecture addresses a fundamental tension in information retrieval: fast retrieval methods that can efficiently search millions of documents tend to produce imprecise rankings, while highly accurate scoring methods are too computationally expensive to apply to the entire collection. By combining both approaches in a pipeline, the system achieves high recall through the first stage and high precision through the second stage.

In txtai, the first stage typically uses the embeddings search engine, which performs approximate nearest neighbor (ANN) lookup over pre-computed dense vectors. This stage is optimized for speed and recall and returns the top-k most similar documents as the candidate set. The candidate set size k is a critical parameter: if k is too small, relevant documents are dropped before the reranker can consider them; if k is too large, the reranker becomes a throughput bottleneck. Typical values range from 50 to 1000 depending on collection size and latency requirements.
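As a rough illustration of what the first stage returns, the sketch below ranks a handful of toy vectors by cosine similarity and keeps the top k. It uses an exhaustive scan as a stand-in for ANN lookup and does not use txtai's API; the names `cosine`, `first_stage`, and `doc_vecs` are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def first_stage(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity.

    Exhaustive scan used as a stand-in for ANN lookup (assumption, not txtai's API).
    """
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy pre-computed document vectors (hypothetical data).
doc_vecs = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
candidates = first_stage([1.0, 0.1], doc_vecs, k=2)
```

With k=2 only documents "a" and "b" survive; anything the first stage leaves out here is invisible to the reranker, which is why k bounds the recall of the whole pipeline.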

The second stage applies a cross-encoder model that jointly processes the query and each candidate document through a transformer with full bidirectional attention. Unlike bi-encoder models that independently encode query and document, the cross-encoder can model fine-grained token interactions, capturing nuanced relevance signals such as negation, qualification, and contextual meaning. The cross-encoder produces a scalar relevance score for each query-candidate pair. These scores replace the first-stage scores, and the candidates are reordered accordingly. Score normalization between stages may be applied to maintain consistent score semantics for downstream consumers.
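The re-score-and-reorder step can be sketched as follows. The pair scorer here is a toy bigram-overlap function standing in for a cross-encoder, chosen only because it is sensitive to word order in a way a bag-of-words comparison is not; `rerank`, `bigrams`, and `toy_pair_scorer` are hypothetical names, and the first-stage scores are made up.

```python
def rerank(query, candidates, texts, score_pair):
    """Re-score first-stage candidates and sort by the new score.

    candidates: list of (doc_id, first_stage_score)
    texts: dict mapping doc_id -> document text
    score_pair: callable (query, text) -> float, standing in for a cross-encoder
    """
    # First-stage scores are discarded and replaced by the pairwise scores.
    rescored = [(doc_id, score_pair(query, texts[doc_id])) for doc_id, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

def bigrams(text):
    """Set of adjacent lowercase token pairs in the text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def toy_pair_scorer(query, text):
    """Toy stand-in for a cross-encoder: fraction of query bigrams found in the text."""
    q = bigrams(query)
    return len(q & bigrams(text)) / len(q) if q else 0.0

texts = {
    "a": "man bites dog in park",
    "b": "dog bites man in park",
}
# Hypothetical first-stage output: near-identical scores, wrong order.
first_stage_hits = [("a", 0.90), ("b", 0.89)]
reranked = rerank("dog bites man", first_stage_hits, texts, toy_pair_scorer)
```

Both documents contain the same words, so a comparison that ignores word order cannot separate them; the pairwise scorer moves "b" to the top because it sees the query and document together.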

Usage

Deploy two-stage retrieval when single-stage dense retrieval does not meet precision requirements, especially for queries involving complex semantics, negation, or multi-hop reasoning. The pattern is standard in production search systems and RAG pipelines where answer quality depends on the top few results. Tune the candidate set size k based on the tradeoff between reranking latency and the risk of missing relevant documents in the first stage.
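One way to ground the choice of k is to measure first-stage recall at several cutoffs on queries with known relevant documents. The sketch below does this for a single query; the ranking and relevance judgments are hypothetical data and `recall_at_k` is an illustrative helper.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k of the ranking."""
    if not relevant_ids:
        return 0.0
    found = len(set(ranked_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)

# Hypothetical first-stage ranking and relevance judgments for one query.
ranked = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d5"]
relevant = {"d2", "d1", "d5", "d6"}  # d6 is never retrieved by the first stage

# Recall at each candidate-set size under consideration.
sweep = {k: recall_at_k(ranked, relevant, k) for k in (2, 5, 8)}
```

In practice the sweep would be averaged over a query set and read alongside the reranking latency at each k.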

Key Considerations

The reranker model must be selected to complement the first-stage retriever. Using a cross-encoder from the same model family as the bi-encoder (e.g., both based on the same pre-trained language model) can sometimes lead to correlated errors. Diversity in model architecture between stages can improve overall pipeline robustness.

Latency budgets should account for both stages. The first stage typically completes in single-digit milliseconds for ANN search, while the reranker may require 10-100 milliseconds per candidate document depending on model size and hardware. If a candidate set of 100 documents is scored one document at a time, total reranking latency is therefore roughly 1-10 seconds; batching and GPU acceleration are the usual ways to reduce it.
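The arithmetic behind these figures can be written as a small estimator. It assumes, as a deliberate simplification, that every batch costs one forward pass of fixed duration regardless of batch size; real batch cost grows with batch size and sequence length. The function name and parameters are illustrative.

```python
import math

def rerank_latency_ms(num_candidates, ms_per_forward_pass, batch_size=1):
    """Estimate reranking latency in milliseconds.

    Assumption: each batch costs one forward pass of fixed duration.
    """
    batches = math.ceil(num_candidates / batch_size)
    return batches * ms_per_forward_pass

# 100 candidates scored one at a time, at the two ends of the per-document range.
low = rerank_latency_ms(100, 10)
high = rerank_latency_ms(100, 100)

# Same slow model, scored in batches of 32 (4 forward passes).
batched = rerank_latency_ms(100, 100, batch_size=32)
```

Under these assumptions the sequential estimates come to 1000 ms and 10000 ms, and batching by 32 brings the slow case to 400 ms.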

In RAG applications, two-stage retrieval is especially valuable because the language model's answer quality is highly sensitive to the relevance of the top few retrieved passages. Even modest improvements in top-5 precision from reranking translate to measurable gains in downstream answer accuracy and faithfulness.

Document truncation is a practical concern for cross-encoder reranking. Most cross-encoder models have a maximum input length (typically 512 tokens), and because the query and document are encoded together, that limit covers both. Long documents must therefore be truncated or chunked before scoring. Strategies include truncating the document to the leading tokens that fit within the limit, scoring multiple chunks and taking the maximum score, or pre-selecting the most relevant passage within each document using a lighter-weight method.
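The difference between truncating and max-over-chunks scoring can be shown with a toy example. Whitespace tokens stand in for model tokens, the window is shrunk to 6 tokens so the effect is visible, and `overlap_scorer` is a hypothetical stand-in for a cross-encoder; none of this reflects a specific library's behavior.

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into consecutive windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def max_chunk_score(query, document, score_pair, max_len=512):
    """Score each chunk against the query and keep the best score."""
    tokens = document.split()  # whitespace tokens as a stand-in for model tokens
    chunks = chunk_tokens(tokens, max_len) or [[]]
    return max(score_pair(query, " ".join(chunk)) for chunk in chunks)

def overlap_scorer(query, text):
    """Toy stand-in for a cross-encoder: fraction of query terms present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

# The relevant passage sits at the end of the document.
doc = "intro filler words here " * 3 + "cross encoder reranking improves precision"
query = "cross encoder reranking"

# Strategy 1: keep only the first window.
truncated = overlap_scorer(query, " ".join(doc.split()[:6]))
# Strategy 2: score every window and take the maximum.
best = max_chunk_score(query, doc, overlap_scorer, max_len=6)
```

Truncation misses the relevant passage entirely, while max-over-chunks finds it at the cost of one scoring call per chunk.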

Caching reranker scores for frequently repeated queries can substantially reduce average latency in production systems. When the document collection is relatively stable and the same queries recur, memoizing cross-encoder scores avoids redundant inference and brings the effective latency of two-stage retrieval closer to that of single-stage search.
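A minimal memoization sketch using the standard library's `functools.lru_cache` is shown below. `expensive_score` is a hypothetical stand-in for one cross-encoder forward pass, and the cache is keyed on the query and a document ID, which is only safe under the assumption that document contents do not change while the cache is alive.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive scorer actually runs

def expensive_score(query, doc_id):
    """Hypothetical stand-in for one cross-encoder forward pass."""
    calls["count"] += 1
    return float(len(set(query.split()) & set(doc_id.split("-"))))

@lru_cache(maxsize=10_000)
def cached_score(query, doc_id):
    """Memoized scorer keyed on (query, doc_id)."""
    return expensive_score(query, doc_id)

def rerank_cached(query, doc_ids):
    """Rerank document IDs by score, reusing cached scores where available."""
    scored = [(doc_id, cached_score(query, doc_id)) for doc_id in doc_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

doc_ids = ["vector-index", "sql-join", "vector-search"]
first = rerank_cached("vector search", doc_ids)
second = rerank_cached("vector search", doc_ids)  # served entirely from the cache
```

If documents can be updated, including a content version or hash in the cache key is one way to avoid serving stale scores.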

Theoretical Basis

1. Retrieve-then-rerank architecture decomposes search into two phases: a high-recall, low-precision retriever that narrows the search space from millions to hundreds of candidates, followed by a high-precision, low-throughput reranker that optimizes the final ranking.

2. Cross-encoder scoring passes the concatenated query-document pair through a transformer model, producing a relevance score informed by full bidirectional attention across all query and document tokens, capturing token-level interactions that bi-encoders cannot model.

3. Candidate set size selection (parameter k) governs the recall ceiling of the pipeline: the reranker cannot promote documents not present in the candidate set, so k must be large enough to include most relevant documents while remaining small enough for acceptable reranking latency.

4. Score normalization between stages reconciles the different score distributions produced by the retriever (e.g., cosine similarity in [-1, 1]) and the reranker (e.g., logits in an unbounded range), ensuring that downstream consumers receive consistent and interpretable scores.

5. Diminishing returns of deeper reranking follow from the observation that retriever scores are positively correlated with true relevance, so documents ranked beyond position k are increasingly unlikely to be relevant, making the marginal benefit of increasing k shrink relative to the added latency cost.
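As a sketch of the score normalization in item 4, the functions below map each stage's scores onto the unit interval: a linear rescale for cosine similarity and a logistic sigmoid for reranker logits. These are one possible choice, not a method prescribed by the source, and sharing a range does not by itself make the two scores calibrated against each other.

```python
import math

def cosine_to_unit(score):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return (score + 1.0) / 2.0

def logit_to_unit(logit):
    """Map an unbounded logit to the unit interval with the logistic sigmoid.

    Written in the numerically stable form so large negative logits
    do not overflow math.exp.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

retriever_scores = [cosine_to_unit(s) for s in (-1.0, 0.0, 1.0)]
reranker_scores = [logit_to_unit(z) for z in (-4.0, 0.0, 4.0)]
```

Both transforms are monotonic, so they change the scale reported to downstream consumers without changing the ranking within a stage.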
