Principle:PacktPublishing LLM Engineers Handbook Cross Encoder Reranking
| Field | Value |
|---|---|
| Concept | Reranking retrieved candidates using a cross-encoder model |
| Category | Retrieval / Reranking |
| Workflow | RAG_Inference |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_Reranker_Generate |
Overview
Cross-encoder reranking is the second stage of a two-stage retrieval approach: candidates returned by fast vector search are re-scored with a slower but more accurate cross-encoder model. Bi-encoders, which are used for the initial retrieval, encode the query and the document independently. Cross-encoders instead encode the (query, document) pair jointly, which captures fine-grained interactions between query and document tokens. This improves precision, but because the score depends on the query, it cannot be pre-computed or indexed, so the cross-encoder is applied only to a small candidate set.
Theory
Mathematical Basis
The cross-encoder produces a single relevance score for each (query, document) pair:
score = CrossEncoder(query, document), where score ∈ ℝ
The top-K candidates are then selected by descending score.
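The scoring and selection step can be sketched as follows. This is a minimal illustration, not the repository's implementation: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy word-overlap function standing in for a real cross-encoder so that the example runs without any model download.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the top_k by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


docs = [
    "vector search with ANN index",
    "cross encoder reranking improves precision",
    "bi encoder retrieval",
]
top = rerank("cross encoder reranking", docs, overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

In a real system `score_fn` would be backed by a trained model. One option, assuming the sentence-transformers library is available, is the `predict` method of `sentence_transformers.CrossEncoder`, which accepts a list of (query, document) pairs and returns one score per pair.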
Bi-Encoder vs. Cross-Encoder
| Property | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Encoding | Query and document encoded independently | Query and document encoded jointly |
| Interaction | None between tokens; only a vector similarity (dot product / cosine) after encoding | Early interaction (full self-attention over the concatenated pair) |
| Indexability | Can pre-compute document embeddings | Cannot pre-compute; requires query at inference |
| Speed | Fast (sub-linear with ANN index) | Slow (linear in number of candidates) |
| Accuracy | Good | Higher (attention spans query and document tokens jointly) |
| Use case | Initial retrieval over large collection | Reranking a small candidate set |
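The difference between independent and joint encoding in the table above can be illustrated with a toy example. Everything here is an assumption made for illustration: `embed` is a bag-of-words stand-in for a bi-encoder, and `joint_score` is a word-order heuristic standing in for a cross-encoder. Neither is a real model. The point is only that the first can be pre-computed per document, while the second needs the query and document together.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: a bag-of-words vector built from one text alone."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def joint_score(query: str, document: str) -> float:
    """Toy cross-encoder stand-in: sees both texts at once, so it can reward
    adjacent query word pairs that appear in the same order in the document."""
    q = query.lower().split()
    d = document.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    query_bigrams = list(zip(q, q[1:]))
    if not query_bigrams:
        return 0.0
    return sum(bg in doc_bigrams for bg in query_bigrams) / len(query_bigrams)


documents = ["the dog bit the man", "the man bit the dog"]

# Bi-encoder side: document vectors are computed once, before any query arrives.
index = [embed(doc) for doc in documents]

query = "man bit dog"
bi_scores = [cosine(embed(query), vec) for vec in index]

# Cross-encoder side: nothing can be cached; each pair is scored at query time.
cross_scores = [joint_score(query, doc) for doc in documents]

print(bi_scores)
print(cross_scores)
```

Both documents contain the same words, so the independent encoding cannot tell them apart, while the joint scorer prefers the document whose word order matches the query. Real models are far more expressive than these stand-ins, but the indexability trade-off has the same shape.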
Two-Stage Retrieval
The two-stage approach combines the strengths of both models:
- Stage 1 (Recall) - The bi-encoder performs fast ANN search to retrieve a broad candidate set (e.g., top 50-100 documents)
- Stage 2 (Precision) - The cross-encoder re-scores the candidates with higher accuracy and selects the top-K (e.g., top 5-10) for final use
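This approaches cross-encoder accuracy on the final results while keeping latency far below that of scoring the whole collection with a cross-encoder. The added cost grows linearly with the size of the candidate set, and final quality is bounded by Stage 1 recall: a relevant document that the bi-encoder fails to retrieve cannot be recovered by reranking.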
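The two stages can be sketched as one small pipeline. This is an illustrative sketch under stated assumptions: `two_stage_search`, `top_n`, `shared_words`, and `ordered_pairs` are hypothetical names, Stage 1 is a brute-force scan standing in for ANN search, and both scorers are toy functions rather than trained models.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def top_n(query: str, docs: List[str], score: Scorer, n: int) -> List[str]:
    """Score each document exactly once and return the n best, highest first."""
    scored = [(score(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n]]


def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 5,
) -> List[str]:
    # Stage 1 (recall): cheap scorer over the whole corpus (stand-in for ANN search).
    candidates = top_n(query, corpus, cheap_score, n_candidates)
    # Stage 2 (precision): expensive scorer over the candidates only.
    return top_n(query, candidates, precise_score, top_k)


def shared_words(query: str, document: str) -> float:
    """Toy cheap scorer: number of distinct words the two texts share."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


precise_calls: List[str] = []


def ordered_pairs(query: str, document: str) -> float:
    """Toy precise scorer: fraction of adjacent query word pairs found in order."""
    precise_calls.append(document)  # record how often the expensive scorer runs
    q, d = query.lower().split(), document.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    query_bigrams = list(zip(q, q[1:]))
    if not query_bigrams:
        return 0.0
    return sum(bg in doc_bigrams for bg in query_bigrams) / len(query_bigrams)


corpus = [
    "the dog bit the man",
    "the man bit the dog",
    "cats sleep all day",
    "a man and his dog",
]
results = two_stage_search(
    "man bit dog", corpus, shared_words, ordered_pairs, n_candidates=3, top_k=1
)
print(results)
print(len(precise_calls))
```

The expensive scorer runs once per candidate rather than once per corpus document, which is where the latency saving comes from. Raising `n_candidates` improves the chance that Stage 1 includes the relevant document, at a proportional increase in Stage 2 cost.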
When to Use
- When improving retrieval precision by re-scoring initial vector search candidates
- When the initial retrieval returns a manageable number of candidates (typically under 100)
- When answer quality is more important than retrieval latency
- When the application can afford the additional compute of running a cross-encoder model
Related Concepts
- Multi-stage retrieval - cascading retrieval stages with increasing accuracy
- ColBERT - late-interaction model that balances speed and accuracy
- Two-tower models - architecture where query and document are encoded separately
- Cross-attention - transformer mechanism that attends across two sequences