Principle:PacktPublishing LLM Engineers Handbook Cross Encoder Reranking

Field	Value
Concept	Reranking retrieved candidates using a cross-encoder model
Category	Retrieval / Reranking
Workflow	RAG_Inference
Repository	PacktPublishing/LLM-Engineers-Handbook
Implemented by	Implementation:PacktPublishing_LLM_Engineers_Handbook_Reranker_Generate

Overview

Cross-encoder reranking is the second stage of a two-stage retrieval approach: candidates returned by a fast vector search are re-scored with a slower but more accurate cross-encoder model. Unlike bi-encoders (used for initial retrieval), which encode the query and the document independently, a cross-encoder jointly encodes each (query, document) pair and so captures fine-grained interactions between them. This typically improves precision, but because the scores depend on the query they cannot be pre-computed or indexed, so the cross-encoder is applied only to a small candidate set.

Theory

Mathematical Basis

The cross-encoder produces a single relevance score for each (query, document) pair:

score = CrossEncoder(query, document) -> R

The top-K candidates are then selected by descending score.
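
The scoring and selection step can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `rerank`, `score_pairs`, and `toy_score_pairs` are hypothetical names, and the toy scorer (token overlap) only stands in for a real cross-encoder so that the example runs without a model. In practice `score_pairs` would wrap a model's batch prediction call, for example the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, document) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    documents: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the top_k by descending score."""
    pairs = [(query, doc) for doc in documents]
    scores = score_pairs(pairs)  # one relevance score per pair
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_score_pairs(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer (assumption): fraction of query tokens found in the document.

    This is NOT a cross-encoder; it only mimics the (pair) -> score interface.
    """
    scores = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        scores.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return scores


docs = [
    "Bi-encoders embed queries and documents separately.",
    "A cross-encoder scores each query and document pair jointly.",
    "Vector databases store embeddings for fast search.",
]
top = rerank("how does a cross-encoder score a query", docs, toy_score_pairs, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Passing the scorer in as a callable keeps the selection logic independent of any particular model library.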

Bi-Encoder vs. Cross-Encoder

Property	Bi-Encoder	Cross-Encoder
Encoding	Query and document encoded independently	Query and document encoded jointly
Interaction	Only at the final similarity step (dot product / cosine)	Early interaction (full self-attention over the concatenated pair)
Indexability	Can pre-compute document embeddings	Cannot pre-compute; requires query at inference
Speed	Fast (sub-linear with ANN index)	Slow (linear in number of candidates)
Accuracy	Good	Higher (attention between query and document tokens captures token-level interactions)
Use case	Initial retrieval over large collection	Reranking a small candidate set
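
To make the speed and indexability rows concrete, the following sketch counts model forward passes needed at query time under each strategy. It is a rough cost model under stated assumptions: the function name and parameters are hypothetical, the ANN search cost is ignored, and all forward passes are treated as equally expensive (a cross-encoder pass over a concatenated pair is usually more expensive than encoding a query alone).

```python
def forward_passes_per_query(corpus_size: int, candidates: int) -> dict:
    """Count model forward passes at query time (simplified cost model)."""
    return {
        # Document embeddings are pre-computed offline; only the query is encoded.
        "bi_encoder_only": 1,
        # Every (query, document) pair needs its own pass; nothing can be cached.
        "cross_encoder_only": corpus_size,
        # One query encoding plus one cross-encoder pass per retrieved candidate.
        "two_stage": 1 + candidates,
    }


costs = forward_passes_per_query(corpus_size=1_000_000, candidates=50)
for strategy, passes in costs.items():
    print(f"{strategy}: {passes}")
```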

Two-Stage Retrieval

The two-stage approach combines the strengths of both models:

Stage 1 (Recall) - The bi-encoder performs fast ANN search to retrieve a broad candidate set (e.g., top 50-100 documents)
Stage 2 (Precision) - The cross-encoder re-scores the candidates with higher accuracy and selects the top-K (e.g., top 5-10) for final use

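This approaches cross-encoder accuracy at a latency much closer to that of a bi-encoder alone. Final quality is bounded by Stage 1 recall (a relevant document that is not retrieved cannot be reranked), and the added cost is one cross-encoder forward pass per candidate.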
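
The two stages can be sketched end to end as follows. Everything here is a toy under stated assumptions: `embed` (bag-of-words counts) stands in for a trained bi-encoder, the brute-force cosine scan stands in for an ANN index, and `pair_score` (counting query bigrams that appear in the document) stands in for a cross-encoder. All function names are hypothetical. The example is chosen so that Stage 1 ignores word order and Stage 2, which sees query and document together, corrects the ranking.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: each text is encoded on its own as word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query: str, doc: str) -> int:
    """Toy cross-encoder stand-in: counts query bigrams that occur in the document."""
    words = query.lower().split()
    bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return sum(1 for bigram in bigrams if bigram in doc.lower())


def two_stage_search(query: str, corpus: List[str], doc_vectors: List[Counter],
                     recall_k: int, top_k: int) -> List[str]:
    q_vec = embed(query)
    # Stage 1 (recall): rank the whole corpus by embedding similarity, keep recall_k.
    by_similarity = sorted(range(len(corpus)),
                           key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True)
    candidates = by_similarity[:recall_k]
    # Stage 2 (precision): re-score only the candidates with the pairwise scorer.
    reranked = sorted(candidates, key=lambda i: pair_score(query, corpus[i]), reverse=True)
    return [corpus[i] for i in reranked[:top_k]]


corpus = [
    "man bites dog",
    "a dog bites man in the park",
    "cooking pasta at home",
    "the man walks a dog",
]
doc_vectors = [embed(doc) for doc in corpus]  # pre-computed once, before any query

results = two_stage_search("dog bites man", corpus, doc_vectors, recall_k=3, top_k=2)
print(results)
```

Stage 1 places "man bites dog" first because its word counts match the query exactly; Stage 2 promotes the document that preserves the query's word order. The unrelated document never reaches Stage 2, which is what keeps the expensive scorer's workload small.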

When to Use

When retrieval precision matters and the initial vector search candidates can be re-scored
When the initial retrieval returns a manageable number of candidates (typically under 100)
When answer quality is more important than retrieval latency
When the application can afford the additional compute of running a cross-encoder model

Related Concepts

Multi-stage retrieval - cascading retrieval stages with increasing accuracy
ColBERT - late-interaction model that balances speed and accuracy
Two-tower models - architecture where query and document are encoded separately
Cross-attention - transformer mechanism that attends across two sequences
