
Principle:Deepset ai Haystack Cross Encoder Reranking

From Leeroopedia

Metadata

Field | Value
Principle Name | Cross-Encoder Reranking
Domains | Information_Retrieval, NLP
Related Implementation | Deepset_ai_Haystack_TransformersSimilarityRanker
Source Reference | haystack/components/rankers/transformers_similarity.py:L24-328
Repository | Deepset_ai_Haystack

Overview

Cross-encoder reranking uses a transformer model that jointly encodes a query-document pair to produce a relevance score, providing higher accuracy than bi-encoder retrieval at the cost of speed. It is typically employed as a second-stage reranker after an initial fast retrieval step (BM25 or embedding-based), refining the ranking of a small candidate set with more expressive cross-attention scoring.

Description

In a multi-stage retrieval pipeline, the first stage (BM25, embedding retrieval, or both) efficiently narrows the full document collection down to a manageable candidate set (typically 10-100 documents). The second stage then applies a more accurate but computationally expensive model to rerank these candidates.

Cross-encoder reranking serves this second-stage role. Unlike bi-encoders that embed queries and documents independently, a cross-encoder takes the concatenated query-document pair as a single input to the transformer. This allows the model to compute full cross-attention between all query tokens and all document tokens, producing a more nuanced relevance judgment.

Key characteristics:

  • Joint encoding: The query and document are processed together as a single sequence [CLS] query [SEP] document [SEP], enabling rich token-level interactions.
  • Higher accuracy: Full cross-attention captures subtle relevance signals that independent encoding misses, such as negation, entity relationships, and contextual nuance.
  • No precomputation: Because the model requires both query and document as input, scores cannot be precomputed. Each (query, document) pair requires a separate forward pass.
  • Linear per-query cost: Scoring N candidate documents requires N separate forward passes through the transformer, one per (query, document) pair. This makes cross-encoders impractical for searching the full corpus but ideal for reranking a small candidate set.
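To make the cost asymmetry concrete, here is a back-of-envelope comparison. The 10 ms per-pass latency is an assumed illustrative figure, not a measured benchmark:

```python
# Illustrative cost arithmetic: the per-pass latency is an assumption.
ms_per_pass = 10          # assumed cross-encoder forward-pass latency
corpus_size = 1_000_000   # documents in the full collection
candidate_set = 100       # documents passed to the reranker

full_corpus_ms = corpus_size * ms_per_pass   # one forward pass per document
rerank_ms = candidate_set * ms_per_pass      # only the candidates are scored

print(f"full corpus: {full_corpus_ms / 60_000:.0f} min/query")  # hours-scale
print(f"rerank only: {rerank_ms / 1000:.1f} s/query")           # seconds-scale
```

At these assumed numbers, cross-encoding the full corpus takes minutes per query, while reranking 100 candidates takes about a second.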

Theoretical Basis

Bi-Encoder vs. Cross-Encoder

The fundamental tradeoff in neural retrieval is between efficiency and accuracy:

Property | Bi-Encoder | Cross-Encoder
Encoding | Query and document encoded independently | Query and document encoded jointly
Interaction | Late interaction (similarity of final vectors) | Early interaction (full cross-attention between tokens)
Precomputation | Document embeddings can be precomputed | No precomputation possible
Scalability | Sublinear with ANN indices | Linear in the number of candidate documents
Accuracy | Good | Superior (captures fine-grained relevance)

The bi-encoder produces a single vector per input, losing token-level information. The cross-encoder preserves all token interactions, enabling it to distinguish cases like:

  • "Python is not a compiled language" (negation)
  • "The bank of the river" vs. "The financial bank" (disambiguation)
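A toy illustration of why pooled representations can miss negation. Bag-of-words cosine stands in for any similarity computed on independently pooled vectors; it is a deliberately crude sketch, not a real bi-encoder:

```python
from collections import Counter
from math import sqrt

def pooled_cosine(a: str, b: str) -> float:
    # Stand-in for bi-encoder similarity: pool tokens into one vector
    # (here, raw term counts), then compare the pooled vectors.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

# The single token "not" flips the meaning but barely moves the pooled score.
s = pooled_cosine("Python is a compiled language",
                  "Python is not a compiled language")
print(f"{s:.2f}")  # high similarity despite opposite claims
```

A cross-encoder, by contrast, lets "not" attend to "compiled" directly, so the contradiction is visible during scoring.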

Cross-Encoder Architecture

A cross-encoder is a standard transformer model (typically based on BERT, RoBERTa, or similar) fine-tuned for sequence pair classification:

  1. Input: The query and document are concatenated with special separator tokens: [CLS] query_tokens [SEP] document_tokens [SEP].
  2. Encoding: The full sequence is passed through the transformer, where all tokens attend to all other tokens via self-attention.
  3. Scoring: The [CLS] token representation is passed through a linear layer to produce a single scalar relevance score (logit).
  4. Calibration: Optionally, the raw logit is passed through a sigmoid function to produce a probability-like score in the range [0, 1].
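The input layout in step 1 can be sketched as below. [CLS] and [SEP] are BERT-style special tokens, and the token lists are assumed to come from the model's own tokenizer; the helper name is illustrative:

```python
def build_joint_input(query_tokens: list[str], doc_tokens: list[str]) -> list[str]:
    # One sequence per (query, document) pair: self-attention in the encoder
    # can then mix query and document tokens freely (step 2), and the [CLS]
    # position is read out for scoring (step 3).
    return ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]

seq = build_joint_input(["german", "capital"], ["berlin", "is", "the", "capital"])
print(seq)
```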

Score Calibration

Raw cross-encoder logits can have arbitrary magnitude and are not directly interpretable. The calibration step applies:

score = sigmoid(logit * calibration_factor)

Where:

  • sigmoid(x) = 1 / (1 + exp(-x)) maps the logit to a [0, 1] range.
  • calibration_factor controls the sharpness of the sigmoid. A factor of 1.0 uses the standard sigmoid; smaller factors produce scores closer to 0.5 (less confident); larger factors push scores toward 0 or 1.
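The calibration formula is a one-liner; the function name here is illustrative, not a Haystack API:

```python
from math import exp

def calibrate(logit: float, calibration_factor: float = 1.0) -> float:
    # sigmoid(logit * calibration_factor): maps any real logit into (0, 1).
    return 1.0 / (1.0 + exp(-logit * calibration_factor))

logit = 2.0
print(calibrate(logit))        # standard sigmoid, ~0.88
print(calibrate(logit, 0.1))   # flatter: pulled toward 0.5
print(calibrate(logit, 5.0))   # sharper: pushed toward 1.0
```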

Two-Stage Retrieval

Cross-encoder reranking is designed for the retrieve-then-rerank pattern:

  1. Stage 1 (Retrieval): A fast retriever (BM25 or bi-encoder) retrieves the top N candidate documents from the full corpus. This is efficient because BM25 uses inverted indices and bi-encoders use precomputed vectors with ANN search.
  2. Stage 2 (Reranking): The cross-encoder scores each of the N candidates against the query and reorders them by relevance. Because N is small (typically 10-100), the cost of N forward passes per query is manageable.

This two-stage approach combines the efficiency of first-stage retrieval with the accuracy of cross-encoder scoring.
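The two stages can be sketched with stand-in scoring functions. Both scorers below are toys (term overlap in place of BM25, a counted "expensive" call in place of a cross-encoder forward pass), chosen to show that stage 2 runs only on the candidate set:

```python
expensive_calls = 0  # counts simulated cross-encoder forward passes

def cheap_score(query: str, doc: str) -> int:
    # Stage-1 stand-in (think BM25): term overlap, cheap over the full corpus.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def expensive_score(query: str, doc: str) -> float:
    # Stage-2 stand-in for one cross-encoder forward pass.
    global expensive_calls
    expensive_calls += 1
    return cheap_score(query, doc) / (1 + abs(len(doc.split()) - len(query.split())))

def retrieve_then_rerank(query: str, corpus: list[str],
                         retrieve_k: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: narrow the whole corpus to retrieve_k candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:retrieve_k]
    # Stage 2: only retrieve_k expensive passes, never len(corpus).
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:final_k]

corpus = [f"document {i} about topic {i % 7}" for i in range(1000)]
top = retrieve_then_rerank("topic 3", corpus)
print(len(top), expensive_calls)  # 5 final results, 50 expensive passes
```

The corpus has 1000 documents, but the expensive scorer runs only 50 times, which is the whole point of the pattern.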

Usage

Cross-encoder reranking is used as a second stage in retrieval pipelines. A typical pipeline with reranking consists of:

  1. A retriever (BM25 or embedding-based) that returns an initial candidate set.
  2. A TransformersSimilarityRanker that reranks the candidates using a cross-encoder model.

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index a few documents in an in-memory store.
doc_store = InMemoryDocumentStore()
doc_store.write_documents([
    Document(content="Berlin is the capital of Germany"),
    Document(content="Paris is known for the Eiffel Tower"),
    Document(content="Germany is located in central Europe"),
])

# Stage 1: BM25 retrieves up to 10 candidates.
# Stage 2: the cross-encoder reranks them and keeps the top 3.
pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=doc_store, top_k=10))
pipeline.add_component("ranker", TransformersSimilarityRanker(top_k=3))
pipeline.connect("retriever.documents", "ranker.documents")

# Both components need the query: the retriever to search, the ranker to score pairs.
result = pipeline.run({
    "retriever": {"query": "German capital city"},
    "ranker": {"query": "German capital city"},
})
for doc in result["ranker"]["documents"]:
    print(doc.content, doc.score)

Related Pages

Implemented By

  • Deepset_ai_Haystack_TransformersSimilarityRanker
