Workflow:FlagOpen FlagEmbedding Reranker Inference

Knowledge Sources	FlagEmbedding BGE Documentation
Domains	Text_Reranking, Information_Retrieval, NLP
Last Updated	2026-02-09 21:30 GMT

Overview

End-to-end process for loading a BGE reranker model and computing relevance scores for query-passage pairs to improve retrieval quality.

Description

This workflow covers the standard procedure for using BGE reranker models to re-score query-passage pairs. Unlike embedding models, which encode queries and passages independently, a reranker takes the concatenated query-passage pair as input and outputs a relevance score directly, so query and passage tokens attend to each other during encoding. The library supports five reranker types: encoder-only base (bge-reranker-base/large), cross-encoder M3-based (bge-reranker-v2-m3), LLM-based instruction-following (bge-reranker-v2-gemma), layerwise depth-adaptive (bge-reranker-v2-minicpm-layerwise), and lightweight compressed (bge-reranker-v2.5-gemma2-lightweight).

Usage

Execute this workflow after an initial retrieval stage (e.g., using embeddings or BM25) to re-rank a candidate set of passages for a given query. The reranker improves precision in the top results by performing deeper cross-attention between query and passage tokens, at higher computational cost per pair.
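The retrieve-then-rerank pattern described above can be sketched as follows. This is a minimal illustration rather than FlagEmbedding code: rerank_candidates and toy_overlap_score are hypothetical names, and the toy scorer only stands in for a real reranker's scoring call so that the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def toy_overlap_score(pairs: Sequence[Sequence[str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query tokens found in the passage."""
    scores = []
    for query, passage in pairs:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        scores.append(len(q_tokens & p_tokens) / max(len(q_tokens), 1))
    return scores


def rerank_candidates(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[Sequence[Sequence[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates and keep the top_k highest-scoring ones."""
    pairs = [[query, passage] for passage in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Candidates as they might come back from an embedding or BM25 retriever.
candidates = [
    "Pandas are bears native to China.",
    "The capital of France is Paris.",
    "Giant pandas eat bamboo.",
]
top = rerank_candidates("what do pandas eat", candidates, toy_overlap_score, top_k=2)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```

In a real pipeline the score_pairs argument would be the reranker's scoring method from Step 4. Because the cost grows linearly with the number of pairs, the candidate set is usually kept small.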

Execution Steps

Step 1: Install FlagEmbedding

Install the FlagEmbedding package. Reranker inference requires only the base package without finetune dependencies.

Key considerations:

Use pip install -U FlagEmbedding for inference-only usage
GPU support with CUDA is recommended for production throughput

Step 2: Load the Reranker Model

Initialize the reranker using the auto-loading factory. FlagAutoReranker.from_finetuned() detects the model architecture and instantiates the appropriate subclass (BaseReranker, BaseLLMReranker, LayerWiseFlagLLMReranker, or LightWeightFlagLLMReranker). Configure device placement, precision, and maximum sequence lengths.

Key considerations:

FlagAutoReranker.from_finetuned() auto-detects the model type from the model name for recognized BGE reranker checkpoints
For custom models, specify model_class explicitly (encoder-only-base, decoder-only-base, decoder-only-layerwise, decoder-only-lightweight)
Layerwise models allow selecting output layers via cutoff_layers for speed-accuracy trade-off
Lightweight models support token compression via compress_ratio and compress_layers
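A minimal sketch of this loading step, assuming the FlagEmbedding package is installed and the checkpoint is available. The helpers build_reranker_kwargs and load_reranker are hypothetical names introduced for illustration. The keyword arguments mirror the options named in this workflow (model_class, devices, query_max_length, passage_max_length) plus a use_fp16 precision flag, and should be checked against the installed library version.

```python
from typing import Any, Dict, List, Optional

# model_class values listed in this workflow for custom checkpoints.
KNOWN_MODEL_CLASSES = {
    "encoder-only-base",
    "decoder-only-base",
    "decoder-only-layerwise",
    "decoder-only-lightweight",
}


def build_reranker_kwargs(
    model_class: Optional[str] = None,
    use_fp16: bool = False,
    devices: Optional[List[str]] = None,
    query_max_length: int = 256,
    passage_max_length: int = 512,
) -> Dict[str, Any]:
    """Collect loader options; optional ones are omitted so the library defaults apply."""
    if model_class is not None and model_class not in KNOWN_MODEL_CLASSES:
        raise ValueError(f"unknown model_class: {model_class!r}")
    kwargs: Dict[str, Any] = {
        "use_fp16": use_fp16,
        "query_max_length": query_max_length,
        "passage_max_length": passage_max_length,
    }
    if model_class is not None:
        kwargs["model_class"] = model_class  # only needed for custom models
    if devices is not None:
        kwargs["devices"] = devices  # e.g. ["cuda:0"] or ["cpu"]
    return kwargs


def load_reranker(model_name_or_path: str, **options: Any):
    """Load a reranker through the auto factory (assumes FlagEmbedding is installed)."""
    # Imported lazily so the option helper above stays usable without the package.
    from FlagEmbedding import FlagAutoReranker

    return FlagAutoReranker.from_finetuned(
        model_name_or_path, **build_reranker_kwargs(**options)
    )


custom_kwargs = build_reranker_kwargs(model_class="encoder-only-base", devices=["cuda:0"])
print(custom_kwargs)
```

A call such as load_reranker("BAAI/bge-reranker-v2-m3", use_fp16=True) would then return the reranker instance. The checkpoint name is only an example, and half precision is mainly useful on GPU.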

Step 3: Prepare Query-Passage Pairs

Format the input as a list of [query, passage] pairs. The reranker encodes each pair jointly to produce a relevance score. To score a single pair, pass one [query, passage] list; to score a batch, pass a list of such pairs.

Key considerations:

Input format is a list of [query, passage] pairs
Both query_max_length and passage_max_length control truncation
Single pairs return a scalar; multiple pairs return a list of scores
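The pair format can be illustrated with a small helper that flattens several queries and their candidate passages into one list of [query, passage] pairs. This is a sketch in plain Python: build_batch_pairs and split_scores are hypothetical helper names, not part of FlagEmbedding.

```python
from typing import List, Sequence, Tuple


def build_batch_pairs(
    queries: Sequence[str],
    passages_per_query: Sequence[Sequence[str]],
) -> Tuple[List[List[str]], List[int]]:
    """Flatten queries and their candidates into one list of [query, passage] pairs.

    Also returns how many pairs each query contributed, so a flat list of
    scores can be split back per query afterwards.
    """
    if len(queries) != len(passages_per_query):
        raise ValueError("need exactly one candidate list per query")
    pairs: List[List[str]] = []
    group_sizes: List[int] = []
    for query, passages in zip(queries, passages_per_query):
        pairs.extend([query, passage] for passage in passages)
        group_sizes.append(len(passages))
    return pairs, group_sizes


def split_scores(scores: Sequence[float], group_sizes: Sequence[int]) -> List[List[float]]:
    """Regroup a flat score list into one list per query."""
    grouped: List[List[float]] = []
    start = 0
    for size in group_sizes:
        grouped.append(list(scores[start:start + size]))
        start += size
    return grouped


pairs, sizes = build_batch_pairs(
    ["what is a panda?", "capital of France"],
    [["The giant panda is a bear.", "Hi."], ["Paris is the capital of France."]],
)
print(pairs)
print(sizes)
```

Keeping every passage in its original position means the i-th score always belongs to the i-th pair, which makes it straightforward to map scores back to candidates.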

Step 4: Compute Relevance Scores

Feed the pairs to compute_score() to obtain raw relevance scores. Optionally apply sigmoid normalization to map scores to the [0, 1] range. Higher scores indicate stronger query-passage relevance.

Key considerations:

Set normalize=True to apply sigmoid and get scores in [0, 1]
Raw scores are unbounded logits useful for relative ranking
Multi-GPU inference is supported via the devices parameter
Batch size is controlled by reranker_batch_size for throughput tuning
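The relationship between raw and normalized scores can be sketched in plain Python. This example assumes only what is stated above, namely that normalization applies a sigmoid to the raw logits. The raw values are made up for illustration, and sigmoid and normalize_scores are hypothetical helper names rather than library functions.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_scores(raw_scores: Sequence[float]) -> List[float]:
    """Apply the sigmoid to each raw relevance logit."""
    return [sigmoid(score) for score in raw_scores]


raw = [-5.65, 0.0, 5.26]  # illustrative logits, not real model output
normalized = normalize_scores(raw)

# The sigmoid is monotonic, so ranking by raw or normalized score gives the same order.
order_raw = sorted(range(len(raw)), key=lambda i: raw[i], reverse=True)
order_norm = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
print(normalized, order_raw == order_norm)
```

Because the ordering is unchanged, normalization matters mainly when scores are compared against a fixed threshold or combined with scores from other systems.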
