Workflow:FlagOpen FlagEmbedding Reranker Inference
| Knowledge Sources | |
|---|---|
| Domains | Text_Reranking, Information_Retrieval, NLP |
| Last Updated | 2026-02-09 21:30 GMT |
Overview
End-to-end process for loading a BGE reranker model and computing relevance scores for query-passage pairs to improve retrieval quality.
Description
This workflow covers the standard procedure for using BGE reranker models to re-score query-passage pairs. Unlike embedding models, which encode queries and passages independently, rerankers take a concatenated query-passage pair as input and directly output a relevance score via cross-attention. The library supports five reranker families: encoder-only base (bge-reranker-base/large), cross-encoder M3-based (bge-reranker-v2-m3), LLM-based instruction-following (bge-reranker-v2-gemma), layerwise depth-adaptive (bge-reranker-v2-minicpm-layerwise), and lightweight compressed (bge-reranker-v2.5-gemma2-lightweight).
Usage
Execute this workflow after an initial retrieval stage (e.g., using embeddings or BM25) to re-rank a candidate set of passages for a given query. The reranker improves precision in the top results by performing deeper cross-attention between query and passage tokens, at higher computational cost per pair.
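The retrieve-then-rerank pattern described above can be sketched in a few lines. This is an illustration only, not FlagEmbedding code: rerank_candidates and overlap_score are hypothetical names, and the token-overlap scorer is a toy stand-in for a real reranker's scoring call.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_candidates(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates and keep the top_k by score."""
    pairs = [[query, passage] for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs: List[List[str]]) -> List[float]:
    """Toy stand-in for a reranker: counts shared lowercase tokens."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


# Candidates as they might come back from an embedding or BM25 retriever.
candidates = [
    "Pandas are bears native to China.",
    "The capital of France is Paris.",
    "Giant pandas eat bamboo.",
]
top = rerank_candidates("what do giant pandas eat", candidates, overlap_score, top_k=2)
print(top)
```

Passing the scorer as a callable keeps the first-stage retriever and the reranker decoupled, so the toy function can later be swapped for a wrapper around a loaded reranker without changing the pipeline.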
Execution Steps
Step 1: Install FlagEmbedding
Install the FlagEmbedding package. Reranker inference requires only the base package without finetune dependencies.
Key considerations:
- Use pip install -U FlagEmbedding for inference-only usage
- GPU support with CUDA is recommended for production throughput
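After installation, a quick environment check can confirm that the package is importable and whether a CUDA device is visible. The sketch below is one possible check, assuming PyTorch is the backend; is_installed and cuda_available are hypothetical helper names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the Python path."""
    return importlib.util.find_spec(module_name) is not None


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a CUDA device."""
    if not is_installed("torch"):
        return False
    import torch  # imported lazily so the check also works without torch

    return torch.cuda.is_available()


print("FlagEmbedding installed:", is_installed("FlagEmbedding"))
print("CUDA available:", cuda_available())
```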
Step 2: Load the Reranker Model
Initialize the reranker using the auto-loading factory. FlagAutoReranker.from_finetuned() detects the model architecture and instantiates the appropriate subclass (BaseReranker, BaseLLMReranker, LayerWiseFlagLLMReranker, or LightWeightFlagLLMReranker). Configure device placement, precision, and maximum sequence lengths.
Key considerations:
- FlagAutoReranker.from_finetuned() auto-detects the model type from the model card
- For custom models, specify model_class explicitly (encoder-only-base, decoder-only-base, decoder-only-layerwise, decoder-only-lightweight)
- Layerwise models allow selecting output layers via cutoff_layers for speed-accuracy trade-off
- Lightweight models support token compression via compress_ratio and compress_layers
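A loading call along the lines described above might look like the following sketch. The keyword names follow those mentioned in this document (model_class, devices, query_max_length, passage_max_length); use_fp16 is assumed to be the precision switch, the BAAI/ hub prefixes and the model-to-class mapping are assumptions inferred from the names listed here, and build_reranker_kwargs and load_reranker are hypothetical helpers. Check all of these against the installed FlagEmbedding version.

```python
from typing import Any, Dict, List, Optional

# Assumed mapping from model families named in this document to the
# model_class strings listed above; verify before relying on it.
ASSUMED_MODEL_CLASSES: Dict[str, str] = {
    "BAAI/bge-reranker-base": "encoder-only-base",
    "BAAI/bge-reranker-large": "encoder-only-base",
    "BAAI/bge-reranker-v2-m3": "encoder-only-base",
    "BAAI/bge-reranker-v2-gemma": "decoder-only-base",
    "BAAI/bge-reranker-v2-minicpm-layerwise": "decoder-only-layerwise",
    "BAAI/bge-reranker-v2.5-gemma2-lightweight": "decoder-only-lightweight",
}


def build_reranker_kwargs(
    model_class: Optional[str] = None,
    use_fp16: bool = True,
    devices: Optional[List[str]] = None,
    query_max_length: int = 256,
    passage_max_length: int = 512,
) -> Dict[str, Any]:
    """Collect loader options, omitting the ones left at None."""
    kwargs: Dict[str, Any] = {
        "use_fp16": use_fp16,
        "query_max_length": query_max_length,
        "passage_max_length": passage_max_length,
    }
    if model_class is not None:
        kwargs["model_class"] = model_class  # only needed for custom models
    if devices is not None:
        kwargs["devices"] = devices  # e.g. ["cuda:0"] or ["cuda:0", "cuda:1"]
    return kwargs


def load_reranker(model_name: str, **options: Any):
    """Load a reranker via the auto factory (requires FlagEmbedding)."""
    from FlagEmbedding import FlagAutoReranker  # lazy import: heavy dependency

    return FlagAutoReranker.from_finetuned(
        model_name, **build_reranker_kwargs(**options)
    )


# Example (downloads model weights on first use):
# reranker = load_reranker("BAAI/bge-reranker-v2-m3", devices=["cuda:0"])
```

The layerwise and lightweight options (cutoff_layers, compress_ratio, compress_layers) are left out of this sketch because where they are passed may depend on the library version.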
Step 3: Prepare Query-Passage Pairs
Format the input as a list of [query, passage] pairs. Each pair is jointly encoded by the reranker to produce a cross-attention relevance score. To score several pairs in one call, pass them together as a list of lists.
Key considerations:
- Input format is a list of [query, passage] pairs
- Both query_max_length and passage_max_length control truncation
- Single pairs return a scalar; multiple pairs return a list of scores
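The pair format described above can be built with plain Python before any model is involved. The helpers below are hypothetical names used for illustration; they only assemble [query, passage] lists.

```python
from typing import List, Sequence


def build_pairs(query: str, passages: Sequence[str]) -> List[List[str]]:
    """Pair one query with each candidate passage as [query, passage]."""
    return [[query, passage] for passage in passages]


def build_batch_pairs(
    queries: Sequence[str], passages_per_query: Sequence[Sequence[str]]
) -> List[List[str]]:
    """Flatten several queries and their candidates into one list of pairs."""
    if len(queries) != len(passages_per_query):
        raise ValueError("need exactly one candidate list per query")
    pairs: List[List[str]] = []
    for query, passages in zip(queries, passages_per_query):
        pairs.extend(build_pairs(query, passages))
    return pairs


pairs = build_pairs(
    "what is a panda?",
    ["The giant panda is a bear native to China.", "Paris is in France."],
)
print(pairs)
```

When several queries are flattened into one list, the caller needs to keep track of how many pairs belong to each query so the returned scores can be split back per query.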
Step 4: Compute Relevance Scores
Feed the pairs to compute_score() to obtain raw relevance scores. Optionally apply sigmoid normalization to map scores to the [0, 1] range. Higher scores indicate stronger query-passage relevance.
Key considerations:
- Set normalize=True to apply sigmoid and get scores in [0, 1]
- Raw scores are unbounded logits useful for relative ranking
- Multi-GPU inference is supported via the devices parameter
- Batch size is controlled by reranker_batch_size for throughput tuning
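The scoring step and the effect of sigmoid normalization can be sketched as follows. The only library call assumed here is compute_score(pairs, normalize=...), as described above; score_and_rank is a hypothetical helper, and StubReranker is a stand-in with fixed logits so the example runs without downloading a model.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function for mapping a logit to a probability-like score."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_and_rank(
    reranker, pairs: Sequence[Sequence[str]], normalize: bool = False
) -> List[Tuple[str, float]]:
    """Score pairs and return (passage, score) sorted from best to worst."""
    scores = reranker.compute_score(list(pairs), normalize=normalize)
    if isinstance(scores, (int, float)):  # a single pair may yield a scalar
        scores = [float(scores)]
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [(pairs[i][1], float(scores[i])) for i in order]


class StubReranker:
    """Hypothetical stand-in that returns preset logits, one per pair."""

    def __init__(self, logits: Sequence[float]):
        self._logits = list(logits)

    def compute_score(self, pairs, normalize: bool = False):
        logits = self._logits[: len(pairs)]
        return [sigmoid(v) for v in logits] if normalize else logits


pairs = [["q", "passage a"], ["q", "passage b"], ["q", "passage c"]]
stub = StubReranker([-2.0, 0.0, 3.5])
raw_ranked = score_and_rank(stub, pairs)
norm_ranked = score_and_rank(stub, pairs, normalize=True)
print(raw_ranked)
print(norm_ranked)
```

Because the sigmoid is monotonic, normalization changes the scale of the scores but not the ranking, so raw logits are sufficient when only the order matters.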