

Workflow:FlagOpen FlagEmbedding Reranker Inference

From Leeroopedia
Revision as of 11:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/FlagOpen_FlagEmbedding_Reranker_Inference.md)


Knowledge Sources
Domains Text_Reranking, Information_Retrieval, NLP
Last Updated 2026-02-09 21:30 GMT

Overview

End-to-end process for loading a BGE reranker model and computing relevance scores for query-passage pairs to improve retrieval quality.

Description

This workflow covers the standard procedure for using BGE reranker models to re-score query-passage pairs. Unlike embedding models, which encode queries and passages independently, a reranker takes the concatenated query-passage pair as a single input and directly outputs a relevance score, so query and passage tokens can attend to each other. The library supports five reranker types: encoder-only base (bge-reranker-base/large), cross-encoder M3-based (bge-reranker-v2-m3), LLM-based instruction-following (bge-reranker-v2-gemma), layerwise depth-adaptive (bge-reranker-v2-minicpm-layerwise), and lightweight compressed (bge-reranker-v2.5-gemma2-lightweight).

Usage

Execute this workflow after an initial retrieval stage (e.g., using embeddings or BM25) to re-rank a candidate set of passages for a given query. The reranker improves precision in the top results by performing deeper cross-attention between query and passage tokens, at higher computational cost per pair.
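The retrieve-then-rerank pattern described above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: rerank and toy_overlap_score are hypothetical names, and the word-overlap scorer is only a stand-in for a real reranker so that the sketch runs without downloading a model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates and return the top_k by score."""
    if not candidates:
        return []
    # One [query, passage] pair per candidate from the first-stage retriever.
    pairs = [[query, passage] for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_overlap_score(pairs: List[List[str]]) -> List[float]:
    """Hypothetical stand-in scorer: number of shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


candidates = [
    "Pandas are bears native to China.",
    "The stock market closed higher today.",
    "A giant panda eats mostly bamboo.",
]
top = rerank("what do giant panda bears eat", candidates, toy_overlap_score, top_k=2)
for passage, score in top:
    print(f"{score:.1f}  {passage}")
```

In a real pipeline, score_fn would wrap the reranker's compute_score() call, and candidates would come from the embedding or BM25 stage.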

Execution Steps

Step 1: Install FlagEmbedding

Install the FlagEmbedding package. Reranker inference requires only the base package without finetune dependencies.

Key considerations:

  • Use pip install -U FlagEmbedding for inference-only usage
  • GPU support with CUDA is recommended for production throughput
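After installation, a quick environment check can confirm that the package is visible and whether a CUDA device is usable. The sketch below uses only the standard library plus an optional torch import; package_version and cuda_available are hypothetical helper names, and the check assumes the distribution is registered under the name FlagEmbedding.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def cuda_available() -> bool:
    """Report CUDA availability; False when torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works without torch

    return torch.cuda.is_available()


print("FlagEmbedding version:", package_version("FlagEmbedding"))
print("CUDA available:", cuda_available())
```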

Step 2: Load the Reranker Model

Initialize the reranker using the auto-loading factory. FlagAutoReranker.from_finetuned() detects the model architecture and instantiates the appropriate subclass (BaseReranker, BaseLLMReranker, LayerWiseFlagLLMReranker, or LightWeightFlagLLMReranker). Configure device placement, precision, and maximum sequence lengths.

Key considerations:

  • FlagAutoReranker.from_finetuned() auto-detects the model type from the model name (the final component of the path or model ID)
  • For custom models, specify model_class explicitly (encoder-only-base, decoder-only-base, decoder-only-layerwise, decoder-only-lightweight)
  • Layerwise models allow selecting output layers via cutoff_layers for speed-accuracy trade-off
  • Lightweight models support token compression via compress_ratio and compress_layers
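The loading step might look like the sketch below. The model_class values, devices, query_max_length, and passage_max_length come from this page; use_fp16 and the "BAAI/" model ID prefix are assumptions based on the library's published examples and should be checked against the installed version. build_reranker_kwargs and load_reranker are hypothetical helpers, and the FlagEmbedding import is deferred so the configuration logic can be exercised without the package or any model weights.

```python
from typing import Any, Dict, List, Optional

# model_class values listed in this workflow for custom models.
ALLOWED_MODEL_CLASSES = {
    "encoder-only-base",
    "decoder-only-base",
    "decoder-only-layerwise",
    "decoder-only-lightweight",
}


def build_reranker_kwargs(
    model_class: Optional[str] = None,
    devices: Optional[List[str]] = None,
    use_fp16: bool = True,  # assumption: half-precision flag
    query_max_length: int = 256,
    passage_max_length: int = 512,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the auto-loading factory."""
    kwargs: Dict[str, Any] = {
        "use_fp16": use_fp16,
        "query_max_length": query_max_length,
        "passage_max_length": passage_max_length,
    }
    if model_class is not None:
        if model_class not in ALLOWED_MODEL_CLASSES:
            raise ValueError(f"unknown model_class: {model_class!r}")
        kwargs["model_class"] = model_class
    if devices is not None:
        kwargs["devices"] = devices
    return kwargs


def load_reranker(model_name_or_path: str, **kwargs: Any):
    """Load a reranker; needs FlagEmbedding installed and model weights."""
    from FlagEmbedding import FlagAutoReranker  # deferred import

    return FlagAutoReranker.from_finetuned(model_name_or_path, **kwargs)


kwargs = build_reranker_kwargs(devices=["cuda:0"])
print(kwargs)
# Uncomment to actually load a model (downloads weights on first use):
# reranker = load_reranker("BAAI/bge-reranker-v2-m3", **kwargs)
```

For a custom or renamed checkpoint, passing model_class (for example "decoder-only-layerwise") avoids relying on auto-detection.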

Step 3: Prepare Query-Passage Pairs

Format each input as a [query, passage] pair. The reranker encodes the two texts jointly to produce a cross-attention relevance score. To score many pairs in one call, provide a list of pairs (a list of lists).

Key considerations:

  • Input format is a list of [query, passage] pairs
  • Both query_max_length and passage_max_length control truncation
  • Single pairs return a scalar; multiple pairs return a list of scores
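Building the pair list from one query and a set of retrieved passages can be sketched as follows. make_pairs is a hypothetical helper, not part of the library; skipping blank passages and returning the surviving indices is an illustrative design choice so that scores can be mapped back to the original candidate list.

```python
from typing import List, Tuple


def make_pairs(query: str, passages: List[str]) -> Tuple[List[List[str]], List[int]]:
    """Build [query, passage] pairs, skipping blank passages.

    Returns the pairs and the original index of each kept passage.
    """
    pairs: List[List[str]] = []
    kept: List[int] = []
    for index, passage in enumerate(passages):
        text = passage.strip()
        if text:
            pairs.append([query, text])
            kept.append(index)
    return pairs, kept


pairs, kept = make_pairs(
    "what is a panda?",
    ["The giant panda is a bear.", "   ", "hi"],
)
print(pairs)
print(kept)
```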

Step 4: Compute Relevance Scores

Feed the pairs to compute_score() to obtain raw relevance scores. Optionally apply sigmoid normalization to map scores to the [0, 1] range. Higher scores indicate stronger query-passage relevance.

Key considerations:

  • Set normalize=True to apply sigmoid and get scores in [0, 1]
  • Raw scores are unbounded logits useful for relative ranking
  • Multi-GPU inference is supported via the devices parameter
  • Batch size is controlled by reranker_batch_size for throughput tuning
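The effect of sigmoid normalization can be illustrated without a model. The sketch below re-implements the logistic function on made-up raw scores; it is not the library's code. It shows that normalization maps logits into (0, 1) while leaving the ranking unchanged, so normalize=True matters for thresholds and display, not for ordering.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ranking(scores: List[float]) -> List[int]:
    """Indices of scores sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Illustrative raw logits, not real model output.
raw_scores = [-5.0, 0.0, 4.0]
normalized = [sigmoid(s) for s in raw_scores]

print(normalized)
print(ranking(raw_scores) == ranking(normalized))
```

With a loaded reranker, the equivalent call per this workflow would be compute_score(pairs, normalize=True).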

GitHub URL

Workflow Repository