Workflow: FlagOpen FlagEmbedding Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Information_Retrieval, Benchmarking |
| Last Updated | 2026-02-09 21:30 GMT |
Overview
End-to-end process for evaluating BGE embedding and reranker models on standard information retrieval benchmarks including MTEB, BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, BRIGHT, and custom datasets.
Description
This workflow covers the standard evaluation pipeline for BGE models across multiple retrieval benchmarks. The evaluation framework follows a consistent pattern: load the model, encode the corpus and queries, perform top-k retrieval using FAISS indexing, optionally rerank with a cross-encoder, and compute metrics (nDCG, MRR, Recall). Each benchmark has a dedicated runner module accessible via python -m FlagEmbedding.evaluation.{benchmark}. The framework supports multi-GPU corpus encoding, both embedder-only and embedder+reranker evaluation pipelines, and multiple output formats.
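As a concrete illustration of this pattern, the sketch below runs the BEIR runner with both an embedder and a reranker. The flag names follow the FlagEmbedding evaluation examples, while the dataset/output paths and the choice of BAAI/bge-large-en-v1.5 plus BAAI/bge-reranker-v2-m3 are placeholder assumptions to adapt to your own setup.

```bash
# Sketch: two-stage (embedder + reranker) evaluation on two BEIR datasets.
# Paths and model choices are illustrative assumptions.
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa arguana \
    --splits test \
    --corpus_embd_save_dir ./beir/corpus_embd \
    --output_dir ./beir/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --k_values 10 100 \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --eval_output_method markdown \
    --eval_output_path ./beir/eval_results.md \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1
```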
Usage
Execute this workflow after fine-tuning a model to measure its performance on standard benchmarks, or to compare different models on the same evaluation suite. Also useful for establishing baselines before fine-tuning or for validating that a fine-tuned model has not regressed on general-purpose benchmarks.
Execution Steps
Step 1: Install Evaluation Dependencies
Install the FlagEmbedding package along with evaluation-specific dependencies: pytrec_eval for retrieval metrics and faiss-gpu for efficient similarity search.
Key considerations:
- Install pytrec_eval (or pytrec-eval-terrier as fallback)
- Install faiss-gpu for GPU-accelerated nearest neighbor search
- MTEB evaluation requires the mteb package
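A minimal install sketch, assuming a CUDA-capable machine and the PyPI package names; on some platforms FAISS is easier to install from conda than from pip.

```bash
# Core package plus evaluation dependencies.
pip install -U FlagEmbedding

# Retrieval metrics (pytrec-eval-terrier is a fallback if the
# pytrec_eval build fails on your platform).
pip install pytrec_eval || pip install pytrec-eval-terrier

# GPU-accelerated nearest-neighbor search; alternatively install FAISS
# via conda (e.g. conda install -c pytorch faiss-gpu).
pip install faiss-gpu

# Only needed for the MTEB runner.
pip install mteb
```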
Step 2: Select Benchmark and Configure Parameters
Choose the target benchmark and configure evaluation parameters including dataset directory, dataset names/languages, splits, top-k values, metric selections, and output format. Each benchmark has specific dataset configurations and default settings.
Supported benchmarks:
- MTEB: Multi-task embedding benchmark (English/multilingual, multiple task types)
- BEIR: Heterogeneous retrieval benchmark (18 datasets)
- MSMARCO: Microsoft passage/document retrieval (dev, dl19, dl20 splits)
- MIRACL: Multilingual information retrieval across 18 languages
- MLDR: Multilingual long document retrieval
- MKQA: Cross-lingual question answering across 25 languages
- AIR-Bench: Out-of-distribution retrieval benchmark
- BRIGHT: Reasoning-intensive retrieval
- Custom: User-defined datasets with corpus.jsonl, queries.jsonl, qrels.jsonl
Key considerations:
- dataset_dir can be a local path or a download location
- dataset_names selects specific subsets (e.g., language codes for MIRACL)
- eval_metrics defaults to ndcg_at_10 and recall_at_100
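For example, a MIRACL run restricted to a few languages might look like the sketch below; the language codes, dataset directory, and output paths are illustrative assumptions.

```bash
# Sketch: evaluate selected MIRACL languages on the dev split.
# dataset_names takes language codes for MIRACL; other benchmarks take
# dataset names instead (e.g. fiqa, arguana for BEIR).
python -m FlagEmbedding.evaluation.miracl \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names bn hi sw te \
    --splits dev \
    --output_dir ./miracl/search_results \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:0
```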
Step 3: Load Models
Specify the embedder and optionally a reranker for two-stage evaluation. The evaluation runner loads both models and replicates them across the available GPUs so encoding runs in parallel. Model-specific parameters (model_class, pooling_method, instructions) must be configured to match the model architecture.
Key considerations:
- Set embedder_name_or_path and optionally reranker_name_or_path
- For custom models, set embedder_model_class and reranker_model_class explicitly
- Multi-GPU support via the devices parameter distributes corpus encoding
- Batch sizes (embedder_batch_size, reranker_batch_size) affect memory and throughput
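A sketch of the model-loading flags for a fine-tuned checkpoint that is not auto-detected by name, assuming an encoder-only model with CLS pooling; the model path, instruction string, batch size, and the exact flag spellings are assumptions to verify against your installed version.

```bash
# Sketch: explicit model class, pooling, instruction, and multi-GPU settings
# for a custom fine-tuned embedder (all values illustrative).
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa \
    --splits test \
    --output_dir ./beir/search_results \
    --embedder_name_or_path ./output/my-finetuned-bge \
    --embedder_model_class encoder-only-base \
    --pooling_method cls \
    --query_instruction_for_retrieval "Represent this sentence for searching relevant passages: " \
    --embedder_batch_size 512 \
    --devices cuda:0 cuda:1 cuda:2 cuda:3
```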
Step 4: Encode Corpus and Run Retrieval
The runner encodes the full corpus into embeddings (optionally saving to disk for reuse), encodes queries with appropriate instructions, and performs top-k nearest neighbor search using FAISS indexing. For benchmarks with multiple datasets or languages, this process repeats for each subset.
Key considerations:
- corpus_embd_save_dir enables caching corpus embeddings for repeated evaluations
- search_top_k controls initial retrieval depth (typically 1000)
- Corpus encoding is distributed across GPUs for large datasets
- Embeddings are optionally normalized for cosine similarity search
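A sketch focused on the retrieval stage, assuming MSMARCO passage data; the corpus embedding cache directory is an arbitrary choice and only pays off when the same corpus is evaluated more than once.

```bash
# Sketch: cache corpus embeddings and retrieve the top 1000 candidates.
# Re-running with the same corpus_embd_save_dir skips corpus re-encoding.
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./msmarco/corpus_embd \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --devices cuda:0 cuda:1
```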
Step 5: Rerank Results (Optional)
If a reranker is specified, the top-k retrieved passages are re-scored using the cross-encoder. The reranker processes query-passage pairs jointly and produces more accurate relevance scores for the candidate set.
Key considerations:
- rerank_top_k controls how many candidates from retrieval are reranked (typically 100)
- Reranking is significantly slower than retrieval but improves precision
- Layerwise and lightweight rerankers offer speed-accuracy trade-offs
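Adding a reranker to the previous sketch only requires the reranker flags; the batch size and max-length values below are illustrative assumptions.

```bash
# Sketch: rerank the top 100 retrieved passages with a cross-encoder.
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --reranker_batch_size 256 \
    --reranker_max_length 1024 \
    --devices cuda:0
```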
Step 6: Compute Metrics and Output Results
Evaluate the ranked results against ground-truth relevance labels using standard IR metrics. Output results in JSON or markdown format for analysis and comparison.
Key considerations:
- Standard metrics include nDCG@k, MRR@k, Recall@k, MAP@k
- k_values parameter controls which cutoffs are reported
- eval_output_method supports json and markdown formats
- Results are saved to eval_output_path for persistence
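A sketch of the metric and output flags, assuming a custom dataset in the corpus.jsonl/queries.jsonl/qrels.jsonl layout from Step 2; the dataset name, paths, and metric cutoffs are illustrative, and the metric names follow the pytrec_eval convention (ndcg_at_10, recall_at_100, mrr_at_10, and so on).

```bash
# Sketch: report nDCG/Recall/MRR at chosen cutoffs and write a markdown table.
python -m FlagEmbedding.evaluation.custom \
    --eval_name my_corpus \
    --dataset_dir ./my_corpus/data \
    --splits test \
    --output_dir ./my_corpus/search_results \
    --k_values 10 100 \
    --eval_metrics ndcg_at_10 recall_at_100 mrr_at_10 \
    --eval_output_method markdown \
    --eval_output_path ./my_corpus/eval_results.md \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --devices cuda:0
```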