Workflow: FlagOpen FlagEmbedding Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Information_Retrieval, Benchmarking |
| Last Updated | 2026-02-09 21:30 GMT |
Overview
End-to-end process for evaluating BGE embedding and reranker models on standard information retrieval benchmarks including MTEB, BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, BRIGHT, and custom datasets.
Description
This workflow covers the standard evaluation pipeline for BGE models across multiple retrieval benchmarks. The evaluation framework follows a consistent pattern: load the model, encode the corpus and queries, perform top-k retrieval using FAISS indexing, optionally rerank with a cross-encoder, and compute metrics (nDCG, MRR, Recall). Each benchmark has a dedicated runner module accessible via python -m FlagEmbedding.evaluation.{benchmark}. The framework supports multi-GPU corpus encoding, both embedder-only and embedder+reranker evaluation pipelines, and multiple output formats.
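As a concrete illustration of this pattern, the sketch below runs the BEIR runner with both an embedder and a reranker. The flag names follow the FlagEmbedding evaluation examples, while the dataset/output paths and the choice of BAAI/bge-large-en-v1.5 plus BAAI/bge-reranker-v2-m3 are placeholder assumptions to adapt to your own setup.

```bash
# Sketch: two-stage (embedder + reranker) evaluation on two BEIR datasets.
# Paths and model choices are illustrative assumptions.
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa arguana \
    --splits test \
    --corpus_embd_save_dir ./beir/corpus_embd \
    --output_dir ./beir/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --k_values 10 100 \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --eval_output_method markdown \
    --eval_output_path ./beir/eval_results.md \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1
```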
Usage
Execute this workflow after fine-tuning a model to measure its performance on standard benchmarks, or to compare different models on the same evaluation suite. Also useful for establishing baselines before fine-tuning or for validating that a fine-tuned model has not regressed on general-purpose benchmarks.
Execution Steps
Step 1: Install Evaluation Dependencies
Install the FlagEmbedding package along with evaluation-specific dependencies: pytrec_eval for retrieval metrics and faiss-gpu for efficient similarity search.
Key considerations:
- Install pytrec_eval (or pytrec-eval-terrier as fallback)
- Install faiss-gpu for GPU-accelerated nearest neighbor search
- MTEB evaluation requires the mteb package
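A minimal install sketch, assuming a CUDA-capable machine and the PyPI package names; on some platforms FAISS is easier to install from conda than from pip.

```bash
# Core package plus evaluation dependencies.
pip install -U FlagEmbedding

# Retrieval metrics (pytrec-eval-terrier is a fallback if the
# pytrec_eval build fails on your platform).
pip install pytrec_eval || pip install pytrec-eval-terrier

# GPU-accelerated nearest-neighbor search; alternatively install FAISS
# via conda (e.g. conda install -c pytorch faiss-gpu).
pip install faiss-gpu

# Only needed for the MTEB runner.
pip install mteb
```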
Step 2: Select Benchmark and Configure Parameters
Choose the target benchmark and configure evaluation parameters including dataset directory, dataset names/languages, splits, top-k values, metric selections, and output format. Each benchmark has specific dataset configurations and default settings.
Supported benchmarks:
- MTEB: Multi-task embedding benchmark (English/multilingual, multiple task types)
- BEIR: Heterogeneous retrieval benchmark (18 datasets)
- MSMARCO: Microsoft passage/document retrieval (dev, dl19, dl20 splits)
- MIRACL: Multilingual information retrieval across 18 languages
- MLDR: Multilingual long document retrieval
- MKQA: Cross-lingual question answering across 25 languages
- AIR-Bench: Out-of-distribution retrieval benchmark
- BRIGHT: Reasoning-intensive retrieval
- Custom: User-defined datasets with corpus.jsonl, queries.jsonl, qrels.jsonl
Key considerations:
- dataset_dir can be a local path or a download location
- dataset_names selects specific subsets (e.g., language codes for MIRACL)
- eval_metrics defaults to ndcg_at_10 and recall_at_100
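For example, a MIRACL run restricted to a few languages might look like the sketch below; the language codes, dataset directory, and output paths are illustrative assumptions.

```bash
# Sketch: evaluate selected MIRACL languages on the dev split.
# dataset_names takes language codes for MIRACL; other benchmarks take
# dataset names instead (e.g. fiqa, arguana for BEIR).
python -m FlagEmbedding.evaluation.miracl \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names bn hi sw te \
    --splits dev \
    --output_dir ./miracl/search_results \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:0
```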
Step 3: Load Models
Specify the embedder and optionally a reranker for two-stage evaluation. The evaluation runner loads both models and replicates them across the available GPUs so encoding runs in parallel. Model-specific parameters (model_class, pooling_method, instructions) must be configured to match the model architecture.
Key considerations:
- Set embedder_name_or_path and optionally reranker_name_or_path
- For custom models, set embedder_model_class and reranker_model_class explicitly
- Multi-GPU support via the devices parameter distributes corpus encoding
- Batch sizes (embedder_batch_size, reranker_batch_size) affect memory and throughput
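A sketch of the model-loading flags for a fine-tuned checkpoint that is not auto-detected by name, assuming an encoder-only model with CLS pooling; the model path, instruction string, batch size, and the exact flag spellings are assumptions to verify against your installed version.

```bash
# Sketch: explicit model class, pooling, instruction, and multi-GPU settings
# for a custom fine-tuned embedder (all values illustrative).
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa \
    --splits test \
    --output_dir ./beir/search_results \
    --embedder_name_or_path ./output/my-finetuned-bge \
    --embedder_model_class encoder-only-base \
    --pooling_method cls \
    --query_instruction_for_retrieval "Represent this sentence for searching relevant passages: " \
    --embedder_batch_size 512 \
    --devices cuda:0 cuda:1 cuda:2 cuda:3
```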
Step 4: Encode Corpus and Run Retrieval
The runner encodes the full corpus into embeddings (optionally saving to disk for reuse), encodes queries with appropriate instructions, and performs top-k nearest neighbor search using FAISS indexing. For benchmarks with multiple datasets or languages, this process repeats for each subset.
Key considerations:
- corpus_embd_save_dir enables caching corpus embeddings for repeated evaluations
- search_top_k controls initial retrieval depth (typically 1000)
- Corpus encoding is distributed across GPUs for large datasets
- Embeddings are optionally normalized for cosine similarity search
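A sketch focused on the retrieval stage, assuming MSMARCO passage data; the corpus embedding cache directory is an arbitrary choice and only pays off when the same corpus is evaluated more than once.

```bash
# Sketch: cache corpus embeddings and retrieve the top 1000 candidates.
# Re-running with the same corpus_embd_save_dir skips corpus re-encoding.
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./msmarco/corpus_embd \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --devices cuda:0 cuda:1
```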
Step 5: Rerank Results (Optional)
If a reranker is specified, the top-k retrieved passages are re-scored using the cross-encoder. The reranker processes query-passage pairs jointly and produces more accurate relevance scores for the candidate set.
Key considerations:
- rerank_top_k controls how many candidates from retrieval are reranked (typically 100)
- Reranking is significantly slower than retrieval but improves precision
- Layerwise and lightweight rerankers offer speed-accuracy trade-offs
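Adding a reranker to the previous sketch only requires the reranker flags; the batch size and max-length values below are illustrative assumptions.

```bash
# Sketch: rerank the top 100 retrieved passages with a cross-encoder.
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --reranker_batch_size 256 \
    --reranker_max_length 1024 \
    --devices cuda:0
```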
Step 6: Compute Metrics and Output Results
Evaluate the ranked results against ground-truth relevance labels using standard IR metrics. Output results in JSON or markdown format for analysis and comparison.
Key considerations:
- Standard metrics include nDCG@k, MRR@k, Recall@k, MAP@k
- k_values parameter controls which cutoffs are reported
- eval_output_method supports json and markdown formats
- Results are saved to eval_output_path for persistence
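A sketch of the metric and output flags, assuming a custom dataset in the corpus.jsonl/queries.jsonl/qrels.jsonl layout from Step 2; the dataset name, paths, and metric cutoffs are illustrative, and the metric names follow the pytrec_eval convention (ndcg_at_10, recall_at_100, mrr_at_10, and so on).

```bash
# Sketch: report nDCG/Recall/MRR at chosen cutoffs and write a markdown table.
python -m FlagEmbedding.evaluation.custom \
    --eval_name my_corpus \
    --dataset_dir ./my_corpus/data \
    --splits test \
    --output_dir ./my_corpus/search_results \
    --k_values 10 100 \
    --eval_metrics ndcg_at_10 recall_at_100 mrr_at_10 \
    --eval_output_method markdown \
    --eval_output_path ./my_corpus/eval_results.md \
    --embedder_name_or_path BAAI/bge-large-en-v1.5 \
    --devices cuda:0
```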