
Workflow:FlagOpen FlagEmbedding Benchmark Evaluation

From Leeroopedia


Knowledge Sources
Domains Evaluation, Information_Retrieval, Benchmarking
Last Updated 2026-02-09 21:30 GMT

Overview

This workflow describes the end-to-end process for evaluating BGE embedding and reranker models on standard information retrieval benchmarks, including MTEB, BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, BRIGHT, and custom datasets.

Description

This workflow covers the standard evaluation pipeline for BGE models across multiple retrieval benchmarks. The evaluation framework follows a consistent pattern: load the model, encode the corpus and queries, perform top-k retrieval using FAISS indexing, optionally rerank with a cross-encoder, and compute metrics (nDCG, MRR, Recall). Each benchmark has a dedicated runner module accessible via python -m FlagEmbedding.evaluation.{benchmark}. The framework supports multi-GPU corpus encoding, both embedder-only and embedder+reranker evaluation pipelines, and multiple output formats.

Usage

Execute this workflow after fine-tuning a model to measure its performance on standard benchmarks, or to compare different models on the same evaluation suite. It is also useful for establishing baselines before fine-tuning, or for verifying that a fine-tuned model has not regressed on general-purpose benchmarks.

Execution Steps

Step 1: Install Evaluation Dependencies

Install the FlagEmbedding package along with evaluation-specific dependencies: pytrec_eval for retrieval metrics and faiss-gpu for efficient similarity search.

Key considerations:

  • Install pytrec_eval (or pytrec-eval-terrier as fallback)
  • Install faiss-gpu for GPU-accelerated nearest neighbor search
  • MTEB evaluation requires the mteb package
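Because pip package names and importable module names differ here (for example, the fallback package pytrec-eval-terrier still installs the pytrec_eval module, and both faiss-gpu and faiss-cpu install faiss), a quick sanity check after installation can save a failed run. The sketch below is illustrative, not part of the framework:

```python
# Sketch: verify that the evaluation dependencies from Step 1 are importable.
# Note that pip names differ from module names (e.g. "pytrec-eval-terrier"
# installs the "pytrec_eval" module; "faiss-gpu" installs "faiss").
import importlib.util

def missing_modules(modules):
    """Return the subset of `modules` that cannot be imported."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

# Modules needed for FlagEmbedding evaluation runs:
required = ["pytrec_eval", "faiss", "mteb"]
# missing_modules(required) -> list any that still need installing
```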

Step 2: Select Benchmark and Configure Parameters

Choose the target benchmark and configure evaluation parameters including dataset directory, dataset names/languages, splits, top-k values, metric selections, and output format. Each benchmark has specific dataset configurations and default settings.

Supported benchmarks:

  • MTEB: Multi-task embedding benchmark (English/multilingual, multiple task types)
  • BEIR: Heterogeneous retrieval benchmark (18 datasets)
  • MSMARCO: Microsoft passage/document retrieval (dev, dl19, dl20 splits)
  • MIRACL: Multilingual information retrieval across 18 languages
  • MLDR: Multilingual long document retrieval
  • MKQA: Cross-lingual question answering across 25 languages
  • AIR-Bench: Out-of-distribution retrieval benchmark
  • BRIGHT: Reasoning-intensive retrieval
  • Custom: User-defined datasets with corpus.jsonl, queries.jsonl, qrels.jsonl

Key considerations:

  • dataset_dir can be a local path or a download location
  • dataset_names selects specific subsets (e.g., language codes for MIRACL)
  • eval_metrics defaults to ndcg_at_10 and recall_at_100
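For the Custom benchmark, the three files listed above hold the corpus, the queries, and the relevance judgments, one JSON object per line. The sketch below writes a minimal example; the exact field names ("id", "text", "qid", "docid", "relevance") are an assumed schema for illustration, so check the FlagEmbedding custom-evaluation documentation for the fields your version expects:

```python
# Sketch: build a minimal custom dataset in the three-file layout named in
# Step 2 (corpus.jsonl, queries.jsonl, qrels.jsonl). Field names here are
# an assumed schema, not a verified contract of the framework.
import json
from pathlib import Path

def write_jsonl(path, rows):
    """Write one JSON object per line, the usual .jsonl convention."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def build_custom_dataset(dataset_dir):
    d = Path(dataset_dir)
    d.mkdir(parents=True, exist_ok=True)
    write_jsonl(d / "corpus.jsonl", [
        {"id": "doc0", "text": "FAISS builds indexes for nearest neighbor search."},
        {"id": "doc1", "text": "BGE models produce dense text embeddings."},
    ])
    write_jsonl(d / "queries.jsonl", [
        {"id": "q0", "text": "what library performs top-k similarity search?"},
    ])
    write_jsonl(d / "qrels.jsonl", [
        {"qid": "q0", "docid": "doc0", "relevance": 1},
    ])
```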

Step 3: Load Models

Specify the embedder and optionally a reranker for two-stage evaluation. The evaluation runner loads both models and distributes them across available GPUs. Model-specific parameters (model_class, pooling_method, instructions) must be configured to match the model architecture.

Key considerations:

  • Set embedder_name_or_path and optionally reranker_name_or_path
  • For custom models, set embedder_model_class and reranker_model_class explicitly
  • Multi-GPU support via the devices parameter distributes corpus encoding
  • Batch sizes (embedder_batch_size, reranker_batch_size) affect memory and throughput
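The parameters above can be pictured as one configuration object. The sketch below mirrors the flag names described in the text; the defaults are illustrative assumptions, not the framework's actual defaults:

```python
# Sketch: the Step 3 model-loading parameters gathered into a config object.
# Field names mirror the flags named in the text; defaults are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ModelEvalConfig:
    embedder_name_or_path: str
    reranker_name_or_path: Optional[str] = None   # None => embedder-only pipeline
    embedder_model_class: Optional[str] = None    # set explicitly for custom models
    reranker_model_class: Optional[str] = None
    devices: List[str] = field(default_factory=lambda: ["cuda:0"])
    embedder_batch_size: int = 256                # trades memory for throughput
    reranker_batch_size: int = 32

    @property
    def two_stage(self) -> bool:
        """True when a reranker is configured (retrieve-then-rerank)."""
        return self.reranker_name_or_path is not None
```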

Step 4: Encode Corpus and Run Retrieval

The runner encodes the full corpus into embeddings (optionally saving to disk for reuse), encodes queries with appropriate instructions, and performs top-k nearest neighbor search using FAISS indexing. For benchmarks with multiple datasets or languages, this process repeats for each subset.

Key considerations:

  • corpus_embd_save_dir enables caching corpus embeddings for repeated evaluations
  • search_top_k controls initial retrieval depth (typically 1000)
  • Corpus encoding is distributed across GPUs for large datasets
  • Embeddings are optionally normalized for cosine similarity search
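The computation behind this step can be made explicit in a few lines: normalize the vectors, then take the top-k corpus entries by inner product for each query (inner product on normalized vectors is cosine similarity). FAISS performs exactly this at scale with an index; the pure-Python sketch below only illustrates the math:

```python
# Sketch of what the retrieval step computes: normalize embeddings, then
# take the top-k corpus vectors by inner product per query. FAISS does this
# at scale via an index; this version just makes the math explicit.
import math

def normalize(vec):
    """Scale a vector to unit length (inner product then equals cosine)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def search_top_k(query_embs, corpus_embs, k):
    """Return, per query, the top-k (corpus_index, score) pairs by cosine."""
    corpus = [normalize(v) for v in corpus_embs]
    results = []
    for q in query_embs:
        qn = normalize(q)
        scores = [(i, sum(a * b for a, b in zip(qn, c)))
                  for i, c in enumerate(corpus)]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scores[:k])
    return results
```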

Step 5: Rerank Results (Optional)

If a reranker is specified, the top-k retrieved passages are re-scored using the cross-encoder. The reranker processes query-passage pairs jointly and produces more accurate relevance scores for the candidate set.

Key considerations:

  • rerank_top_k controls how many candidates from retrieval are reranked (typically 100)
  • Reranking is significantly slower than retrieval but improves precision
  • Layerwise and lightweight rerankers offer speed-accuracy trade-offs
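The rerank step reduces to: rescore the head of the candidate list with the cross-encoder, re-sort by the new scores, and leave the tail in retrieval order. In the sketch below, cross_encoder_score is a stand-in placeholder for a real reranker's joint query-passage scoring, not an API of the framework:

```python
# Sketch of the optional rerank step: rescore only the top `rerank_top_k`
# retrieval candidates with a (stand-in) cross-encoder score and re-sort.
def rerank(candidates, query, passages, cross_encoder_score, rerank_top_k=100):
    """candidates: list of (passage_index, retrieval_score), best first."""
    head = candidates[:rerank_top_k]   # rescore only the head...
    tail = candidates[rerank_top_k:]   # ...keep the tail in retrieval order
    rescored = [(idx, cross_encoder_score(query, passages[idx]))
                for idx, _ in head]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail
```

This structure is also why reranking cost scales with rerank_top_k rather than corpus size: the cross-encoder only ever sees the retrieved candidates.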

Step 6: Compute Metrics and Output Results

Evaluate the ranked results against ground-truth relevance labels using standard IR metrics. Output results in JSON or markdown format for analysis and comparison.

Key considerations:

  • Standard metrics include nDCG@k, MRR@k, Recall@k, MAP@k
  • k_values parameter controls which cutoffs are reported
  • eval_output_method supports json and markdown formats
  • Results are saved to eval_output_path for persistence
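The metrics themselves follow the standard IR definitions (in the framework they are computed via pytrec_eval; the pure-Python sketch below is only for illustrating the formulas on one ranked list):

```python
# Sketch of the Step 6 metrics on one ranked list. `ranked` is the retrieved
# doc ids in order; `qrels` maps doc id -> graded relevance label.
import math

def dcg_at_k(rels, k):
    # DCG@k = sum of rel_i / log2(i + 1) for 1-indexed rank i
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked, qrels, k):
    gains = [qrels.get(doc, 0) for doc in ranked]
    ideal = sorted(qrels.values(), reverse=True)   # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked, qrels, k):
    relevant = {d for d, r in qrels.items() if r > 0}
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr_at_k(ranked, qrels, k):
    for i, doc in enumerate(ranked[:k]):
        if qrels.get(doc, 0) > 0:
            return 1.0 / (i + 1)   # reciprocal rank of first relevant hit
    return 0.0
```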

Execution Diagram

GitHub URL

Workflow Repository