Principle:FlagOpen FlagEmbedding Evaluation Configuration
| Sources | Repo: FlagOpen/FlagEmbedding, Doc: BEIR Benchmark |
|---|---|
| Domains | Information_Retrieval, Evaluation |
Overview
A configuration system that defines evaluation task parameters, model paths, and metric specifications for benchmarking BGE embedding and reranker models.
Description
FlagEmbedding's evaluation framework splits its settings across two argument dataclasses, one for task configuration and one for model configuration.
AbsEvalArgs defines the evaluation task parameters:
- eval_name: The name of the evaluation task (e.g., msmarco, beir, miracl)
- dataset_dir: Path to the dataset directory or download location
- dataset_names: Specific dataset names to evaluate (supports multiple via nargs="+")
- splits: Dataset splits to evaluate (default: "test", supports multiple via nargs="+")
- output_dir: Path to save search results (default: "./search_results")
- search_top_k: Number of top results for retrieval (default: 1000)
- rerank_top_k: Number of top results for reranking (default: 100)
- k_values: k values for metric computation (default: [1, 3, 5, 10, 100, 1000])
- eval_output_method: Output format, either "json" or "markdown" (default: "markdown")
- eval_output_path: Path to save evaluation results (default: "./eval_results.md")
- eval_metrics: Metrics to compute (default: ["ndcg_at_10", "recall_at_10"])
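As a rough illustration of how the task parameters above fit together, the following sketch declares a stand-in dataclass with the documented defaults. The class name `EvalArgsSketch` is hypothetical, and the `None` defaults for `eval_name`, `dataset_dir`, and `dataset_names` are assumptions, since the list above does not state them. It is not the library's actual `AbsEvalArgs` source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalArgsSketch:
    """Illustrative stand-in for AbsEvalArgs (hypothetical, not the library's source)."""

    # Defaults for these three fields are not documented above; None is a placeholder.
    eval_name: Optional[str] = None
    dataset_dir: Optional[str] = None
    dataset_names: Optional[List[str]] = None
    # List form of the documented default "test", since several splits may be given.
    splits: List[str] = field(default_factory=lambda: ["test"])
    output_dir: str = "./search_results"
    search_top_k: int = 1000
    rerank_top_k: int = 100
    k_values: List[int] = field(default_factory=lambda: [1, 3, 5, 10, 100, 1000])
    eval_output_method: str = "markdown"
    eval_output_path: str = "./eval_results.md"
    eval_metrics: List[str] = field(
        default_factory=lambda: ["ndcg_at_10", "recall_at_10"]
    )

    def __post_init__(self) -> None:
        # Check added by this sketch: only "json" and "markdown" are documented.
        if self.eval_output_method not in ("json", "markdown"):
            raise ValueError(
                f"unsupported eval_output_method: {self.eval_output_method!r}"
            )


# Only the task-specific fields need to be supplied; the rest fall back to defaults.
args = EvalArgsSketch(eval_name="beir", dataset_names=["nfcorpus"])
print(args.search_top_k, args.eval_metrics)
```

Mutable defaults such as `k_values` go through `default_factory` so that each instance gets its own list.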
AbsEvalModelArgs defines the model configuration:
- embedder_name_or_path: Path or name of the embedding model (required)
- embedder_model_class: Model class for the embedder (choices: encoder-only-base, encoder-only-m3, decoder-only-base, decoder-only-icl)
- reranker_name_or_path: Path or name of the reranker model (optional)
- reranker_model_class: Model class for the reranker (choices: encoder-only-base, decoder-only-base, decoder-only-layerwise, decoder-only-lightweight)
- embedder_batch_size / reranker_batch_size: Batch sizes for inference (default: 3000)
- embedder_query_max_length / embedder_passage_max_length: Max token lengths (default: 512)
- reranker_max_length: Max length for reranking (default: 512)
- devices: Device configuration for multi-GPU inference
- use_fp16 / use_bf16: Precision configuration
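The model parameters above could be modeled along the same lines. This is a minimal sketch under stated assumptions: `EvalModelArgsSketch` is a hypothetical name, the defaults for the model classes, `devices`, `use_fp16`, and `use_bf16` are placeholders because they are not given above, and the choice validation is added here for illustration rather than taken from the library.

```python
from dataclasses import dataclass
from typing import List, Optional

# Choices as listed in the description above.
EMBEDDER_CLASSES = (
    "encoder-only-base", "encoder-only-m3", "decoder-only-base", "decoder-only-icl",
)
RERANKER_CLASSES = (
    "encoder-only-base", "decoder-only-base",
    "decoder-only-layerwise", "decoder-only-lightweight",
)


@dataclass
class EvalModelArgsSketch:
    """Illustrative stand-in for AbsEvalModelArgs (hypothetical, not the library's source)."""

    embedder_name_or_path: str  # required: no default
    embedder_model_class: Optional[str] = None  # placeholder default (assumption)
    reranker_name_or_path: Optional[str] = None  # optional: retrieval-only if unset
    reranker_model_class: Optional[str] = None  # placeholder default (assumption)
    embedder_batch_size: int = 3000
    reranker_batch_size: int = 3000
    embedder_query_max_length: int = 512
    embedder_passage_max_length: int = 512
    reranker_max_length: int = 512
    devices: Optional[List[str]] = None  # placeholder default (assumption)
    use_fp16: bool = False  # placeholder default (assumption)
    use_bf16: bool = False  # placeholder default (assumption)

    def __post_init__(self) -> None:
        # Validation added by this sketch, based on the documented choices.
        if (self.embedder_model_class is not None
                and self.embedder_model_class not in EMBEDDER_CLASSES):
            raise ValueError(f"unknown embedder class: {self.embedder_model_class!r}")
        if (self.reranker_model_class is not None
                and self.reranker_model_class not in RERANKER_CLASSES):
            raise ValueError(f"unknown reranker class: {self.reranker_model_class!r}")


# Embedder-only configuration; the reranker fields stay unset.
model_args = EvalModelArgsSketch(
    embedder_name_or_path="BAAI/bge-m3",
    embedder_model_class="encoder-only-m3",
)
print(model_args.embedder_batch_size, model_args.reranker_name_or_path)
```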
The framework supports nine evaluation targets: eight named benchmarks (BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, MTEB, and BRIGHT) plus custom datasets.
Usage
Use this configuration before running an evaluation to specify which benchmarks, models, and metrics to use. The two argument dataclasses are typically parsed from command-line arguments using HuggingFace's HfArgumentParser and passed to the AbsEvalRunner.
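To make the parsing step concrete without requiring `transformers`, the sketch below uses the standard-library `argparse` as a stand-in for HfArgumentParser. It shows the general idea of turning dataclass fields (including `nargs="+"` metadata) into command-line flags and splitting the parsed values back into a task object and a model object. `TaskArgs`, `ModelArgs`, `add_dataclass_args`, and `parse_into` are hypothetical names, the dataclasses are trimmed to a few fields, and the `"type"` metadata key is a convention of this sketch only.

```python
import argparse
from dataclasses import MISSING, dataclass, field, fields
from typing import List


@dataclass
class TaskArgs:
    eval_name: str = "beir"
    dataset_names: List[str] = field(default_factory=list, metadata={"nargs": "+"})
    k_values: List[int] = field(
        default_factory=lambda: [1, 3, 5, 10, 100, 1000],
        metadata={"nargs": "+", "type": int},
    )


@dataclass
class ModelArgs:
    embedder_name_or_path: str  # no default, so the flag becomes required
    embedder_batch_size: int = field(default=3000, metadata={"type": int})


def add_dataclass_args(parser: argparse.ArgumentParser, cls) -> None:
    """Register one --flag per dataclass field."""
    for f in fields(cls):
        kwargs = {"type": f.metadata.get("type", str)}
        if "nargs" in f.metadata:
            kwargs["nargs"] = f.metadata["nargs"]
        if f.default is not MISSING:
            kwargs["default"] = f.default
        elif f.default_factory is not MISSING:
            kwargs["default"] = f.default_factory()
        else:
            kwargs["required"] = True
        parser.add_argument(f"--{f.name}", **kwargs)


def parse_into(argv, *classes):
    """Parse argv once, then build one instance per dataclass."""
    parser = argparse.ArgumentParser()
    for cls in classes:
        add_dataclass_args(parser, cls)
    ns = vars(parser.parse_args(argv))
    return tuple(cls(**{f.name: ns[f.name] for f in fields(cls)}) for cls in classes)


task, model = parse_into(
    ["--eval_name", "beir",
     "--dataset_names", "nfcorpus", "scifact",
     "--embedder_name_or_path", "BAAI/bge-m3"],
    TaskArgs, ModelArgs,
)
print(task)
print(model)
```

With HfArgumentParser the equivalent step is usually a single call to `parse_args_into_dataclasses()` on a parser built from both dataclasses; the resulting pair is what would then be handed to the runner.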
Theoretical Basis
Standardized evaluation is essential for reproducibility in information retrieval research. The configuration system separates task configuration from model configuration, enabling mix-and-match evaluation where any supported embedding model can be evaluated against any supported benchmark with any set of metrics. This separation of concerns follows the principle that evaluation infrastructure should be orthogonal to the models being evaluated, allowing researchers to systematically compare models across diverse benchmarks without rewriting evaluation code.
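The mix-and-match property can be illustrated with a small sketch: because task settings and model settings are independent, a full evaluation grid is just their Cartesian product. The function `build_run_configs` and the dictionary layout are hypothetical and exist only for this example; the model and benchmark names are sample values.

```python
from itertools import product
from typing import Dict, List


def build_run_configs(
    embedders: List[str], benchmarks: List[str], metrics: List[str]
) -> List[Dict[str, dict]]:
    """Pair every embedder with every benchmark, keeping the two configs separate."""
    return [
        {
            "task": {"eval_name": benchmark, "eval_metrics": list(metrics)},
            "model": {"embedder_name_or_path": embedder},
        }
        for embedder, benchmark in product(embedders, benchmarks)
    ]


runs = build_run_configs(
    embedders=["BAAI/bge-base-en-v1.5", "BAAI/bge-m3"],
    benchmarks=["beir", "msmarco", "miracl"],
    metrics=["ndcg_at_10", "recall_at_10"],
)
for run in runs:
    print(run["model"]["embedder_name_or_path"], "on", run["task"]["eval_name"])
```

Adding a model or a benchmark extends one list without touching the other, which is the practical payoff of keeping the two configurations orthogonal.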