Principle:FlagOpen FlagEmbedding Evaluation Configuration

Sources	Repo: FlagOpen/FlagEmbedding, Doc: BEIR Benchmark
Domains	Information_Retrieval, Evaluation

Overview

A configuration system that defines evaluation task parameters, model paths, and metric specifications for benchmarking BGE embedding and reranker models.

Description

FlagEmbedding's evaluation framework splits its arguments into two dataclasses, keeping task configuration separate from model configuration:

AbsEvalArgs defines the evaluation task parameters:

eval_name: The name of the evaluation task (e.g., msmarco, beir, miracl)
dataset_dir: Path to the dataset directory or download location
dataset_names: Specific dataset names to evaluate (supports multiple via nargs="+")
splits: Dataset splits to evaluate (default: "test", supports multiple via nargs="+")
output_dir: Path to save search results (default: "./search_results")
search_top_k: Number of top results for retrieval (default: 1000)
rerank_top_k: Number of top results for reranking (default: 100)
k_values: k values for metric computation (default: [1, 3, 5, 10, 100, 1000])
eval_output_method: Output format, either "json" or "markdown" (default: "markdown")
eval_output_path: Path to save evaluation results (default: "./eval_results.md")
eval_metrics: Metrics to compute (default: ["ndcg_at_10", "recall_at_10"])
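
To make the defaults above concrete, here is a minimal sketch of a dataclass that mirrors the listed fields. It is an illustrative stand-in, not the library's AbsEvalArgs: the class name EvalArgsSketch, the None defaults for fields whose defaults are not stated here, and the check() helper are all assumptions. The helper also assumes that metric names follow the `<metric>_at_<k>` pattern seen in the default eval_metrics.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalArgsSketch:
    """Illustrative mirror of the documented task fields (hypothetical class)."""

    # Defaults for the first three fields are not stated on this page; None is an assumption.
    eval_name: Optional[str] = None
    dataset_dir: Optional[str] = None
    dataset_names: Optional[List[str]] = None
    splits: List[str] = field(default_factory=lambda: ["test"])
    output_dir: str = "./search_results"
    search_top_k: int = 1000
    rerank_top_k: int = 100
    k_values: List[int] = field(default_factory=lambda: [1, 3, 5, 10, 100, 1000])
    eval_output_method: str = "markdown"
    eval_output_path: str = "./eval_results.md"
    eval_metrics: List[str] = field(
        default_factory=lambda: ["ndcg_at_10", "recall_at_10"]
    )

    def check(self) -> List[str]:
        """Hypothetical sanity check; returns a list of human-readable problems."""
        problems = []
        if self.eval_output_method not in ("json", "markdown"):
            problems.append(f"unknown output method: {self.eval_output_method}")
        # Assumption: reranking operates on the retrieved candidates,
        # so it cannot use more results than retrieval returned.
        if self.rerank_top_k > self.search_top_k:
            problems.append("rerank_top_k exceeds search_top_k")
        for metric in self.eval_metrics:
            _, sep, k = metric.rpartition("_at_")
            if not sep or not k.isdigit():
                problems.append(f"unrecognized metric name: {metric}")
            elif int(k) not in self.k_values:
                problems.append(f"{metric} needs k={k} in k_values")
        return problems
```

Calling EvalArgsSketch().check() on the defaults returns an empty list, while requesting a metric such as "ndcg_at_20" without adding 20 to k_values is reported as a problem.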

AbsEvalModelArgs defines the model configuration:

embedder_name_or_path: Path or name of the embedding model (required)
embedder_model_class: Model class for the embedder (choices: encoder-only-base, encoder-only-m3, decoder-only-base, decoder-only-icl)
reranker_name_or_path: Path or name of the reranker model (optional)
reranker_model_class: Model class for the reranker (choices: encoder-only-base, decoder-only-base, decoder-only-layerwise, decoder-only-lightweight)
embedder_batch_size / reranker_batch_size: Batch sizes for inference (default: 3000)
embedder_query_max_length / embedder_passage_max_length: Max token lengths (default: 512)
reranker_max_length: Max length for reranking (default: 512)
devices: Device configuration for multi-GPU inference
use_fp16 / use_bf16: Precision configuration
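
The model-side fields can be sketched the same way. The class below is a hypothetical stand-in for AbsEvalModelArgs, not the library class: it enforces the documented choices for the two model-class fields in `__post_init__` and treats None as "not specified", which is an assumption. The devices, use_fp16, and use_bf16 fields are left out because their types and defaults are not stated here.

```python
from dataclasses import dataclass
from typing import Optional

# Choices copied from the field descriptions above.
EMBEDDER_CLASSES = (
    "encoder-only-base",
    "encoder-only-m3",
    "decoder-only-base",
    "decoder-only-icl",
)
RERANKER_CLASSES = (
    "encoder-only-base",
    "decoder-only-base",
    "decoder-only-layerwise",
    "decoder-only-lightweight",
)


@dataclass
class ModelArgsSketch:
    """Illustrative mirror of the documented model fields (hypothetical class)."""

    embedder_name_or_path: str  # required: no default
    embedder_model_class: Optional[str] = None  # None = not specified (assumption)
    reranker_name_or_path: Optional[str] = None  # optional: retrieval-only when None
    reranker_model_class: Optional[str] = None
    embedder_batch_size: int = 3000
    reranker_batch_size: int = 3000
    embedder_query_max_length: int = 512
    embedder_passage_max_length: int = 512
    reranker_max_length: int = 512

    def __post_init__(self) -> None:
        # Reject model classes outside the documented choices.
        if (
            self.embedder_model_class is not None
            and self.embedder_model_class not in EMBEDDER_CLASSES
        ):
            raise ValueError(f"bad embedder_model_class: {self.embedder_model_class}")
        if (
            self.reranker_model_class is not None
            and self.reranker_model_class not in RERANKER_CLASSES
        ):
            raise ValueError(f"bad reranker_model_class: {self.reranker_model_class}")

    @property
    def uses_reranker(self) -> bool:
        """True when a reranker path was supplied."""
        return self.reranker_name_or_path is not None
```

Note that the two choice lists differ: "encoder-only-m3" and "decoder-only-icl" are valid only for the embedder, while "decoder-only-layerwise" and "decoder-only-lightweight" are valid only for the reranker.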

The framework supports nine evaluation targets: eight named benchmarks (BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, MTEB, and BRIGHT) plus custom datasets.

Usage

Use this configuration before running an evaluation to choose which benchmarks, models, and metrics to use. The two argument dataclasses are typically parsed from command-line arguments with Hugging Face's HfArgumentParser and passed to AbsEvalRunner.
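
The pattern of parsing one command line into two dataclasses can be sketched with the standard library alone. This is a simplified stand-in, assuming trimmed, hypothetical TaskArgs and ModelArgs classes; in practice HfArgumentParser builds the parser from the dataclass fields and returns the instances via parse_args_into_dataclasses(). The dataset and model names in the usage lines are example values only.

```python
import argparse
from dataclasses import dataclass, field, fields
from typing import List, Optional, Sequence, Tuple


@dataclass
class TaskArgs:  # hypothetical, trimmed stand-in for AbsEvalArgs
    eval_name: Optional[str] = None
    dataset_names: Optional[List[str]] = None
    splits: List[str] = field(default_factory=lambda: ["test"])
    search_top_k: int = 1000


@dataclass
class ModelArgs:  # hypothetical, trimmed stand-in for AbsEvalModelArgs
    embedder_name_or_path: str
    reranker_name_or_path: Optional[str] = None
    embedder_batch_size: int = 3000


def build_parser() -> argparse.ArgumentParser:
    """One flat parser covering both groups of arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval_name", type=str, default=None)
    # nargs="+" lets a single flag take several values.
    parser.add_argument("--dataset_names", type=str, nargs="+", default=None)
    parser.add_argument("--splits", type=str, nargs="+", default=["test"])
    parser.add_argument("--search_top_k", type=int, default=1000)
    parser.add_argument("--embedder_name_or_path", type=str, required=True)
    parser.add_argument("--reranker_name_or_path", type=str, default=None)
    parser.add_argument("--embedder_batch_size", type=int, default=3000)
    return parser


def parse_into_dataclasses(argv: Sequence[str], *classes: type) -> Tuple:
    """Parse argv once, then route each value to the dataclass that declares it."""
    namespace = vars(build_parser().parse_args(list(argv)))
    instances = []
    for cls in classes:
        names = {f.name for f in fields(cls)}
        instances.append(cls(**{k: v for k, v in namespace.items() if k in names}))
    return tuple(instances)


# Example values only; any benchmark, dataset, or model name could appear here.
task_args, model_args = parse_into_dataclasses(
    [
        "--eval_name", "beir",
        "--dataset_names", "nfcorpus", "scifact",
        "--embedder_name_or_path", "BAAI/bge-m3",
    ],
    TaskArgs,
    ModelArgs,
)
```

Because each value is routed by field name, the task and model groups stay independent: swapping the embedder path changes only model_args, and changing the dataset list changes only task_args.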

Theoretical Basis

Standardized evaluation is essential for reproducibility in information retrieval research. The configuration system separates task configuration from model configuration, enabling mix-and-match evaluation where any supported embedding model can be evaluated against any supported benchmark with any set of metrics. This separation of concerns follows the principle that evaluation infrastructure should be orthogonal to the models being evaluated, allowing researchers to systematically compare models across diverse benchmarks without rewriting evaluation code.

Related Pages

Implementation:FlagOpen_FlagEmbedding_AbsEvalArgs_Configuration
