Implementation:FlagOpen FlagEmbedding LLM Reranker Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Reranking, Information_Retrieval, Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for reranking models that computes MRR, Recall, nDCG, MAP, and Precision at multiple cutoffs.
Description
This implementation evaluates rerankers on datasets in which each record pairs a query with positive and negative passages.
Workflow:
1. Loads data with queries, positive passages, negative passages, and optional relevance scores
2. Uses FlagReranker to score all query-passage pairs
3. Ranks passages by score
4. Computes standard retrieval metrics at multiple cutoffs (1, 5, 10, 50, 100)
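The workflow above can be sketched as follows. This is a minimal illustration rather than the script's actual code: `rank_record`, `score_fn`, and `overlap_score` are hypothetical names, and the toy word-overlap scorer stands in for the reranker so the example runs without model weights.

```python
import json
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]

def rank_record(record: Dict, score_fn: Callable[[List[Pair]], List[float]]):
    """Score every (query, passage) pair of one record and return passages best-first."""
    passages = record["pos"] + record["neg"]
    # Step 2: build query-passage pairs and score them in one call.
    pairs = [(record["query"], p) for p in passages]
    scores = score_fn(pairs)
    # Step 3: sort passage indices by descending score.
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]

def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in scorer (assumption): counts words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]

# Step 1: one JSONL line with query, pos, and neg fields.
line = ('{"query": "what is machine learning", '
        '"pos": ["machine learning is a field of AI"], '
        '"neg": ["python is a programming language"]}')
record = json.loads(line)
ranked = rank_record(record, overlap_score)
for passage, score in ranked:
    print(f"{score:.1f}  {passage}")
```

In a real run the scores would come from the reranker (for example, FlagReranker's `compute_score` method applied to the list of pairs) instead of the toy scorer.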
Metrics:
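- MRR@k: Mean Reciprocal Rank - averages, over queries, the reciprocal of the rank of the first relevant document within the top k
- Recall@k: Proportion of relevant documents found in the top k
- nDCG@k: Normalized Discounted Cumulative Gain - graded relevance metric that discounts gains at lower ranks
- MAP@k: Mean Average Precision - averages, over queries, the precision measured at each relevant document's rank within the top k
- Precision@k: Proportion of the top-k documents that are relevant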
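To make the definitions above concrete, here is a small per-query sketch in plain Python. The function name `metrics_at_k` is hypothetical, and the conventions used (linear gain with a log2 discount for nDCG, average precision divided by the total number of relevant documents) are assumptions that may differ in detail from what the script's metric library computes. Averaging these per-query values over all queries gives the reported MRR, MAP, and so on.

```python
import math
from typing import Dict, List

def metrics_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int) -> Dict[str, float]:
    """Per-query metrics at cutoff k; `relevance` maps doc id -> grade (> 0 means relevant)."""
    top_k = ranked_ids[:k]
    hits = [1 if relevance.get(d, 0) > 0 else 0 for d in top_k]
    n_relevant = sum(1 for g in relevance.values() if g > 0)

    # Reciprocal rank of the first relevant document within the top k (0 if none).
    rr = next((1.0 / rank for rank, h in enumerate(hits, 1) if h), 0.0)

    # DCG with linear gain and log2 discount; the ideal DCG sorts grades best-first.
    dcg = sum(relevance.get(d, 0) / math.log2(rank + 1) for rank, d in enumerate(top_k, 1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, 1))

    # Average precision: precision at each relevant hit, summed here and normalized below.
    ap_sum = sum(sum(hits[:rank]) / rank for rank, h in enumerate(hits, 1) if h)

    return {
        "reciprocal_rank": rr,
        "recall": sum(hits) / n_relevant if n_relevant else 0.0,
        "precision": sum(hits) / k,
        "ndcg": dcg / idcg if idcg else 0.0,
        "average_precision": ap_sum / n_relevant if n_relevant else 0.0,
    }

# Two relevant documents (grades 2 and 1) ranked second and third.
result = metrics_at_k(["d3", "d1", "d2", "d4"], {"d1": 2, "d2": 1}, k=3)
print(result)
```

In this example the first relevant document sits at rank 2, so the reciprocal rank is 0.5; both relevant documents fall inside the top 3, so recall is 1.0 and precision is 2/3.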
Metrics are computed with pytrec_eval. When pos_label_scores is provided, it supplies graded relevance for the positive passages; otherwise each positive passage defaults to a relevance of 1. The evaluate_mrr() function provides an alternative MRR computation.
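The relevance defaulting described above might look like the sketch below. The helper name `build_qrels_and_run` and the `q0`/`d0` id scheme are hypothetical, and the sketch assumes scores are listed positives first, then negatives. The nested dictionaries (query id to passage id to integer grade, and query id to passage id to float score) follow the layout that pytrec_eval's `RelevanceEvaluator` accepts for qrels and runs.

```python
from typing import Dict, List, Tuple

def build_qrels_and_run(
    records: List[dict], all_scores: List[List[float]]
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, Dict[str, float]]]:
    """Build graded labels (qrels) and model scores (run), keyed by query and passage id."""
    qrels: Dict[str, Dict[str, int]] = {}
    run: Dict[str, Dict[str, float]] = {}
    for q_idx, (rec, scores) in enumerate(zip(records, all_scores)):
        qid = f"q{q_idx}"
        # Use graded labels when present; otherwise every positive gets grade 1.
        pos_grades = rec.get("pos_label_scores") or [1] * len(rec["pos"])
        # Negatives always get grade 0.
        grades = list(pos_grades) + [0] * len(rec["neg"])
        qrels[qid] = {f"d{i}": int(g) for i, g in enumerate(grades)}
        run[qid] = {f"d{i}": float(s) for i, s in enumerate(scores)}
    return qrels, run

records = [
    {"query": "q", "pos": ["a"], "neg": ["b", "c"], "pos_label_scores": [2]},
    {"query": "r", "pos": ["x"], "neg": ["y"]},  # no graded labels: defaults to 1
]
all_scores = [[0.9, 0.1, 0.3], [0.2, 0.8]]
qrels, run = build_qrels_and_run(records, all_scores)
print(qrels)
```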
Usage
Use this to evaluate reranking models on passage reranking tasks with comprehensive retrieval metrics across multiple cutoffs.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_reranker/evaluate.py
- Lines: 1-180
Signature
def evaluate_mrr(predicts, labels, cutoffs)
def main() # Entry point with Args configuration
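The body of evaluate_mrr is not shown here, so the following is only one plausible reading of its signature, given under a different name (`evaluate_mrr_sketch`) to avoid suggesting it is the original. It assumes `predicts` holds one ranked list of passage ids per query and `labels` holds the relevant ids for the same queries.

```python
from typing import Dict, List, Sequence

def evaluate_mrr_sketch(
    predicts: List[Sequence], labels: List[Sequence], cutoffs: List[int]
) -> Dict[str, float]:
    """Mean reciprocal rank of the first relevant item, reported at each cutoff."""
    totals = [0.0] * len(cutoffs)
    for pred, label in zip(predicts, labels):
        relevant = set(label)
        for rank, item in enumerate(pred, start=1):
            if item in relevant:
                # Only cutoffs that reach this rank receive credit.
                for j, cutoff in enumerate(cutoffs):
                    if rank <= cutoff:
                        totals[j] += 1.0 / rank
                break  # only the first relevant item counts
    n = len(predicts)
    return {f"MRR@{c}": (t / n if n else 0.0) for c, t in zip(cutoffs, totals)}

# First query: relevant item at rank 2. Second query: relevant item at rank 3.
mrr = evaluate_mrr_sketch(
    predicts=[["a", "b", "c"], ["x", "y", "z"]],
    labels=[["b"], ["z"]],
    cutoffs=[1, 2, 10],
)
print(mrr)
```

Here MRR@1 is 0 because neither query has a relevant item at rank 1, MRR@2 is 0.25 because only the first query contributes 1/2, and MRR@10 averages 1/2 and 1/3.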
Import
from research.llm_reranker.evaluate import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_path | str | Yes | Path to a JSONL file with query, pos, and neg fields (pos_label_scores is optional) |
| metrics | List[str] | No | Metrics to compute (default: recall, mrr, ndcg, map, precision) |
| k_values | List[int] | No | Cutoffs for metrics (default: 1, 5, 10, 50, 100) |
| cache_dir | str | No | Cache directory for reranker model |
| use_fp16 | bool | No | Use FP16 for acceleration (default: True) |
| batch_size | int | No | Batch size for inference (default: 512) |
| max_length | int | No | Maximum sequence length (default: 1024) |
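The inputs in the table can be mirrored with a small argument parser. This is an illustrative sketch using argparse, and the actual script may define its Args configuration differently. `build_parser` is a hypothetical name, the `None` default for `cache_dir` is an assumption (the table gives no default), and `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser whose options and defaults follow the inputs table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Reranker evaluation options (sketch)")
    parser.add_argument("--input_path", type=str, required=True)
    parser.add_argument("--metrics", nargs="+",
                        default=["recall", "mrr", "ndcg", "map", "precision"])
    parser.add_argument("--k_values", nargs="+", type=int, default=[1, 5, 10, 50, 100])
    parser.add_argument("--cache_dir", type=str, default=None)  # assumed default
    # Adds both --use_fp16 and --no-use_fp16; defaults to True as in the table.
    parser.add_argument("--use_fp16", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--max_length", type=int, default=1024)
    return parser

args = build_parser().parse_args(
    ["--input_path", "rerank_data.jsonl", "--metrics", "recall", "mrr", "ndcg"]
)
print(args)
```

Options that are not passed fall back to the defaults listed in the table, so only `input_path` has to be supplied.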
Outputs
| Name | Type | Description |
|---|---|---|
| MRR@k | float | Mean Reciprocal Rank at cutoffs |
| Recall@k | float | Recall at cutoffs |
| nDCG@k | float | Normalized DCG at cutoffs |
| MAP@k | float | Mean Average Precision at cutoffs |
| Precision@k | float | Precision at cutoffs |
Usage Examples
# Command line usage
python research/llm_reranker/evaluate.py \
    --input_path rerank_data.jsonl \
    --metrics recall mrr ndcg map precision \
    --k_values 1 5 10 50 100 \
    --use_fp16 \
    --batch_size 512 \
    --max_length 1024
# Data format (rerank_data.jsonl):
# {"query": "what is machine learning",
# "pos": ["Machine learning is a field of AI..."],
# "neg": ["Deep learning...", "Python programming...", ...],
# "pos_label_scores": [2]} # Optional graded relevance
# Results:
# {'MRR@10': 0.842}
# {'Recall@1': 0.678, 'Recall@5': 0.891, 'Recall@10': 0.945}
# {'NDCG@1': 0.678, 'NDCG@5': 0.823, 'NDCG@10': 0.867}
# {'MAP@10': 0.798}
# {'Precision@10': 0.124}