Implementation:FlagOpen FlagEmbedding LLM Reranker Evaluate

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Reranking, Information_Retrieval, Evaluation
Last Updated	2026-02-09 00:00 GMT

Overview

An evaluation script for reranking models that computes MRR, Recall, nDCG, MAP, and Precision.

Description

This implementation evaluates rerankers on datasets with query-positive-negative triplets:

Workflow:

1. Loads data with queries, positive passages, negative passages, and optional relevance scores
2. Uses FlagReranker to score all query-passage pairs
3. Ranks passages by score
4. Computes standard retrieval metrics at multiple cutoffs (1, 5, 10, 50, 100)
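The scoring and ranking steps of the workflow can be sketched as follows. This is a minimal illustration, not the script's actual code: `rank_passages`, `PairScorer`, and `toy_overlap_scorer` are hypothetical names, and the toy scorer only stands in for a real reranker so the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: takes (query, passage) pairs, returns one score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def rank_passages(
    query: str, passages: List[str], score_pairs: PairScorer
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return passages sorted best-first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    # sorted() is stable, so passages with equal scores keep their input order.
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT a real reranker): counts shared lowercase tokens."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


record = {
    "query": "what is machine learning",
    "pos": ["Machine learning is a field of AI"],
    "neg": ["Python is a programming language", "The weather is sunny"],
}
candidates = record["pos"] + record["neg"]
ranking = rank_passages(record["query"], candidates, toy_overlap_scorer)
```

With FlagEmbedding installed, a call along the lines of `FlagReranker(model_name, use_fp16=True).compute_score(pairs)` could presumably be wrapped and passed as `score_pairs`; the model name and batching details are not specified on this page.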

Metrics:

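MRR@k: Mean Reciprocal Rank - the reciprocal of the rank of the first relevant document within the top-k, averaged over queries
Recall@k: Proportion of all relevant documents that appear in the top-k
nDCG@k: Normalized Discounted Cumulative Gain - rank-discounted gain over the top-k, normalized by the ideal ordering; supports graded relevance
MAP@k: Mean Average Precision - precision measured at the rank of each relevant document in the top-k, averaged per query and then over queries
Precision@k: Proportion of the top-k documents that are relevant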
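To make these definitions concrete, here is a small pure-Python sketch that computes each metric for a single query. It follows one common convention (linear gain with a log2 discount for nDCG, and average precision normalised by the total number of relevant documents); the values produced by pytrec_eval in the script may differ in such details. All function names are illustrative.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranked: List[str], rel: Dict[str, int], k: int) -> float:
    """1 / rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if rel.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], rel: Dict[str, int], k: int) -> float:
    relevant = {d for d, grade in rel.items() if grade > 0}
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def precision_at_k(ranked: List[str], rel: Dict[str, int], k: int) -> float:
    return sum(1 for d in ranked[:k] if rel.get(d, 0) > 0) / k


def average_precision_at_k(ranked: List[str], rel: Dict[str, int], k: int) -> float:
    """Sum of precision at each relevant rank in the top-k / number of relevant docs."""
    num_relevant = sum(1 for grade in rel.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if rel.get(doc_id, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


def ndcg_at_k(ranked: List[str], rel: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        rel.get(d, 0) / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(rel.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# One query: d1 has graded relevance 2, d2 has relevance 1, everything else 0.
ranked_ids = ["d3", "d1", "d4", "d2"]
relevance = {"d1": 2, "d2": 1}
k = 3
rr = reciprocal_rank(ranked_ids, relevance, k)         # first hit at rank 2
rec = recall_at_k(ranked_ids, relevance, k)            # 1 of 2 relevant found
prec = precision_at_k(ranked_ids, relevance, k)        # 1 of 3 retrieved is relevant
ap = average_precision_at_k(ranked_ids, relevance, k)  # (1/2) / 2
ndcg = ndcg_at_k(ranked_ids, relevance, k)
```

Averaging each per-query value over all queries gives the corresponding MRR@k, Recall@k, Precision@k, MAP@k, and nDCG@k.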

Metrics are computed with pytrec_eval. When pos_label_scores is provided, it supplies graded relevance for the positive passages; otherwise each positive defaults to a relevance of 1. The evaluate_mrr() function provides an alternative MRR computation.
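As a rough sketch of how the relevance defaulting might feed pytrec_eval, the function below builds the two nested dictionaries that pytrec_eval works with: `qrels` (query id to document id to integer relevance) and `run` (query id to document id to float score). The function name, the synthetic id scheme, and the choice to leave negatives out of `qrels` are assumptions made for illustration, not details taken from the script.

```python
from typing import Dict, List, Tuple

Qrels = Dict[str, Dict[str, int]]
Run = Dict[str, Dict[str, float]]


def build_qrels_and_run(
    records: List[dict], scores_per_record: List[List[float]]
) -> Tuple[Qrels, Run]:
    """Build qrels/run dicts; scores must be aligned with rec["pos"] + rec["neg"]."""
    qrels: Qrels = {}
    run: Run = {}
    for qi, (rec, scores) in enumerate(zip(records, scores_per_record)):
        qid = f"q{qi}"  # hypothetical id scheme
        pos, neg = rec["pos"], rec.get("neg", [])
        # Graded relevance if given, otherwise every positive counts as 1.
        grades = rec.get("pos_label_scores") or [1] * len(pos)
        if len(grades) != len(pos):
            raise ValueError(f"{qid}: pos_label_scores must match pos in length")
        if len(scores) != len(pos) + len(neg):
            raise ValueError(f"{qid}: expected one score per passage")
        doc_ids = [f"{qid}_d{j}" for j in range(len(pos) + len(neg))]
        # Documents missing from qrels are treated as non-relevant.
        qrels[qid] = {doc_ids[j]: int(g) for j, g in enumerate(grades)}
        run[qid] = {d: float(s) for d, s in zip(doc_ids, scores)}
    return qrels, run


records = [
    {"query": "a", "pos": ["p1"], "neg": ["n1", "n2"], "pos_label_scores": [2]},
    {"query": "b", "pos": ["p1", "p2"], "neg": ["n1"]},  # no graded labels
]
scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
qrels, run = build_qrels_and_run(records, scores)
```

With pytrec_eval installed, these can then be evaluated with `pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "map_cut.10", "recall.10", "P.10"}).evaluate(run)`, which returns per-query values to be averaged.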

Usage

Use this to evaluate reranking models on passage reranking tasks with comprehensive retrieval metrics across multiple cutoffs.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_reranker/evaluate.py
Lines: 1-180

Signature

def evaluate_mrr(predicts, labels, cutoffs)

def main()  # Entry point with Args configuration
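The body of evaluate_mrr is not reproduced on this page. A plausible sketch with the same signature is shown below, assuming `predicts` is a list of ranked item lists (one per query), `labels` is a list of the relevant items for each query, and the result is a dict keyed by `MRR@k`. The function is named `evaluate_mrr_sketch` to make clear that it is an illustration rather than the repository's code.

```python
from typing import Dict, Hashable, List, Sequence


def evaluate_mrr_sketch(
    predicts: Sequence[Sequence[Hashable]],
    labels: Sequence[Sequence[Hashable]],
    cutoffs: List[int],
) -> Dict[str, float]:
    """Mean reciprocal rank of the first relevant item, at several cutoffs."""
    totals = [0.0] * len(cutoffs)
    for ranked, relevant in zip(predicts, labels):
        relevant_set = set(relevant)
        for rank, item in enumerate(ranked, start=1):
            if item in relevant_set:
                # The first hit counts toward every cutoff that reaches this rank.
                for i, cutoff in enumerate(cutoffs):
                    if rank <= cutoff:
                        totals[i] += 1.0 / rank
                break  # only the first relevant item matters for MRR
    n = len(predicts)
    return {
        f"MRR@{cutoff}": (total / n if n else 0.0)
        for cutoff, total in zip(cutoffs, totals)
    }


predicts = [["a", "b", "c"], ["x", "y", "z"]]
labels = [["b"], ["z"]]
mrr = evaluate_mrr_sketch(predicts, labels, cutoffs=[1, 2, 10])
```

In this example the first query's hit is at rank 2 and the second query's at rank 3, so only cutoffs of 2 or more receive a contribution from the first query and only the cutoff of 10 from the second.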

Import

from research.llm_reranker.evaluate import main

I/O Contract

Inputs

Name	Type	Required	Description
input_path	str	Yes	Path to JSONL with query, pos, neg fields
metrics	List[str]	No	Metrics to compute (default: recall, mrr, ndcg, map, precision)
k_values	List[int]	No	Cutoffs for metrics (default: 1, 5, 10, 50, 100)
cache_dir	str	No	Cache directory for reranker model
use_fp16	bool	No	Use FP16 for acceleration (default: True)
batch_size	int	No	Batch size for inference (default: 512)
max_length	int	No	Maximum sequence length (default: 1024)
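The input table maps naturally onto an argument container. The sketch below mirrors the names and defaults listed above using a dataclass and argparse; the class name `EvalArgs`, the `parse_args` helper, and the use of argparse are assumptions, since this page does not show how the script's own Args configuration is defined.

```python
import argparse
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_METRICS = ["recall", "mrr", "ndcg", "map", "precision"]
DEFAULT_K_VALUES = [1, 5, 10, 50, 100]


@dataclass
class EvalArgs:  # hypothetical container mirroring the input table
    input_path: str
    metrics: List[str] = field(default_factory=lambda: list(DEFAULT_METRICS))
    k_values: List[int] = field(default_factory=lambda: list(DEFAULT_K_VALUES))
    cache_dir: Optional[str] = None
    use_fp16: bool = True
    batch_size: int = 512
    max_length: int = 1024


def parse_args(argv: Optional[List[str]] = None) -> EvalArgs:
    parser = argparse.ArgumentParser(description="Reranker evaluation (sketch)")
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--metrics", nargs="+", default=list(DEFAULT_METRICS))
    parser.add_argument("--k_values", nargs="+", type=int, default=list(DEFAULT_K_VALUES))
    parser.add_argument("--cache_dir", default=None)
    # BooleanOptionalAction (Python 3.9+) adds both --use_fp16 and --no-use_fp16.
    parser.add_argument("--use_fp16", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--max_length", type=int, default=1024)
    return EvalArgs(**vars(parser.parse_args(argv)))


example_args = parse_args(["--input_path", "rerank_data.jsonl", "--k_values", "1", "5"])
```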

Outputs

Name	Type	Description
MRR@k	float	Mean Reciprocal Rank at cutoffs
Recall@k	float	Recall at cutoffs
nDCG@k	float	Normalized DCG at cutoffs
MAP@k	float	Mean Average Precision at cutoffs
Precision@k	float	Precision at cutoffs

Usage Examples

# Command line usage
python research/llm_reranker/evaluate.py \
    --input_path rerank_data.jsonl \
    --metrics recall mrr ndcg \
    --k_values 1 5 10 50 100 \
    --use_fp16 \
    --batch_size 512 \
    --max_length 1024

# Data format (rerank_data.jsonl):
# {"query": "what is machine learning",
#  "pos": ["Machine learning is a field of AI..."],
#  "neg": ["Deep learning...", "Python programming...", ...],
#  "pos_label_scores": [2]}  # Optional graded relevance

# Example output (illustrative values; which metrics appear depends on --metrics):
# {'MRR@10': 0.842}
# {'Recall@1': 0.678, 'Recall@5': 0.891, 'Recall@10': 0.945}
# {'NDCG@1': 0.678, 'NDCG@5': 0.823, 'NDCG@10': 0.867}
# {'MAP@10': 0.798}
# {'Precision@10': 0.124}
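Before running an evaluation it can help to check that the input file matches the data format above. The following is a small, hypothetical validation helper (it is not part of the script) that reads JSONL records and verifies the required fields and the length of the optional pos_label_scores.

```python
import io
import json
from typing import Iterable, List


def read_rerank_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL records and check the fields described in the data format."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        for key in ("query", "pos", "neg"):
            if key not in rec:
                raise ValueError(f"line {line_no}: missing field {key!r}")
        grades = rec.get("pos_label_scores")
        if grades is not None and len(grades) != len(rec["pos"]):
            raise ValueError(f"line {line_no}: pos_label_scores must match pos")
        records.append(rec)
    return records


# In-memory sample standing in for rerank_data.jsonl.
sample = io.StringIO(
    '{"query": "what is machine learning", "pos": ["Machine learning is a field of AI"], '
    '"neg": ["Python programming"], "pos_label_scores": [2]}\n'
    '{"query": "what is recall", "pos": ["Recall is a retrieval metric"], "neg": []}\n'
)
loaded = read_rerank_jsonl(sample)
```

For a real file, the same function can be called as `read_rerank_jsonl(open(path, encoding="utf-8"))`.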
