Implementation:FlagOpen FlagEmbedding BGE Eval MSMARCO
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Information_Retrieval, Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for dense retrieval models on the MS MARCO passage ranking dataset using FAISS indexing.
Description
This implementation provides a complete evaluation pipeline for embedding-based retrieval systems on MS MARCO. It includes three main functions:
index() encodes the entire corpus into dense embeddings with a FlagModel and builds a FAISS index over them from an index factory string (default "Flat"). It can optionally save the embeddings to disk as a memory-mapped array, which helps with large corpora, and it supports GPU acceleration.
search() encodes queries and performs similarity search through the FAISS index to retrieve top-k candidate passages. It processes queries in batches and returns both scores and indices of retrieved passages.
evaluate() computes standard retrieval metrics: MRR (Mean Reciprocal Rank), Recall, AUC, and nDCG, at cutoffs that default to 1, 10, and 100. It uses sklearn for the AUC and nDCG calculations and returns a dictionary keyed by metric name and cutoff, such as 'MRR@10'.
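The two rank-based metrics can be illustrated with a minimal pure-Python sketch. The function name mrr_and_recall is hypothetical and is not part of eval_msmarco.py. Recall is normalized here by the number of positives, which is an assumption; the script's exact normalization may differ.

```python
from typing import Dict, Sequence


def mrr_and_recall(
    preds: Sequence[Sequence[str]],
    labels: Sequence[Sequence[str]],
    cutoffs: Sequence[int] = (1, 10, 100),
) -> Dict[str, float]:
    """Average MRR@k and Recall@k over all queries (illustrative definitions)."""
    metrics: Dict[str, float] = {}
    num_queries = len(preds)
    for cutoff in cutoffs:
        rr_sum = 0.0
        recall_sum = 0.0
        for pred, label in zip(preds, labels):
            positives = set(label)
            top = list(pred[:cutoff])
            # Reciprocal rank of the first relevant passage within the cutoff.
            for rank, passage in enumerate(top, start=1):
                if passage in positives:
                    rr_sum += 1.0 / rank
                    break
            # Fraction of the positives that appear within the cutoff
            # (assumed normalization: divide by the number of positives).
            if positives:
                hits = len(positives.intersection(top))
                recall_sum += hits / len(positives)
        metrics[f"MRR@{cutoff}"] = rr_sum / num_queries
        metrics[f"Recall@{cutoff}"] = recall_sum / num_queries
    return metrics


# Toy data: two queries, ranked predictions, and their positives.
example_preds = [["p3", "p1", "p7"], ["p2", "p9", "p4"]]
example_labels = [["p1"], ["p4", "p5"]]
example_metrics = mrr_and_recall(example_preds, example_labels, cutoffs=(1, 3))
```

In the toy data, the first query finds its positive at rank 2 and the second at rank 3, so neither contributes to MRR@1 but both contribute to MRR@3.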
Usage
Use this for evaluating dense retrieval models on MS MARCO or similar passage ranking benchmarks where you need to measure retrieval quality across multiple metrics.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/baai_general_embedding/finetune/eval_msmarco.py
- Lines: 1-266
Signature
def index(model: FlagModel, corpus: datasets.Dataset, batch_size: int = 256,
          max_length: int = 512, index_factory: str = "Flat",
          save_path: str = None, save_embedding: bool = False,
          load_embedding: bool = False)
def search(model: FlagModel, queries: datasets, faiss_index: faiss.Index,
           k: int = 100, batch_size: int = 256, max_length: int = 512)
def evaluate(preds, preds_scores, labels, cutoffs=[1, 10, 100])
Import
from research.baai_general_embedding.finetune.eval_msmarco import index, search, evaluate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | FlagModel | Yes | Embedding model for encoding queries and passages |
| corpus | datasets.Dataset | Yes | Dataset containing corpus documents with 'content' field |
| queries | datasets.Dataset | Yes | Dataset containing queries with 'query' field |
| faiss_index | faiss.Index | Yes | FAISS index built from corpus embeddings |
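| preds | List[List] | Yes | Predicted passages for each query in rank order (IDs or text, matching the representation used in labels) |
| preds_scores | np.ndarray | Yes | Similarity scores for the predicted passages |
| labels | List[List] | Yes | Ground-truth positive passages for each query (IDs or text, matching preds) |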
Outputs
| Name | Type | Description |
|---|---|---|
| faiss_index | faiss.Index | FAISS index for similarity search |
| scores | np.ndarray | Similarity scores, shape (num_queries, k) |
| indices | np.ndarray | Retrieved passage indices, shape (num_queries, k) |
| metrics | Dict | Dictionary with MRR, Recall, AUC, and nDCG metrics at various cutoffs |
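The indices output holds row positions in the corpus, while evaluate() compares passages against labels, so the indices have to be mapped back to corpus entries first. The sketch below shows that step on plain Python lists. The helper name indices_to_passages and the toy data are hypothetical. It relies on the FAISS convention of padding a row with -1 when fewer than k neighbors are found.

```python
from typing import List, Sequence


def indices_to_passages(
    indices: Sequence[Sequence[int]],
    corpus_contents: Sequence[str],
) -> List[List[str]]:
    """Map retrieved row indices to passage text, skipping -1 padding."""
    results: List[List[str]] = []
    for row in indices:
        # FAISS pads a row with -1 when fewer than k neighbors are found.
        results.append([corpus_contents[i] for i in row if i != -1])
    return results


# Toy corpus and toy search output with k = 3.
toy_corpus = ["passage a", "passage b", "passage c"]
toy_indices = [[2, 0, -1], [1, -1, -1]]
toy_preds = indices_to_passages(toy_indices, toy_corpus)
```

Filtering before the lookup matters because -1 is a valid Python index and would otherwise silently return the last corpus entry.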
Usage Examples
import datasets
from FlagEmbedding import FlagModel
from research.baai_general_embedding.finetune.eval_msmarco import index, search, evaluate
# Load model and data
model = FlagModel("BAAI/bge-base-en-v1.5", use_fp16=True)
corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
# Build index
faiss_index = index(
    model=model,
    corpus=corpus,
    batch_size=256,
    max_length=128,
    index_factory="Flat"
)
# Search
scores, indices = search(
    model=model,
    queries=eval_data,
    faiss_index=faiss_index,
    k=100,
    batch_size=256
)
# Evaluate
retrieval_results = []
for indice in indices:
    # Drop the -1 entries that pad rows with fewer than k results
    indice = indice[indice != -1].tolist()
    retrieval_results.append(corpus[indice]["content"])
ground_truths = [sample["positive"] for sample in eval_data]
metrics = evaluate(retrieval_results, scores, ground_truths)
print(metrics)
# Illustrative output format (actual values depend on the model):
# {'MRR@10': 0.351, 'Recall@10': 0.583, 'nDCG@10': 0.412, ...}
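
For intuition about index_factory="Flat": a FAISS flat index performs exhaustive (exact) search over all stored vectors. The sketch below reproduces that behavior in pure Python without FAISS. The function name flat_search is hypothetical, inner product is assumed as the similarity, and padding the scores with negative infinity is a choice made for this sketch only.

```python
import heapq
from typing import List, Sequence, Tuple


def flat_search(
    query_vecs: Sequence[Sequence[float]],
    corpus_vecs: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Exhaustive inner-product search returning (scores, indices) per query."""
    all_scores: List[List[float]] = []
    all_indices: List[List[int]] = []
    for q in query_vecs:
        # Score every corpus vector against the query (no approximation).
        scored = [
            (sum(a * b for a, b in zip(q, d)), i)
            for i, d in enumerate(corpus_vecs)
        ]
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        # Pad to length k: -1 for indices, -inf for scores (sketch-only choice).
        pad = k - len(top)
        all_scores.append([s for s, _ in top] + [float("-inf")] * pad)
        all_indices.append([i for _, i in top] + [-1] * pad)
    return all_scores, all_indices


# Toy data: three 2-d corpus vectors and one query.
toy_corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query_vecs = [[1.0, 0.0]]
flat_scores, flat_indices = flat_search(toy_query_vecs, toy_corpus_vecs, k=2)
```

Exhaustive search costs time proportional to the corpus size for every query, which is why approximate index factories are worth considering for large corpora, at some cost in recall.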