Implementation:FlagOpen FlagEmbedding LLM Embedder EvalNQ
| Knowledge Sources | |
|---|---|
| Domains | Question_Answering, Natural_Questions, Answer_Presence_Detection |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A specialized evaluation for Natural Questions that computes recall from the presence of answer strings in retrieved passages.
Description
This module evaluates retrieval quality on Natural Questions by checking whether any reference answer appears in the retrieved documents. Matching operates on normalized, case-insensitive token sequences rather than on raw strings:
SimpleTokenizer performs Unicode-aware tokenization separating alphanumeric sequences and non-whitespace characters, enabling language-agnostic text processing with proper handling of diacritics and special characters.
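The tokenization described above can be sketched roughly as follows. This is a minimal approximation, not the module's code: it uses the standard-library `re` module, and the pattern and the name `simple_tokenize` are assumptions made for illustration. Unlike the behavior described for SimpleTokenizer, this pattern treats a decomposed combining mark as its own token.

```python
import re

# Assumed pattern: a run of word characters, or any single character
# that is neither a word character nor whitespace (e.g. punctuation).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text, uncase=False):
    """Split text into word-like tokens and standalone symbol tokens."""
    tokens = _TOKEN_RE.findall(text)
    # Lowercase every token when case-insensitive matching is requested.
    return [t.lower() for t in tokens] if uncase else tokens


print(simple_tokenize("Shakespeare's Hamlet (c. 1600)", uncase=True))
# ['shakespeare', "'", 's', 'hamlet', '(', 'c', '.', '1600', ')']
```

Splitting punctuation into separate tokens means an answer such as "Hamlet" still matches when the document writes "Hamlet," or "(Hamlet)".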
has_answer() checks whether any answer variant appears in a document by:
1. Normalizing both the document text and the answer with Unicode NFD decomposition
2. Tokenizing both case-insensitively
3. Searching for an exact, contiguous match of the answer's token sequence at any position in the document
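The three steps above can be sketched as a small self-contained function. This is an illustration under stated assumptions rather than the module's implementation: `contains_answer` and `_tokens` are hypothetical names, and the tokenization pattern is a simplified stand-in built on the standard-library `re` module.

```python
import re
import unicodedata

# Assumed, simplified token pattern (word runs or single symbols).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def _tokens(text):
    # Steps 1 and 2: NFD-normalize, lowercase, then tokenize.
    return _TOKEN_RE.findall(unicodedata.normalize("NFD", text).lower())


def contains_answer(answers, text):
    """Return True if any answer's token sequence occurs contiguously in text."""
    doc = _tokens(text)
    for answer in answers:
        ans = _tokens(answer)
        if not ans:
            continue  # skip empty answers instead of matching trivially
        n = len(ans)
        # Step 3: slide a window of n tokens over the document.
        if any(doc[i:i + n] == ans for i in range(len(doc) - n + 1)):
            return True
    return False


doc = "Hamlet is a tragedy written by William Shakespeare."
print(contains_answer(["william shakespeare"], doc))  # True
print(contains_answer(["Shake"], doc))                # False
```

Because matching is done on whole tokens, the partial string "Shake" does not match "Shakespeare", which a plain substring test would accept.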
evaluate_nq() computes relaxed recall at multiple cutoffs (1, 5, 10, 20, 100) by:
- Finding the first retrieved document containing a correct answer for each query
- Crediting that query at every cutoff at or beyond the rank of its first hit
- Computing recall as the proportion of queries with answers found by rank k
Because any passage containing an answer counts as a hit, this metric is more lenient than strict ranking metrics tied to specific passages. It reflects settings where finding the answer anywhere in the top k is sufficient.
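The recall computation described above can be sketched as follows. This is a minimal illustration, assuming the rank of the first answer-bearing passage has already been found for each query; the function name `relaxed_recall` and its input format are hypothetical and are not taken from the module.

```python
def relaxed_recall(first_hit_ranks, cutoffs=(1, 5, 10, 20, 100)):
    """Compute recall@k from per-query first-hit ranks.

    first_hit_ranks holds one entry per query: the 1-based rank of the
    first retrieved passage containing an answer, or None if none does.
    """
    num_queries = len(first_hit_ranks)
    metrics = {}
    for k in cutoffs:
        # A query counts at cutoff k when its first hit is at rank <= k.
        hits = sum(1 for r in first_hit_ranks if r is not None and r <= k)
        metrics[f"recall@{k}"] = hits / num_queries if num_queries else 0.0
    return metrics


# Four queries: first hits at ranks 1, 7 and 3, and one query with no hit.
print(relaxed_recall([1, 7, None, 3]))
# {'recall@1': 0.25, 'recall@5': 0.5, 'recall@10': 0.75, 'recall@20': 0.75, 'recall@100': 0.75}
```

Recall is non-decreasing in k under this definition, since a query credited at one cutoff is credited at every larger cutoff.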
Usage
Use this for evaluating retrieval systems on Natural Questions where the goal is finding documents containing answer strings rather than ranking specific passages.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/src/retrieval/evalnq.py
- Lines: 1-121
Signature
class SimpleTokenizer:
    def tokenize(self, text, uncase=False)

def has_answer(answers, text, tokenizer) -> bool

def evaluate_nq(retrieval_result: dict, eval_data: datasets.Dataset,
                corpus: datasets.Dataset, num_workers=16, batch_size=16,
                cache_dir=None)
Import
from research.llm_embedder.src.retrieval.evalnq import evaluate_nq
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| retrieval_result | dict | Yes | Dict mapping query index to list of passage indices |
| eval_data | str/Dataset | Yes | Queries with "answers" field (list of strings) |
| corpus | str/Dataset | Yes | Documents with "content" field |
| num_workers | int | No | Number of worker processes (default: 16) |
| batch_size | int | No | Batch size for evaluation (default: 16) |
Outputs
| Name | Type | Description |
|---|---|---|
| recall@1 | float | Recall at rank 1 |
| recall@5 | float | Recall at rank 5 |
| recall@10 | float | Recall at rank 10 |
| recall@20 | float | Recall at rank 20 |
| recall@100 | float | Recall at rank 100 |
Usage Examples
import datasets
from research.llm_embedder.src.retrieval.evalnq import evaluate_nq
# Load data
eval_data = datasets.load_dataset("json", data_files="nq-test.json", split="train")
corpus = datasets.load_dataset("json", data_files="nq-corpus.json", split="train")
# Retrieval results: query_idx -> list of passage indices
retrieval_result = {
    0: [42, 153, 789, 12, ...],  # Top 100 passages for query 0
    1: [234, 567, 89, ...],      # Top 100 passages for query 1
    # ...
}
# Evaluate
metrics = evaluate_nq(
    retrieval_result=retrieval_result,
    eval_data=eval_data,
    corpus=corpus,
    num_workers=16,
    batch_size=16,
)
print(metrics)
# Example output:
# {'recall@1': 0.423, 'recall@5': 0.631, 'recall@10': 0.712,
#  'recall@20': 0.778, 'recall@100': 0.852}
# Example data format:
# eval_data: {"query": "who wrote hamlet", "answers": ["William Shakespeare", "Shakespeare"]}
# corpus: {"content": "Hamlet is a tragedy written by William Shakespeare..."}