

Implementation:FlagOpen FlagEmbedding LLM Embedder EvalNQ

From Leeroopedia
Revision as of 14:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_LLM_Embedder_EvalNQ.md)


Knowledge Sources
Domains Question_Answering, Natural_Questions, Answer_Presence_Detection
Last Updated 2026-02-09 00:00 GMT

Overview

Specialized Natural Questions evaluation computing recall based on answer string presence in retrieved passages.

Description

This module evaluates retrieval quality for Natural Questions by checking whether any reference answer appears in the retrieved documents, using normalized, case-insensitive token matching:

SimpleTokenizer performs Unicode-aware tokenization, splitting text into alphanumeric sequences and individual non-whitespace characters. This enables language-agnostic text processing with proper handling of diacritics and special characters.
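The tokenization described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the function name `simple_tokenize` and the `uncased` parameter are hypothetical, and the use of Unicode general categories (letters, numbers, and combining marks form a run; separators and control characters are dropped) is an assumption about what "alphanumeric sequences and non-whitespace characters" means in practice.

```python
import unicodedata


def simple_tokenize(text, uncased=False):
    """Split text into alphanumeric runs and single non-whitespace symbols.

    Hypothetical helper for illustration; not the SimpleTokenizer class itself.
    """
    tokens, run = [], []
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major in "LNM":           # letter, number, or combining mark
            run.append(ch)
            continue
        if run:                      # a non-alphanumeric char closes the run
            tokens.append("".join(run))
            run = []
        if major not in "ZC":        # keep symbols, drop separators/control chars
            tokens.append(ch)
    if run:
        tokens.append("".join(run))
    return [t.lower() for t in tokens] if uncased else tokens


print(simple_tokenize("Who wrote Hamlet?", uncased=True))
# ['who', 'wrote', 'hamlet', '?']
```

Treating combining marks as part of a run keeps a decomposed character such as "e" plus U+0301 inside one token, which matters once text has been NFD-normalized.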

has_answer() checks whether any answer variant appears in a document by:

  1. Normalizing both text and answer using Unicode NFD decomposition
  2. Tokenizing with case-insensitive matching
  3. Searching for exact token sequence matches, allowing answers to appear anywhere in the document
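The three steps can be sketched as follows. This is an illustrative approximation rather than the module's implementation: `has_answer_sketch` and `tokenize` are hypothetical names, and the regex tokenizer is a stand-in that only approximates the tokenizer described above.

```python
import re
import unicodedata


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercased word runs plus single symbols.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def has_answer_sketch(answers, text, tokenize=tokenize):
    """Return True if any answer occurs in text as a contiguous token sequence."""
    # Step 1 and 2: normalize to NFD, then tokenize case-insensitively.
    text_tokens = tokenize(unicodedata.normalize("NFD", text))
    for answer in answers:
        answer_tokens = tokenize(unicodedata.normalize("NFD", answer))
        n = len(answer_tokens)
        if n == 0:
            continue  # an empty answer can never match
        # Step 3: slide a window of n tokens over the document.
        for i in range(len(text_tokens) - n + 1):
            if text_tokens[i:i + n] == answer_tokens:
                return True
    return False


doc = "Hamlet is a tragedy written by William Shakespeare."
print(has_answer_sketch(["william shakespeare"], doc))  # True
print(has_answer_sketch(["Shakes"], doc))               # False
```

Matching on tokens rather than raw substrings is why "Shakes" does not match "Shakespeare": the answer must line up with whole tokens.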

evaluate_nq() computes relaxed recall at multiple cutoffs (1, 5, 10, 20, 100) by:

  • Finding the first retrieved document containing a correct answer for each query
  • Crediting every cutoff at or above k when the first hit occurs at rank k
  • Computing recall at each cutoff as the proportion of queries whose first hit falls at or before that cutoff
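The aggregation in the list above can be sketched as a small function. It assumes the first-hit rank of each query has already been found; the name `relaxed_recall` and the use of `None` for "no retrieved passage contains an answer" are illustrative choices, not taken from the module.

```python
def relaxed_recall(first_hit_ranks, cutoffs=(1, 5, 10, 20, 100)):
    """Compute recall@k from 1-based first-hit ranks (None means no hit)."""
    n = len(first_hit_ranks)
    if n == 0:
        return {f"recall@{k}": 0.0 for k in cutoffs}
    return {
        # A query counts at cutoff k when its first hit is at rank k or better.
        f"recall@{k}": sum(r is not None and r <= k for r in first_hit_ranks) / n
        for k in cutoffs
    }


# Four queries: hits at ranks 1, 3, and 50, and one query with no hit.
print(relaxed_recall([1, 3, None, 50]))
# {'recall@1': 0.25, 'recall@5': 0.5, 'recall@10': 0.5, 'recall@20': 0.5, 'recall@100': 0.75}
```

Because a hit at rank k counts toward every larger cutoff, the values are non-decreasing as the cutoff grows.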

This relaxed metric suits settings where finding the answer anywhere in the top k is sufficient. Unlike strict ranking metrics, it does not depend on which specific passage carries the answer.

Usage

Use this for evaluating retrieval systems on Natural Questions where the goal is finding documents containing answer strings rather than ranking specific passages.

Code Reference

Source Location

Signature

class SimpleTokenizer:
    def tokenize(self, text, uncase=False)

def has_answer(answers, text, tokenizer) -> bool

def evaluate_nq(retrieval_result: dict, eval_data: datasets.Dataset,
                corpus: datasets.Dataset, num_workers=16, batch_size=16,
                cache_dir=None)

Import

from research.llm_embedder.src.retrieval.evalnq import evaluate_nq

I/O Contract

Inputs

Name              Type         Required  Description
retrieval_result  dict         Yes       Dict mapping query index to list of passage indices
eval_data         str/Dataset  Yes       Queries with "answers" field (list of strings)
corpus            str/Dataset  Yes       Documents with "content" field
num_workers       int          No        Number of worker processes (default: 16)
batch_size        int          No        Batch size for evaluation (default: 16)

Outputs

Name        Type   Description
recall@1    float  Recall at rank 1
recall@5    float  Recall at rank 5
recall@10   float  Recall at rank 10
recall@20   float  Recall at rank 20
recall@100  float  Recall at rank 100

Usage Examples

import datasets
from research.llm_embedder.src.retrieval.evalnq import evaluate_nq

# Load data
eval_data = datasets.load_dataset("json", data_files="nq-test.json", split="train")
corpus = datasets.load_dataset("json", data_files="nq-corpus.json", split="train")

# Retrieval results: query_idx -> list of passage indices
retrieval_result = {
    0: [42, 153, 789, 12, ...],  # Top 100 passages for query 0
    1: [234, 567, 89, ...],       # Top 100 passages for query 1
    # ...
}

# Evaluate
metrics = evaluate_nq(
    retrieval_result=retrieval_result,
    eval_data=eval_data,
    corpus=corpus,
    num_workers=16,
    batch_size=16
)

print(metrics)
# {'recall@1': 0.423, 'recall@5': 0.631, 'recall@10': 0.712,
#  'recall@20': 0.778, 'recall@100': 0.852}

# Example data format:
# eval_data: {"query": "who wrote hamlet", "answers": ["William Shakespeare", "Shakespeare"]}
# corpus: {"content": "Hamlet is a tragedy written by William Shakespeare..."}
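To make the data shapes concrete without any files, the following toy sketch resolves passage indices to corpus rows and finds the first answer-bearing rank per query. It uses plain lists of dicts in place of datasets.Dataset, a lowercase substring check as a stand-in for the token-level match, and a hypothetical helper name `first_hit_ranks`; none of these are the module's own code.

```python
# Toy stand-ins for the evaluation set and corpus (plain lists, not datasets.Dataset).
eval_rows = [
    {"query": "who wrote hamlet", "answers": ["William Shakespeare", "Shakespeare"]},
    {"query": "capital of france", "answers": ["Paris"]},
]
corpus_rows = [
    {"content": "Hamlet is a tragedy written by William Shakespeare."},
    {"content": "Berlin is the capital of Germany."},
    {"content": "Macbeth is another play by Shakespeare."},
]
# Query index -> ranked passage indices into corpus_rows.
retrieval_result = {0: [1, 0, 2], 1: [1, 2]}


def first_hit_ranks(retrieval_result, eval_rows, corpus_rows):
    """Return {query_idx: 1-based rank of first passage containing an answer, or None}."""
    ranks = {}
    for query_idx, passage_indices in retrieval_result.items():
        answers = [a.lower() for a in eval_rows[query_idx]["answers"]]
        ranks[query_idx] = None
        for rank, passage_idx in enumerate(passage_indices, start=1):
            content = corpus_rows[passage_idx]["content"].lower()
            if any(a in content for a in answers):  # substring stand-in (assumption)
                ranks[query_idx] = rank
                break
    return ranks


ranks = first_hit_ranks(retrieval_result, eval_rows, corpus_rows)
print(ranks)  # {0: 2, 1: None}
```

Query 0 gets its first hit at rank 2, so it would count toward recall@5 and above but not recall@1; query 1 has no hit and counts toward none.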
