

Implementation:FlagOpen FlagEmbedding LLM Embedder Eval QA

From Leeroopedia


Knowledge Sources
Domains Question_Answering, Natural_Questions, Retrieval_Augmented_Generation
Last Updated 2026-02-09 00:00 GMT

Overview

Evaluation pipeline for retrieval-augmented generation on open-domain QA datasets (Natural Questions, TriviaQA), scored by exact-match accuracy.

Description

This implementation evaluates retrieval-augmented QA systems on standard open-domain datasets in two phases:

Retrieval phase: Uses dense encoders or BM25 to retrieve top-k relevant passages from a corpus given a question. The retrieved passages serve as knowledge context for answer generation.
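The dense variant of the retrieval phase can be sketched as below. This is a minimal illustration, not the code in eval_qa.py: it assumes the question and passages are already embedded as plain lists of floats and that relevance is the inner product. The function name top_k_passages and the toy vectors are hypothetical.

```python
import heapq


def top_k_passages(query_vec, passage_vecs, k=3):
    """Return the indices of the k passages with the highest inner-product score."""
    scores = [
        sum(q * p for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    # Pick the k best-scoring passage indices, highest score first.
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


# Toy 2-dimensional embeddings (hypothetical values for illustration only).
query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
top = top_k_passages(query, passages, k=2)
```

In a real run the vectors would come from the encoder passed as query_encoder, and a corpus of realistic size would normally be searched with a vector index rather than a Python loop.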

Generation phase: Formats prompts with:

  • Retrieved knowledge passages (up to key_num)
  • Few-shot examples (randomly sampled from training data)
  • The test question

The LLM generates short answers, which are scored by exact match after text normalization (lowercasing, article removal, punctuation handling).
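Exact match with normalization can be sketched as follows. The exact rules used by evaluate_qa() are not shown on this page, so this sketch assumes the common SQuAD-style convention (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). The helper names normalize_answer, exact_match, and exact_match_accuracy are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """Return 1.0 if the prediction equals any reference answer after normalization."""
    normalized = normalize_answer(prediction)
    return float(any(normalized == normalize_answer(a) for a in answers))


def exact_match_accuracy(predictions, answer_lists):
    """Average exact match over all (prediction, reference answers) pairs."""
    scores = [exact_match(p, a) for p, a in zip(predictions, answer_lists)]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each question can have several acceptable answers, a prediction counts as correct when it matches any one of them.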

The process_qa() function handles context window management, truncating passages to fit while preserving few-shot examples and the query. The evaluate_qa() function computes exact match by normalizing both predictions and ground truth answers.
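The truncation strategy described above can be illustrated with a small sketch. It is an assumption-laden stand-in for process_qa(): whitespace splitting replaces the real tokenizer, the prompt template is invented for illustration, and count_tokens and build_prompt are hypothetical names. The point it shows is the budget order: few-shot examples and the question are reserved first, and only the passages are cut.

```python
def count_tokens(text):
    """Whitespace split stands in for a real tokenizer (assumption)."""
    return len(text.split())


def build_prompt(passages, few_shot_examples, question,
                 context_max_length=2048, key_num=3):
    """Assemble passages, few-shot examples, and the question within a token budget."""
    few_shot_block = "\n".join(few_shot_examples)
    question_block = f"Question: {question}\nAnswer:"

    # Reserve room for the few-shot examples and the question first;
    # whatever is left is the budget for retrieved passages.
    budget = (context_max_length
              - count_tokens(few_shot_block)
              - count_tokens(question_block))

    kept = []
    for passage in passages[:key_num]:
        if budget <= 0:
            break
        words = passage.split()[:budget]  # truncate the passage if needed
        kept.append(" ".join(words))
        budget -= len(words)

    return "\n".join(kept + few_shot_examples + [question_block])


prompt = build_prompt(
    passages=["a b c d e", "f g h"],
    few_shot_examples=["Question: x Answer: y"],
    question="who",
    context_max_length=13,
)
```

With a budget of 13 whitespace tokens, the first passage is kept whole and the second is cut to a single word, while the few-shot example and the question survive untouched.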

Usage

Use this for evaluating retrieval-augmented QA systems on Natural Questions, TriviaQA, or similar datasets where answers are short spans.

Code Reference

Source Location

Signature

def process_qa(tokenizer, context_max_length=2048, key_num=3,
               few_shot=0, train_data=None, cache_dir=None,
               is_encoder_decoder=False)

def evaluate_qa(eval_data, save_path, **kwds)

def main()  # Entry point with QAArgs and GenerationArgs

Import

from research.llm_embedder.evaluation.eval_qa import main, evaluate_qa

I/O Contract

Inputs

Name                Type  Required  Description
eval_data           str   Yes       Path to test data JSON (NQ, TriviaQA format)
train_data          str   No        Path to training data for few-shot examples
corpus              str   Yes       Path to corpus for retrieval
model_name_or_path  str   Yes       LLM for answer generation
query_encoder       str   No        Dense encoder for retrieval
few_shot            int   No        Number of few-shot examples (default: 10)
key_num             int   No        Number of passages to provide (default: 3)

Outputs

Name         Type   Description
exact_match  float  Exact match accuracy after normalization
result_file  JSON   Saved results with queries, predictions, and answers

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_qa.py \
    --eval_data llm-embedder:qa/nq/test.json \
    --train_data llm-embedder:qa/nq/dev.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 10 \
    --key_num 3 \
    --max_new_tokens 32 \
    --output_dir data/results/qa

# Data format (test.json):
# {"query": "who wrote the song i can only imagine",
#  "answers": ["Bart Millard"],
#  "query_id": 0}

# Results: {"exact_match": 0.451}
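Records in the format shown above can be read and checked with a short sketch like the one below. It assumes the file stores one JSON object per line, which this page does not state, and the helper name parse_records is hypothetical.

```python
import json

REQUIRED_FIELDS = {"query", "answers"}


def parse_records(lines):
    """Parse one JSON object per line and check the fields the evaluation needs."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        records.append(record)
    return records


sample = [
    '{"query": "who wrote the song i can only imagine", '
    '"answers": ["Bart Millard"], "query_id": 0}',
]
records = parse_records(sample)
```

To read from disk, pass an open file handle instead of the sample list, for example `parse_records(open(path, encoding="utf-8"))`.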
