Implementation:FlagOpen FlagEmbedding LLM Embedder Eval QA
| Knowledge Sources | |
|---|---|
| Domains | Question_Answering, Natural_Questions, Retrieval_Augmented_Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation pipeline for retrieval-augmented generation on open-domain QA datasets (Natural Questions, TriviaQA), scored by exact-match accuracy.
Description
This implementation evaluates retrieval-augmented QA systems on standard open-domain datasets:
Retrieval phase: Uses dense encoders or BM25 to retrieve top-k relevant passages from a corpus given a question. The retrieved passages serve as knowledge context for answer generation.
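As a rough illustration of the dense variant of this step, the sketch below ranks passages by the inner product between a query embedding and passage embeddings and keeps the top k. The function name `top_k_passages` and the toy vectors are assumptions made for this example; the actual pipeline computes embeddings with a trained encoder and may use a different similarity or index.

```python
from typing import List, Sequence, Tuple


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest inner products."""
    scores = [
        (idx, sum(q * p for q, p in zip(query_vec, vec)))
        for idx, vec in enumerate(passage_vecs)
    ]
    # Sort by score descending; ties keep the lower passage index first.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy 2-dimensional embeddings, made up for illustration only.
query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
top2 = top_k_passages(query, passages, k=2)
print([idx for idx, _ in top2])
```

In this toy case the first and third passages rank highest because they have the largest component along the query direction.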
Generation phase: Formats prompts with:
- Retrieved knowledge passages (up to key_num)
- Few-shot examples (randomly sampled from training data)
- The test question
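The three prompt components listed above can be combined roughly as follows. This is a minimal sketch: the template wording ("Knowledge:", "Question:", "Answer:") and the helper names `sample_few_shot` and `build_prompt` are hypothetical and are not taken from eval_qa.py.

```python
import random
from typing import List, Tuple


def sample_few_shot(
    train_examples: List[Tuple[str, str]], few_shot: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Randomly pick up to few_shot (question, answer) pairs."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(train_examples, min(few_shot, len(train_examples)))


def build_prompt(
    passages: List[str],
    examples: List[Tuple[str, str]],
    question: str,
    key_num: int = 3,
) -> str:
    """Assemble knowledge passages, few-shot examples, and the test question."""
    parts = []
    kept = passages[:key_num]  # use at most key_num retrieved passages
    if kept:
        knowledge = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(kept))
        parts.append(f"Knowledge:\n{knowledge}")
    for q, a in examples:
        parts.append(f"Question: {q}\nAnswer: {a}")
    # The final block leaves the answer open for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


train = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
shots = sample_few_shot(train, few_shot=2)
prompt = build_prompt(["p1", "p2", "p3", "p4"], shots, "who wrote x")
print(prompt)
```

Ending the prompt on an open "Answer:" line encourages a short completion, which suits exact-match scoring.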
The LLM generates short answers which are evaluated using exact match after text normalization (lowercase, article removal, punctuation handling).
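For reference, the sketch below shows one common way to implement this metric, following the widely used SQuAD-style normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). It assumes punctuation is removed outright; the exact rules in evaluate_qa() may differ, and the function names here are chosen for illustration.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: List[str]) -> bool:
    """True if the prediction equals any reference answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ans) for ans in answers)


def exact_match_score(predictions: List[str], references: List[List[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


score = exact_match_score(
    ["The Bart Millard.", "Paris, France"],
    [["Bart Millard"], ["Paris"]],
)
print(score)
```

The second prediction does not count as a match: exact match gives no credit for answers that merely contain the reference.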
The process_qa() function handles context window management, truncating passages to fit while preserving few-shot examples and the query. The evaluate_qa() function computes exact match by normalizing both predictions and ground truth answers.
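One way to picture the context-window handling is as a token budget: the few-shot examples and the query are counted first, and passages are admitted only while budget remains. The sketch below is an assumption-laden illustration, not the logic of process_qa(): `fit_passages` is a hypothetical helper, and whitespace splitting stands in for the real model tokenizer.

```python
from typing import Callable, List


def fit_passages(
    passages: List[str],
    fixed_text: str,
    context_max_length: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Keep passages in order, truncating the first one that overflows.

    fixed_text holds the parts that must survive intact
    (few-shot examples plus the query).
    """
    budget = context_max_length - len(tokenize(fixed_text))
    kept: List[str] = []
    for passage in passages:
        if budget <= 0:
            break  # no room left for further knowledge
        tokens = tokenize(passage)[:budget]  # cut the passage to the budget
        kept.append(" ".join(tokens))
        budget -= len(tokens)
    return kept


fitted = fit_passages(["a b c", "d e f"], fixed_text="q1 q2", context_max_length=6)
print(fitted)
```

With a limit of 6 tokens and 2 reserved for the fixed text, the first passage fits whole and the second is cut to a single token.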
Usage
Use this to evaluate retrieval-augmented QA systems on Natural Questions, TriviaQA, or similar datasets whose answers are short spans.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_qa.py
- Lines: 1-261
Signature
def process_qa(tokenizer, context_max_length=2048, key_num=3,
               few_shot=0, train_data=None, cache_dir=None,
               is_encoder_decoder=False)
def evaluate_qa(eval_data, save_path, **kwds)
def main()  # Entry point with QAArgs and GenerationArgs
Import
from research.llm_embedder.evaluation.eval_qa import main, evaluate_qa
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to test data JSON (NQ, TriviaQA format) |
| train_data | str | No | Path to training data for few-shot examples |
| corpus | str | Yes | Path to corpus for retrieval |
| model_name_or_path | str | Yes | LLM for answer generation |
| query_encoder | str | No | Dense encoder for retrieval |
| few_shot | int | No | Number of few-shot examples (default: 10) |
| key_num | int | No | Number of passages to provide (default: 3) |
Outputs
| Name | Type | Description |
|---|---|---|
| exact_match | float | Exact match accuracy after normalization |
| result_file | JSON | Saved results with queries, predictions, and answers |
Usage Examples
# Command line usage
python research/llm_embedder/evaluation/eval_qa.py \
    --eval_data llm-embedder:qa/nq/test.json \
    --train_data llm-embedder:qa/nq/dev.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 10 \
    --key_num 3 \
    --max_new_tokens 32 \
    --output_dir data/results/qa
# Data format (test.json):
# {"query": "who wrote the song i can only imagine",
# "answers": ["Bart Millard"],
# "query_id": 0}
# Results: {"exact_match": 0.451}