Implementation:FlagOpen FlagEmbedding LLM Embedder Eval QA

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Question_Answering, Natural_Questions, Retrieval_Augmented_Generation
Last Updated	2026-02-09 00:00 GMT

Overview

Evaluation pipeline for retrieval-augmented generation on open-domain QA datasets (Natural Questions, TriviaQA), scored by exact-match accuracy.

Description

This implementation evaluates retrieval-augmented QA systems on standard open-domain datasets in two phases:

Retrieval phase: Uses dense encoders or BM25 to retrieve top-k relevant passages from a corpus given a question. The retrieved passages serve as knowledge context for answer generation.
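The top-k selection step can be illustrated with a toy sketch. This is not the repository's retriever: the function name top_k_passages, the hand-written two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration. A real run would embed queries and passages with the configured encoder or score them with BM25.

```python
import math


def top_k_passages(query_vec, passage_vecs, k=3):
    """Return indices of the k passages most similar to the query.

    Hypothetical helper: cosine similarity over plain Python lists.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scores = [cosine(query_vec, p) for p in passage_vecs]
    # Sort passage indices by descending similarity and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy embeddings (made up for the example).
query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
top = top_k_passages(query, passages, k=2)
```

The returned indices would then be used to look up passage text in the corpus before prompt construction.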

Generation phase: Formats prompts with:

Retrieved knowledge passages (up to key_num)
Few-shot examples (randomly sampled from training data)
The test question
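A minimal sketch of how the three prompt components listed above might be assembled. The template text ("Knowledge:", "Question:", "Answer:"), the helper names sample_few_shot and build_prompt, and the fixed seed are assumptions; the actual template in eval_qa.py may differ. The record fields query and answers follow the data format shown under Usage Examples.

```python
import random


def sample_few_shot(train_examples, few_shot, seed=0):
    """Randomly pick few-shot demonstrations (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(train_examples, min(few_shot, len(train_examples)))


def build_prompt(question, passages, few_shot_examples, key_num=3):
    """Join knowledge, demonstrations, and the test question into one prompt."""
    parts = []
    knowledge = passages[:key_num]  # keep at most key_num passages
    if knowledge:
        parts.append("Knowledge:\n" + "\n".join(knowledge))
    for ex in few_shot_examples:
        # Use the first reference answer as the demonstration target.
        parts.append(f"Question: {ex['query']}\nAnswer: {ex['answers'][0]}")
    # The test question comes last, with the answer left open for the LLM.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


train = [
    {"query": "capital of france", "answers": ["Paris"]},
    {"query": "capital of japan", "answers": ["Tokyo"]},
]
shots = sample_few_shot(train, few_shot=1)
prompt = build_prompt(
    "who wrote the song i can only imagine",
    ["passage one", "passage two", "passage three", "passage four"],
    shots,
    key_num=3,
)
```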

The LLM generates short answers which are evaluated using exact match after text normalization (lowercase, article removal, punctuation handling).
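The scoring rule can be sketched as follows. This follows the common SQuAD-style normalization (lowercasing, stripping punctuation, dropping the articles a/an/the, collapsing whitespace); whether evaluate_qa() applies exactly these steps in this order is an assumption, and the function names here are illustrative.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(a) for a in answers))


def exact_match_accuracy(predictions, answer_lists):
    """Mean exact match over a dataset."""
    scores = [exact_match(p, a) for p, a in zip(predictions, answer_lists)]
    return sum(scores) / len(scores) if scores else 0.0


accuracy = exact_match_accuracy(
    ["The Bart Millard.", "Paris"],
    [["Bart Millard"], ["London"]],
)
```

Because a prediction only has to match one of the reference answers, datasets with several acceptable answers per question are scored leniently across aliases but strictly on wording.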

The process_qa() function handles context window management: it truncates the retrieved passages so that the full prompt fits within context_max_length, while preserving the few-shot examples and the query. The evaluate_qa() function computes exact match by normalizing both predictions and ground-truth answers.
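One way to picture the truncation is as a token budget: the few-shot examples and the query are counted first, and the passages share whatever remains. The sketch below is an assumption about the general approach, not the code in process_qa(). The helper names are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
def count_tokens(text):
    """Stand-in for a real tokenizer's length (assumption: whitespace split)."""
    return len(text.split())


def fit_passages(passages, fixed_text, context_max_length, key_num=3):
    """Trim passages so fixed_text plus passages fits the context budget.

    fixed_text holds the parts that must survive intact
    (few-shot examples and the test question).
    """
    budget = context_max_length - count_tokens(fixed_text)
    kept = []
    for passage in passages[:key_num]:
        if budget <= 0:
            break  # no room left; drop the remaining passages
        tokens = passage.split()
        if len(tokens) > budget:
            tokens = tokens[:budget]  # cut the passage to the remaining budget
        kept.append(" ".join(tokens))
        budget -= len(tokens)
    return kept


kept = fit_passages(
    ["a b c d", "e f g h", "i j"],
    fixed_text="q1 a1 question",
    context_max_length=9,
)
```

Trimming the passages rather than the demonstrations or the question keeps the task format intact even when the retrieved text is long.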

Usage

Use this for evaluating retrieval-augmented QA systems on Natural Questions, TriviaQA, or similar datasets where answers are short spans.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_embedder/evaluation/eval_qa.py
Lines: 1-261

Signature

def process_qa(tokenizer, context_max_length=2048, key_num=3,
               few_shot=0, train_data=None, cache_dir=None,
               is_encoder_decoder=False)

def evaluate_qa(eval_data, save_path, **kwds)

def main()  # Entry point with QAArgs and GenerationArgs

Import

from research.llm_embedder.evaluation.eval_qa import main, evaluate_qa

I/O Contract

Inputs

Name	Type	Required	Description
eval_data	str	Yes	Path to test data JSON (NQ, TriviaQA format)
train_data	str	No	Path to training data for few-shot examples
corpus	str	Yes	Path to corpus for retrieval
model_name_or_path	str	Yes	LLM for answer generation
query_encoder	str	No	Dense encoder for retrieval
few_shot	int	No	Number of few-shot examples (default: 10)
key_num	int	No	Number of passages to provide (default: 3)

Outputs

Name	Type	Description
exact_match	float	Exact match accuracy after normalization
result_file	JSON	Saved results with queries, predictions, and answers

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_qa.py \
    --eval_data llm-embedder:qa/nq/test.json \
    --train_data llm-embedder:qa/nq/dev.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 10 \
    --key_num 3 \
    --max_new_tokens 32 \
    --output_dir data/results/qa

# Data format (test.json):
# {"query": "who wrote the song i can only imagine",
#  "answers": ["Bart Millard"],
#  "query_id": 0}

# Results: {"exact_match": 0.451}
