Implementation:FlagOpen FlagEmbedding LLM Embedder Eval PopQA
| Knowledge Sources | |
|---|---|
| Domains | Question_Answering, Retrieval_Augmented_Generation, Knowledge_Intensive_Tasks |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for the PopQA (Popular Question Answering) dataset that measures answer accuracy under retrieval-augmented generation.
Description
This implementation evaluates retrieval-augmented QA by first retrieving relevant passages and then generating answers. It handles PopQA's property-based question format with 16 question templates across different properties (occupation, birthplace, genre, etc.).
Key features include:
- Retrieval-based passage selection from a corpus using dense or BM25 methods
- Few-shot prompting with property-specific question templates (avoiding train/test contamination by excluding same-property examples)
- Answer generation with retrieved knowledge context
- Accuracy evaluation checking if any possible answer appears in the generated text
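The same-property exclusion mentioned in the few-shot bullet can be sketched roughly as follows. This is a minimal illustration, not the code in eval_popqa.py: `sample_few_shot` and its `seed` argument are hypothetical, the example records are made up, and it assumes every training record carries a `prop_id` field.

```python
import random


def sample_few_shot(train_data, test_prop_id, k, seed=0):
    """Return up to k training examples whose prop_id differs from the test item's.

    Hypothetical helper for illustration; not part of eval_popqa.py.
    """
    # Drop every example that shares the test question's property, so the
    # demonstrations never reuse the template being evaluated.
    candidates = [ex for ex in train_data if ex["prop_id"] != test_prop_id]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Illustrative records; the prop_id values other than 22 are invented.
train_data = [
    {"query": "What is Marie Curie's occupation?", "prop_id": 22},
    {"query": "In what city was Mozart born?", "prop_id": 218},
    {"query": "What genre is Thriller?", "prop_id": 91},
]
shots = sample_few_shot(train_data, test_prop_id=22, k=15)
```

When fewer than `k` eligible examples exist, the sketch returns all of them rather than raising an error; the actual script may handle that case differently.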
The process_popqa() function builds each prompt by concatenating the retrieved passages, the few-shot examples, and the test query, keeping the result within the context window (context_max_length). A generation is counted as correct if any of the possible answer strings appears in it in its original, lowercased, or capitalized form.
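The correctness rule described above could be implemented along these lines. This is a sketch under assumptions rather than the script's own code: `is_correct` and `accuracy` are hypothetical names, and matching is assumed to be plain substring containment.

```python
def is_correct(generation, possible_answers):
    """Return True if any answer variant occurs as a substring of the generation."""
    for answer in possible_answers:
        # Try the answer as given, lowercased, and via str.capitalize()
        # (first character upper-cased, the rest lower-cased).
        variants = {answer, answer.lower(), answer.capitalize()}
        if any(variant in generation for variant in variants):
            return True
    return False


def accuracy(generations, answer_sets):
    """Proportion of generations judged correct; 0.0 for empty input."""
    if not generations:
        return 0.0
    hits = sum(is_correct(g, a) for g, a in zip(generations, answer_sets))
    return hits / len(generations)
```

Because the check is substring-based, a long generation that merely mentions a valid answer is also counted as correct, which is one reason the usage example keeps max_new_tokens small.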
Usage
Use this to evaluate retrieval-augmented QA systems on knowledge-intensive questions requiring factual information from an external corpus.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_popqa.py
- Lines: 1-281
Signature
def process_popqa(tokenizer, context_max_length=2048, key_num=3,
                  few_shot=0, train_data=None, cache_dir=None,
                  is_encoder_decoder=False)
def evaluate_popqa(eval_data, save_path, **kwds)
def main()  # Entry point with PopQAArgs and GenerationArgs
Import
from research.llm_embedder.evaluation.eval_popqa import main, evaluate_popqa
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to PopQA test data JSON |
| corpus | str | Yes | Path to corpus for retrieval |
| model_name_or_path | str | Yes | LLM for answer generation |
| query_encoder | str | No | Dense encoder for retrieval (if using dense method) |
| few_shot | int | No | Number of few-shot examples (command-line default: 15; process_popqa() itself defaults to 0) |
| key_num | int | No | Number of retrieved passages included in the prompt (default: 3) |
| hits | int | No | Number of candidate passages returned by the retriever (default: 10) |
Outputs
| Name | Type | Description |
|---|---|---|
| accuracy | float | Proportion of correct answers |
| result_file | JSON | Saved results with query, generation, and correctness |
Usage Examples
# Command line usage
python research/llm_embedder/evaluation/eval_popqa.py \
    --eval_data llm-embedder:qa/popqa/test.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 15 \
    --key_num 3 \
    --hits 10 \
    --max_new_tokens 16 \
    --output_dir data/results/popqa
# Data format:
# {"query": "What is Einstein's occupation?",
# "prop_id": 22,
# "possible_answers": ["physicist", "theoretical physicist"],
# "query_id": 123}
# Results: {"accuracy": 0.856}
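To check that a custom test file follows the data format shown above before running the evaluation, a small validation helper might look like this. It is a sketch only: `parse_record` and `REQUIRED_FIELDS` are hypothetical names, and it assumes each record is a single JSON object with exactly the four fields from the data-format comment.

```python
import json

# Field names taken from the data-format comment; the helper itself is hypothetical.
REQUIRED_FIELDS = {"query", "prop_id", "possible_answers", "query_id"}


def parse_record(line):
    """Parse one JSON record and verify it has the fields the evaluation relies on."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    if not isinstance(record["possible_answers"], list):
        raise ValueError("possible_answers must be a list of strings")
    return record


line = (
    '{"query": "What is Einstein\'s occupation?", "prop_id": 22, '
    '"possible_answers": ["physicist", "theoretical physicist"], "query_id": 123}'
)
record = parse_record(line)
```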