Implementation:FlagOpen FlagEmbedding LLM Embedder Eval PopQA

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Question_Answering, Retrieval_Augmented_Generation, Knowledge_Intensive_Tasks
Last Updated	2026-02-09 00:00 GMT

Overview

Evaluation framework for the PopQA (Popular Question Answering) dataset that runs retrieval-augmented generation and measures answer accuracy.

Description

This implementation evaluates retrieval-augmented QA by first retrieving relevant passages and then generating answers. It handles PopQA's property-based question format with 16 question templates across different properties (occupation, birthplace, genre, etc.).

Key features include:

Retrieval-based passage selection from a corpus using dense or BM25 methods
Few-shot prompting with property-specific question templates (avoiding train/test contamination by excluding same-property examples)
Answer generation with retrieved knowledge context
Accuracy evaluation checking if any possible answer appears in the generated text
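
The same-property exclusion mentioned in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in eval_popqa.py: the helper name select_few_shot, the seed argument, and the random sampling strategy are assumptions. Only the record fields (query, prop_id, possible_answers) come from the data format shown on this page.

```python
import random


def select_few_shot(train_data, test_prop_id, k, seed=0):
    """Hypothetical helper: pick up to k demonstrations whose prop_id
    differs from that of the test question."""
    # Drop every example that shares the test question's property, so the
    # demonstrations never reuse the template that is being evaluated.
    candidates = [ex for ex in train_data if ex["prop_id"] != test_prop_id]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    rng.shuffle(candidates)
    return candidates[:k]


# Toy records; the values are made up for illustration.
train = [
    {"query": "What is A's occupation?", "prop_id": 22, "possible_answers": ["actor"]},
    {"query": "In what city was B born?", "prop_id": 218, "possible_answers": ["Paris"]},
    {"query": "What genre is C?", "prop_id": 91, "possible_answers": ["jazz"]},
]

demos = select_few_shot(train, test_prop_id=22, k=2)
```

With the toy data, demos contains only the birthplace and genre examples, because the occupation example shares the test question's property.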

The process_popqa() function formats each prompt by concatenating the retrieved passages, the few-shot examples, and the test query so that the result fits within the context window. A generation is counted as correct if any of the possible answer strings appears in it, either as given, lowercased, or capitalized.
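
The matching rule described above can be illustrated with a short sketch. The function names is_correct and accuracy are hypothetical and are not taken from eval_popqa.py. The sketch assumes plain substring matching on three variants of each answer, which is one reading of the rule stated here.

```python
def is_correct(generation, possible_answers):
    """Hypothetical helper: True if any answer variant occurs in the text."""
    for answer in possible_answers:
        # Three variants: as given, all lowercase, first letter capitalized.
        variants = (answer, answer.lower(), answer.capitalize())
        if any(variant in generation for variant in variants):
            return True
    return False


def accuracy(generations, answer_lists):
    """Proportion of generations that contain an acceptable answer."""
    if not generations:
        return 0.0
    hits = sum(is_correct(g, a) for g, a in zip(generations, answer_lists))
    return hits / len(generations)


score = accuracy(
    ["He was a Physicist.", "She is a painter."],
    [["physicist", "theoretical physicist"], ["singer"]],
)
```

Here the first generation matches through the capitalized variant and the second does not match at all, so score is 0.5.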

Usage

Use this to evaluate retrieval-augmented QA systems on knowledge-intensive questions requiring factual information from an external corpus.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_embedder/evaluation/eval_popqa.py
Lines: 1-281

Signature

def process_popqa(tokenizer, context_max_length=2048, key_num=3,
                  few_shot=0, train_data=None, cache_dir=None,
                  is_encoder_decoder=False)

def evaluate_popqa(eval_data, save_path, **kwds)

def main()  # Entry point with PopQAArgs and GenerationArgs

Import

from research.llm_embedder.evaluation.eval_popqa import main, evaluate_popqa

I/O Contract

Inputs

Name	Type	Required	Description
eval_data	str	Yes	Path to PopQA test data JSON
corpus	str	Yes	Path to corpus for retrieval
model_name_or_path	str	Yes	LLM for answer generation
query_encoder	str	No	Dense encoder for retrieval (if using dense method)
few_shot	int	No	Number of few-shot examples (default: 15; process_popqa() itself defaults to few_shot=0)
key_num	int	No	Number of passages to retrieve (default: 3)
hits	int	No	Number of candidates from retrieval (default: 10)

Outputs

Name	Type	Description
accuracy	float	Proportion of correct answers
result_file	JSON	Saved results with query, generation, and correctness

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_popqa.py \
    --eval_data llm-embedder:qa/popqa/test.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 15 \
    --key_num 3 \
    --hits 10 \
    --max_new_tokens 16 \
    --output_dir data/results/popqa

# Data format:
# {"query": "What is Einstein's occupation?",
#  "prop_id": 22,
#  "possible_answers": ["physicist", "theoretical physicist"],
#  "query_id": 123}

# Results: {"accuracy": 0.856}
