Implementation:FlagOpen FlagEmbedding LLM Embedder Eval PopQA

From Leeroopedia


Knowledge Sources
Domains Question_Answering, Retrieval_Augmented_Generation, Knowledge_Intensive_Tasks
Last Updated 2026-02-09 00:00 GMT

Overview

Evaluation script for the PopQA (Popular Question Answering) dataset that runs retrieval-augmented generation and reports answer accuracy.

Description

This implementation evaluates retrieval-augmented QA by first retrieving relevant passages and then generating answers. It handles PopQA's property-based question format with 16 question templates across different properties (occupation, birthplace, genre, etc.).

Key features include:

  • Retrieval-based passage selection from a corpus using dense or BM25 methods
  • Few-shot prompting with property-specific question templates (avoiding train/test contamination by excluding same-property examples)
  • Answer generation with retrieved knowledge context
  • Accuracy evaluation checking if any possible answer appears in the generated text
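
The same-property exclusion in the few-shot step can be illustrated with a minimal sketch. It assumes each training record carries the prop_id and possible_answers fields shown in the data format on this page; the helper name select_few_shot, the seed argument, and the random sampling are hypothetical and are not taken from eval_popqa.py.

```python
import random


def select_few_shot(train_examples, test_prop_id, k, seed=0):
    """Pick up to k demonstrations whose prop_id differs from the test query's.

    Hypothetical helper: the real script may sample differently.
    """
    # Drop every demonstration that shares the test query's property.
    candidates = [ex for ex in train_examples if ex["prop_id"] != test_prop_id]
    # Shuffle with a fixed seed so the selection is reproducible.
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:k]


# Illustrative records only; the prop_id values are assumptions.
train = [
    {"query": "What is Einstein's occupation?", "prop_id": 22,
     "possible_answers": ["physicist"]},
    {"query": "In what city was Mozart born?", "prop_id": 19,
     "possible_answers": ["Salzburg"]},
    {"query": "What genre is Thriller?", "prop_id": 91,
     "possible_answers": ["pop"]},
]

demos = select_few_shot(train, test_prop_id=22, k=2)
```

With a test question about occupation (prop_id 22 in this toy data), only the birthplace and genre records remain eligible as demonstrations.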

The process_popqa() function builds each prompt by concatenating the retrieved passages, the few-shot examples, and the test query, keeping the result within the context window (context_max_length). A generation is counted as correct if any of the possible answer strings appears in it in its original, lowercased, or capitalized form.
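
A rough sketch of this prompt assembly is given below. It is not the code of process_popqa(): the function name build_prompt, the "Knowledge:" and "Q:/A:" layout, the whitespace token counter, and the policy of dropping the lowest-ranked passages first are all assumptions made for illustration. In practice the token count would come from the model tokenizer.

```python
def build_prompt(passages, demos, query, max_tokens, count_tokens=None):
    """Concatenate passages, few-shot demos, and the query within a token budget.

    Hypothetical helper. Assumes `passages` is ordered best-first and that the
    demos and query always fit inside `max_tokens`.
    """
    if count_tokens is None:
        # Stand-in tokenizer: counts whitespace-separated words.
        count_tokens = lambda text: len(text.split())

    demo_text = "".join(f"Q: {q} A: {a}\n\n" for q, a in demos)
    query_text = f"Q: {query} A:"

    # Reserve room for the demos and the query, then spend the rest on passages.
    budget = max_tokens - count_tokens(demo_text) - count_tokens(query_text)

    kept = []
    for passage in passages:
        line = f"Knowledge: {passage}\n"
        cost = count_tokens(line)
        if cost > budget:
            break  # Assumed policy: stop at the first passage that does not fit.
        kept.append(line)
        budget -= cost

    separator = "\n" if kept else ""
    return "".join(kept) + separator + demo_text + query_text


passages = [
    "Albert Einstein was a theoretical physicist.",
    "Einstein was born in Ulm.",
]
demos = [("In what city was Mozart born?", "Salzburg")]
prompt = build_prompt(passages, demos, "What is Einstein's occupation?",
                      max_tokens=24)
```

With a budget of 24 whitespace tokens, the first passage fits and the second is dropped, so the prompt ends with the unanswered test question.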

Usage

Use this to evaluate retrieval-augmented QA systems on knowledge-intensive questions requiring factual information from an external corpus.

Code Reference

Source Location

Signature

def process_popqa(tokenizer, context_max_length=2048, key_num=3,
                  few_shot=0, train_data=None, cache_dir=None,
                  is_encoder_decoder=False)

def evaluate_popqa(eval_data, save_path, **kwds)

def main()  # Entry point with PopQAArgs and GenerationArgs

Import

from research.llm_embedder.evaluation.eval_popqa import main, evaluate_popqa

I/O Contract

Inputs

Name | Type | Required | Description
eval_data | str | Yes | Path to PopQA test data JSON
corpus | str | Yes | Path to corpus for retrieval
model_name_or_path | str | Yes | LLM for answer generation
query_encoder | str | No | Dense encoder for retrieval (if using dense method)
few_shot | int | No | Number of few-shot examples (default: 15)
key_num | int | No | Number of retrieved passages included in the prompt (default: 3)
hits | int | No | Number of candidates returned by retrieval (default: 10)

Outputs

Name | Type | Description
accuracy | float | Proportion of correct answers
result_file | JSON | Saved results with query, generation, and correctness
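
The accuracy output can be illustrated with a small sketch of the matching rule described on this page: a generation counts as correct when any possible answer appears in it as written, lowercased, or capitalized. The function names is_correct and accuracy are hypothetical and are not taken from evaluate_popqa().

```python
def is_correct(generation, possible_answers):
    """Return True if any possible answer occurs in the generated text."""
    for answer in possible_answers:
        # Try the answer as written, lowercased, and with only the first
        # letter capitalized (str.capitalize lowercases the rest).
        variants = (answer, answer.lower(), answer.capitalize())
        if any(variant in generation for variant in variants):
            return True
    return False


def accuracy(generations, answer_sets):
    """Proportion of generations that contain at least one possible answer."""
    if not generations:
        return 0.0
    correct = sum(is_correct(g, a) for g, a in zip(generations, answer_sets))
    return correct / len(generations)


score = accuracy(
    ["Physicist and professor", "He was a painter."],
    [["physicist", "theoretical physicist"], ["composer"]],
)
```

Because this is substring matching, a long generation that mentions a correct answer anywhere is scored as correct, which is one reason a short generation limit such as --max_new_tokens 16 suits this metric.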

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_popqa.py \
    --eval_data llm-embedder:qa/popqa/test.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --few_shot 15 \
    --key_num 3 \
    --hits 10 \
    --max_new_tokens 16 \
    --output_dir data/results/popqa

# Data format:
# {"query": "What is Einstein's occupation?",
#  "prop_id": 22,
#  "possible_answers": ["physicist", "theoretical physicist"],
#  "query_id": 123}

# Results: {"accuracy": 0.856}
