Implementation:FlagOpen FlagEmbedding LLM Embedder Eval QReCC

From Leeroopedia


Knowledge Sources
Domains Conversational_Search, Question_Rewriting, Dialogue_Systems
Last Updated 2026-02-09 00:00 GMT

Overview

Evaluation framework for the QReCC (Question Rewriting in Conversational Context) dataset, measuring both retrieval and generation quality.

Description

This implementation evaluates conversational search systems on the QReCC dataset, whose queries depend on the preceding conversation context. The evaluation includes:

Retrieval evaluation: Computes nDCG and Recall@k on retrieved passages against gold standard relevance labels, measuring how well the system finds relevant documents for context-dependent queries.
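As a rough illustration of the retrieval metrics named above, the following sketch computes Recall@k and nDCG@k under the assumption of binary relevance labels. The helper names (recall_at_k, ndcg_at_k, evaluate_retrieval) and the dictionary layout are hypothetical and are not taken from eval_qrecc.py, which may compute these metrics differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary gains (assumption: each gold passage has gain 1)."""
    relevant = set(relevant_ids)
    # Discounted gain of the actual ranking; rank is 0-based, hence rank + 2.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant
    )
    # Ideal ranking places all relevant passages first.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate_retrieval(rankings, gold, cutoffs=(3, 10, 100)):
    """Average both metrics over queries; keys mirror the 'ndcg@k' / 'recall@k' naming."""
    metrics = {}
    n = len(rankings)
    for k in cutoffs:
        ndcg = sum(ndcg_at_k(rankings[q], gold.get(q, []), k) for q in rankings)
        rec = sum(recall_at_k(rankings[q], gold.get(q, []), k) for q in rankings)
        metrics[f"ndcg@{k}"] = ndcg / n if n else 0.0
        metrics[f"recall@{k}"] = rec / n if n else 0.0
    return metrics
```

Here rankings maps each query id to its ranked passage ids and gold maps each query id to its relevant passage ids; both structures are assumptions made for this sketch.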

Generation evaluation (optional): If do_generate=True, generates answers and computes ROUGE-L scores comparing generated responses to reference answers.
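To make the ROUGE-L score concrete, here is a minimal, dependency-free sketch of ROUGE-L F1 based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and is a stand-in for illustration only; the actual script may rely on a ROUGE library with different tokenization and scoring details. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 between a generated answer and one reference answer."""
    # Assumption: simple lowercased whitespace tokenization.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score under this sketch would be the mean of rouge_l_f1 over all evaluated queries.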

The process_qrecc() function builds the generation prompt for each conversational query by:

  • Prepending the retrieved knowledge passages
  • Appending the conversational query together with its context

The model then generates an answer that can draw on both the conversation history and the retrieved information.
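The prompt layout described above can be sketched as follows. This is an assumption-laden illustration: build_prompt and the "Knowledge / Conversation / Answer" template are hypothetical, and the real process_qrecc() additionally takes a tokenizer and a context_max_length, so it presumably handles tokenization and truncation, which this sketch omits.

```python
def build_prompt(query, passages, key_num=3):
    """Prepend up to key_num retrieved passages to a conversational query.

    Hypothetical helper; the template strings are illustrative assumptions.
    """
    selected = passages[:key_num]
    knowledge = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    if knowledge:
        return f"Knowledge:\n{knowledge}\n\nConversation:\n{query}\n\nAnswer:"
    # No retrieved passages: fall back to the query alone.
    return f"Conversation:\n{query}\n\nAnswer:"


prompt = build_prompt(
    "What about his childhood?",
    ["passage one", "passage two", "passage three", "passage four"],
    key_num=3,
)
```

Only the first key_num passages are kept, so "passage four" is left out of the prompt in this example.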

QReCC is challenging because queries contain coreferences and implicit context requiring understanding of the conversation flow.

Usage

Use this to evaluate conversational search systems where queries depend on dialogue context and require both retrieval and answer generation capabilities.

Code Reference

Source Location

Signature

def process_qrecc(tokenizer, context_max_length=2048, key_num=3,
                  is_encoder_decoder=False)

def evaluate_qrecc(eval_data, save_path, **kwds)

def main()  # Entry point with QRECCArgs and GenerationArgs

Import

from research.llm_embedder.evaluation.eval_qrecc import main, evaluate_qrecc

I/O Contract

Inputs

Name                Type  Required  Description
eval_data           str   Yes       Path to QReCC test data (concatenated format)
corpus              str   Yes       Path to corpus for retrieval
model_name_or_path  str   No        LLM for answer generation (if do_generate=True)
query_encoder       str   No        Dense encoder for retrieval
key_num             int   No        Number of passages to retrieve (default: 3)
hits                int   No        Number of candidates from retrieval (default: 100)
do_generate         bool  No        Whether to generate answers (default: False)

Outputs

Name                             Type   Description
ndcg@3, ndcg@10, ndcg@100        float  Normalized DCG at cutoffs
recall@3, recall@10, recall@100  float  Recall at cutoffs
rl                               float  ROUGE-L F1 score (if do_generate=True)

Usage Examples

# Retrieval-only evaluation
python research/llm_embedder/evaluation/eval_qrecc.py \
    --eval_data llm-embedder:convsearch/qrecc/test.concat.json \
    --corpus llm-embedder:convsearch/qrecc/corpus.json \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --hits 100 \
    --output_dir data/results/qrecc

# With answer generation
python research/llm_embedder/evaluation/eval_qrecc.py \
    --eval_data llm-embedder:convsearch/qrecc/test.concat.json \
    --corpus llm-embedder:convsearch/qrecc/corpus.json \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --do_generate \
    --key_num 3 \
    --max_new_tokens 128

# Data format:
# {"query": "What about his childhood?",  # Context-dependent
#  "query_id": 123,
#  "answers": ["He grew up in Boston"],
#  "positive_ctxs": [...]}

# Results: {"ndcg@10": 0.423, "recall@10": 0.678, "rl": 0.312}
