Implementation:FlagOpen FlagEmbedding LLM Embedder Eval QReCC
| Knowledge Sources | |
|---|---|
| Domains | Conversational_Search, Question_Rewriting, Dialogue_Systems |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation framework for the QReCC (Question Rewriting in Conversational Context) dataset that measures both retrieval and generation quality.
Description
This implementation evaluates conversational search systems on the QReCC dataset, whose queries depend on the preceding conversation context. The evaluation includes:
- Retrieval evaluation: Computes nDCG@k and Recall@k (k = 3, 10, 100) for the retrieved passages against gold-standard relevance labels, measuring how well the system finds relevant documents for context-dependent queries.
- Generation evaluation (optional): When do_generate=True, generates answers and computes ROUGE-L scores that compare each generated response with its reference answers.
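The two retrieval metrics can be sketched as follows. This is a minimal illustration that assumes binary relevance labels and unique passage IDs in the ranking; the function names are hypothetical and are not taken from eval_qrecc.py, which may compute the metrics differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # A relevant passage at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant
    )
    # The ideal ranking places every relevant passage first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and gold labels, for illustration only.
ranked = ["p7", "p2", "p9", "p4"]
gold = ["p2", "p4"]
recall_3 = recall_at_k(ranked, gold, 3)
ndcg_3 = ndcg_at_k(ranked, gold, 3)
```

In this toy ranking only one of the two gold passages falls inside the top 3, so Recall@3 is 0.5, while nDCG@3 is lower because that passage sits at rank 2 rather than rank 1.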
The process_qrecc() function formats prompts for conversational queries by:
- Prepending retrieved knowledge passages
- Including the conversational query with context
- Framing the prompt so that generated answers draw on both the conversation history and the retrieved information
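The formatting steps above can be sketched as below. The template text, the function name build_prompt, and the passage numbering are assumptions made for illustration; the actual template used by process_qrecc() is not shown in this page, and truncation to context_max_length is omitted.

```python
def build_prompt(query, passages, key_num=3):
    """Prepend the top `key_num` retrieved passages to a conversational query.

    Hypothetical template: the wording of the real prompt is an assumption.
    """
    knowledge = "\n".join(
        f"[{i + 1}] {text}" for i, text in enumerate(passages[:key_num])
    )
    return f"Knowledge:\n{knowledge}\n\nConversation:\n{query}\n\nAnswer:"


# Hypothetical inputs, for illustration only.
passages = ["Passage A", "Passage B", "Passage C", "Passage D"]
prompt = build_prompt("What about his childhood?", passages, key_num=3)
```

Only the first key_num passages reach the prompt, so the fourth passage in this example is dropped.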
QReCC is challenging because its queries contain coreferences and implicit context that can only be resolved by following the conversation flow.
Usage
Use this to evaluate conversational search systems where queries depend on dialogue context and require both retrieval and answer generation capabilities.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_qrecc.py
- Lines: 1-235
Signature
def process_qrecc(tokenizer, context_max_length=2048, key_num=3,
                  is_encoder_decoder=False)
def evaluate_qrecc(eval_data, save_path, **kwds)
def main() # Entry point with QRECCArgs and GenerationArgs
Import
from research.llm_embedder.evaluation.eval_qrecc import main, evaluate_qrecc
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to QReCC test data (concatenated format) |
| corpus | str | Yes | Path to corpus for retrieval |
| model_name_or_path | str | No | LLM for answer generation (if do_generate=True) |
| query_encoder | str | No | Dense encoder for retrieval |
| key_num | int | No | Number of retrieved passages prepended to the generation prompt (default: 3) |
| hits | int | No | Number of candidates from retrieval (default: 100) |
| do_generate | bool | No | Whether to generate answers (default: False) |
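The inputs in the table could be mirrored in a small configuration object such as the one below. This is a sketch only: QReCCEvalConfig is a hypothetical name (the script itself uses QRECCArgs and GenerationArgs, whose definitions are not shown here), and both validation rules are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QReCCEvalConfig:
    """Hypothetical container mirroring the Inputs table; not the script's own class."""

    eval_data: str                            # path to QReCC test data
    corpus: str                               # path to the retrieval corpus
    model_name_or_path: Optional[str] = None  # LLM used when do_generate=True
    query_encoder: Optional[str] = None       # dense encoder for retrieval
    key_num: int = 3                          # passages placed in the prompt
    hits: int = 100                           # candidates returned by retrieval
    do_generate: bool = False                 # whether to generate answers

    def __post_init__(self):
        # Assumed check: generation needs a model to generate with.
        if self.do_generate and self.model_name_or_path is None:
            raise ValueError("do_generate=True requires model_name_or_path")
        # Assumed check: the prompt cannot use more passages than were retrieved.
        if self.key_num > self.hits:
            raise ValueError("key_num must not exceed hits")


config = QReCCEvalConfig(
    eval_data="llm-embedder:convsearch/qrecc/test.concat.json",
    corpus="llm-embedder:convsearch/qrecc/corpus.json",
    query_encoder="BAAI/llm-embedder",
)
```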
Outputs
| Name | Type | Description |
|---|---|---|
| ndcg@3, ndcg@10, ndcg@100 | float | Normalized DCG at cutoffs |
| recall@3, recall@10, recall@100 | float | Recall at cutoffs |
| rl | float | ROUGE-L F1 score (if do_generate=True) |
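For reference, ROUGE-L F1 is based on the longest common subsequence (LCS) between a generated answer and a reference. The sketch below assumes lowercased whitespace tokenization and no stemming; the ROUGE implementation used by the evaluation script is not specified here and may tokenize or aggregate differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated answer scored against a reference answer.
score = rouge_l_f1("He was raised in Boston", "He grew up in Boston")
```

Here the LCS is "he in boston" (3 tokens) against two 5-token strings, so precision, recall, and F1 are all 0.6.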
Usage Examples
# Retrieval-only evaluation
python research/llm_embedder/evaluation/eval_qrecc.py \
--eval_data llm-embedder:convsearch/qrecc/test.concat.json \
--corpus llm-embedder:convsearch/qrecc/corpus.json \
--retrieval_method dense \
--query_encoder BAAI/llm-embedder \
--hits 100 \
--output_dir data/results/qrecc
# With answer generation
python research/llm_embedder/evaluation/eval_qrecc.py \
--eval_data llm-embedder:convsearch/qrecc/test.concat.json \
--corpus llm-embedder:convsearch/qrecc/corpus.json \
--retrieval_method dense \
--query_encoder BAAI/llm-embedder \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--do_generate \
--key_num 3 \
--max_new_tokens 128
# Data format:
# {"query": "What about his childhood?", # Context-dependent
# "query_id": 123,
# "answers": ["He grew up in Boston"],
# "positive_ctxs": [...]}
# Results: {"ndcg@10": 0.423, "recall@10": 0.678, "rl": 0.312}
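A record in the format shown above could be checked before evaluation with a sketch like the following. The helper parse_record and the required-field set are assumptions derived from the example record; the contents of positive_ctxs are not specified in this page, so the sample leaves that list empty as a placeholder.

```python
import json

# Fields taken from the example record above; treating all as required is an assumption.
REQUIRED_KEYS = {"query", "query_id", "answers", "positive_ctxs"}


def parse_record(line):
    """Parse one JSON record and verify that the expected fields are present."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return record


# positive_ctxs is an empty placeholder; its real structure is not documented here.
sample = (
    '{"query": "What about his childhood?", "query_id": 123, '
    '"answers": ["He grew up in Boston"], "positive_ctxs": []}'
)
record = parse_record(sample)
```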