Implementation:FlagOpen FlagEmbedding LLM Embedder Eval MSC
| Knowledge Sources | |
|---|---|
| Domains | Large_Language_Models, Conversational_AI, Multi_Turn_Dialogue |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for Multi-Session Chat (MSC) perplexity measurement with conversation history retrieval.
Description
This implementation evaluates language models on multi-turn conversations by computing perplexity with retrieval-augmented context. Because conversation histories vary in length, the HistoryCollator pads them to a common length within each batch and creates a boolean mask that marks which entries are real history turns rather than padding.
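The padding and masking step described above can be sketched as follows. This is a minimal illustration, not the collator's actual code: the helper name pad_histories and the empty-string pad value are assumptions, and NumPy is used for the mask here even though the I/O contract lists a torch.BoolTensor.

```python
import numpy as np

def pad_histories(histories, pad_value=""):
    """Pad each history (a list of turns) to the longest one in the batch.

    Hypothetical helper: the name and the "" pad value are assumptions.
    Returns (padded, mask), both of shape (batch_size, longest).
    """
    longest = max((len(h) for h in histories), default=0)
    padded = [list(h) + [pad_value] * (longest - len(h)) for h in histories]
    # True marks a real history turn, False marks padding.
    mask = [[True] * len(h) + [False] * (longest - len(h)) for h in histories]
    padded_arr = np.array(padded, dtype=object).reshape(len(histories), longest)
    mask_arr = np.array(mask, dtype=bool).reshape(len(histories), longest)
    return padded_arr, mask_arr

histories = [["I like cats", "They are cute"], ["Hello"]]
padded, mask = pad_histories(histories)
```

With the two histories above, the second row is padded with one empty string and its mask is [True, False], so downstream retrieval can ignore the padded slot.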
The evaluation uses SelfRetrievalLM to:
- Retrieve relevant conversation turns from history based on the current query
- Integrate retrieved context into the prompt
- Compute perplexity on the next response
This measures how well retrieval helps the LM predict responses in long conversations whose full history exceeds the context window. The script supports several retrieval methods, selected with retrieval_method, and can optionally apply task-specific instructions for the chat domain (the --add_instruction flag in the example command).
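The retrieve-then-integrate steps listed above might look roughly like the sketch below. It assumes dense retrieval by dot product over precomputed embeddings; the function names retrieve_turns and build_prompt, the chronological re-ordering of retrieved turns, and the newline-joined prompt layout are all illustrative assumptions and are not taken from SelfRetrievalLM.

```python
import numpy as np

def retrieve_turns(query_vec, history_vecs, history_mask, key_num=1):
    """Return indices of the top-`key_num` valid history turns.

    Hypothetical helper: scores are plain dot products, and padded
    entries (mask == False) are excluded by setting their score to -inf.
    """
    scores = history_vecs @ query_vec
    scores = np.where(history_mask, scores, -np.inf)
    k = min(key_num, int(history_mask.sum()))
    top = np.argsort(-scores, kind="stable")[:k]
    # Assumption: keep retrieved turns in their original conversation order.
    return sorted(top.tolist())

def build_prompt(history, indices, query):
    """Prepend the retrieved turns to the query (assumed layout)."""
    context = "\n".join(history[i] for i in indices)
    return f"{context}\n{query}" if context else query

history = ["I like cats", "They are cute", ""]           # last entry is padding
history_mask = np.array([True, True, False])
history_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # toy embeddings
query_vec = np.array([0.9, 0.1])

indices = retrieve_turns(query_vec, history_vecs, history_mask, key_num=1)
prompt = build_prompt(history, indices, "What did I say earlier?")
```

Note that the padded entry has the largest raw score in this toy example, which is why the mask has to be applied before ranking.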
Usage
Use this to evaluate retrieval-augmented language models on multi-turn dialogue tasks where conversation history must be selectively retrieved rather than fully included.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_msc.py
- Lines: 1-154
Signature
@dataclass
class HistoryCollator:
    def __call__(self, batch_elem): ...

def main(): ...  # Entry point with LRLMArgs configuration
Import
from research.llm_embedder.evaluation.eval_msc import main, HistoryCollator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to JSON with queries, histories, and answers |
| model_name_or_path | str | Yes | HuggingFace model name or path |
| retrieval_method | str | Yes | Retrieval method for history: "dense", "bm25", "no" |
| key_num | int | No | Number of history turns to retrieve (default: 1) |
| batch_elem | List[Dict] | Yes | Argument to HistoryCollator.__call__: a batch of examples with query, history, and answer fields |
Outputs
| Name | Type | Description |
|---|---|---|
| query | np.ndarray | Array of queries |
| history | np.ndarray | Padded conversation histories |
| history_mask | torch.BoolTensor | Mask indicating valid history entries |
| answer | np.ndarray | Ground truth answers |
| perplexity | float | Perplexity on answer generation |
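For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition applied to answer-token log-probabilities. The function names are hypothetical, and pooling all answer tokens across the evaluation set before averaging is an assumption about how the reported value is aggregated.

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over one sequence of answer tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def corpus_perplexity(batches_of_logprobs):
    """Pool all answer tokens across examples, then compute perplexity.

    Assumption: token-level pooling; a script could instead average
    per-example perplexities, which generally gives a different number.
    """
    all_tokens = [lp for seq in batches_of_logprobs for lp in seq]
    return perplexity(all_tokens)

# Natural-log probabilities of each answer token under the model (toy values).
example = [math.log(0.5)] * 4
ppl = perplexity(example)
```

A model that assigns probability 0.5 to every answer token has a perplexity of about 2, meaning it is as uncertain as a uniform choice between two tokens at each step.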
Usage Examples
# Command line usage
python research/llm_embedder/evaluation/eval_msc.py \
    --eval_data llm-embedder:chat/msc/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --key_num 3 \
    --add_instruction \
    --log_path data/results/msc/msc.log
# Data format (test.json):
# {"query": "What did I say earlier?",
# "history": ["I like cats", "They are cute", ...],
# "answers": ["You said you like cats"]}
# Results: {"perplexity": 8.45}