Implementation:FlagOpen FlagEmbedding LLM Embedder Eval MSC

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Large_Language_Models, Conversational_AI, Multi_Turn_Dialogue
Last Updated	2026-02-09 00:00 GMT

Overview

Evaluation script that measures perplexity on Multi-Session Chat (MSC) when the language model is given turns retrieved from the conversation history.

Description

This implementation evaluates language models on multi-turn conversations by computing perplexity with retrieval-augmented context. The HistoryCollator handles conversation histories of variable lengths, padding them and creating masks to identify valid history entries.
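
The padding and masking step described above can be sketched as follows. This is a minimal illustration, not the repository's HistoryCollator: the class name ToyHistoryCollator and the pad_value field are hypothetical, and plain Python lists stand in for the np.ndarray and torch.BoolTensor outputs listed in the I/O contract.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyHistoryCollator:
    """Illustrative stand-in for a history collator (hypothetical name)."""

    pad_value: str = ""  # assumed placeholder for missing history turns

    def __call__(self, batch_elem: List[Dict]) -> Dict[str, list]:
        # The longest history in the batch sets the padded length.
        max_len = max(len(elem["history"]) for elem in batch_elem)
        history, mask = [], []
        for elem in batch_elem:
            turns = list(elem["history"])
            n_pad = max_len - len(turns)
            # Pad on the right; mark real turns True and padding False.
            history.append(turns + [self.pad_value] * n_pad)
            mask.append([True] * len(turns) + [False] * n_pad)
        return {
            "query": [elem["query"] for elem in batch_elem],
            "history": history,
            "history_mask": mask,
            "answer": [elem["answer"] for elem in batch_elem],
        }
```

Downstream code can then use history_mask to exclude padded entries when scoring history turns against the query.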

The evaluation uses SelfRetrievalLM to:

1. Retrieve relevant conversation turns from history based on the current query
2. Integrate retrieved context into the prompt
3. Compute perplexity on the next response
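
The first two steps can be sketched with a toy retriever. This is only an illustration under stated assumptions: the token-overlap score stands in for the dense or BM25 scoring a real retriever would use, and the function names (overlap_score, retrieve_turns, build_prompt) and the newline-joined prompt layout are hypothetical, not taken from SelfRetrievalLM.

```python
from typing import List


def overlap_score(query: str, turn: str) -> int:
    """Count distinct lowercase tokens shared by the query and a history turn."""
    return len(set(query.lower().split()) & set(turn.lower().split()))


def retrieve_turns(query: str, history: List[str], key_num: int = 1) -> List[str]:
    """Return the key_num best-scoring turns, kept in conversation order."""
    ranked = sorted(
        range(len(history)),
        key=lambda i: overlap_score(query, history[i]),
        reverse=True,
    )
    keep = sorted(ranked[:key_num])  # restore chronological order
    return [history[i] for i in keep]


def build_prompt(query: str, retrieved: List[str]) -> str:
    """Place retrieved turns before the current query (assumed layout)."""
    return "\n".join(retrieved + [query])
```

Keeping the retrieved turns in their original order is a design choice made here so the prompt still reads as a conversation; the actual script may order them differently.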

This measures how much retrieval helps the LM predict responses in long conversations where the full history exceeds the context window. The retrieval_method argument accepts "dense", "bm25", or "no", and the script can apply task-specific instructions for the chat domain.
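
For reference, perplexity is conventionally the exponential of the mean negative log-likelihood of the target tokens. The sketch below applies that standard definition to a list of per-token log-probabilities; how the script aggregates tokens across samples is not stated in this page, so the single flat average is an assumption.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp(mean negative log-likelihood) over the answer tokens."""
    if not token_log_probs:
        raise ValueError("need at least one answer token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

If retrieved context raises the probability the model assigns to each answer token, the mean negative log-likelihood drops and so does the perplexity, which is why a lower value indicates more useful retrieval.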

Usage

Use this to evaluate retrieval-augmented language models on multi-turn dialogue tasks where conversation history must be selectively retrieved rather than fully included.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_embedder/evaluation/eval_msc.py
Lines: 1-154

Signature

@dataclass
class HistoryCollator:
    def __call__(self, batch_elem)

def main()  # Entry point with LRLMArgs configuration

Import

from research.llm_embedder.evaluation.eval_msc import main, HistoryCollator

I/O Contract

Inputs

Name	Type	Required	Description
eval_data	str	Yes	Path to JSON with queries, histories, and answers
model_name_or_path	str	Yes	HuggingFace model name or path
retrieval_method	str	Yes	Retrieval method for history: "dense", "bm25", "no"
key_num	int	No	Number of history turns to retrieve (default: 1)
batch_elem	List[Dict]	Yes	Batch with query, history, answer fields

Outputs

Name	Type	Description
query	np.ndarray	Array of queries
history	np.ndarray	Padded conversation histories
history_mask	torch.BoolTensor	Mask indicating valid history entries
answer	np.ndarray	Ground truth answers
perplexity	float	Perplexity on answer generation

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_msc.py \
    --eval_data llm-embedder:chat/msc/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --key_num 3 \
    --add_instruction \
    --log_path data/results/msc/msc.log

# Data format (test.json):
# {"query": "What did I say earlier?",
#  "history": ["I like cats", "They are cute", ...],
#  "answers": ["You said you like cats"]}

# Results: {"perplexity": 8.45}
