Implementation:FlagOpen FlagEmbedding LLM Embedder Eval MSC

From Leeroopedia


Knowledge Sources
Domains Large_Language_Models, Conversational_AI, Multi_Turn_Dialogue
Last Updated 2026-02-09 00:00 GMT

Overview

Evaluation script for Multi-Session Chat (MSC) perplexity measurement with conversation history retrieval.

Description

This implementation evaluates language models on multi-turn conversations by computing perplexity with retrieval-augmented context. The HistoryCollator handles conversation histories of variable lengths, padding them and creating masks to identify valid history entries.
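
The padding and masking behaviour described above can be illustrated with a small sketch. This is not the repository's HistoryCollator: the class name `ToyHistoryCollator`, the empty-string pad value, and the use of plain Python lists (rather than the NumPy arrays and torch.BoolTensor listed under Outputs) are all assumptions made to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyHistoryCollator:
    """Hypothetical stand-in for HistoryCollator (not the repository's code)."""

    pad_value: str = ""  # assumed filler for missing history turns

    def __call__(self, batch_elem: List[Dict[str, Any]]) -> Dict[str, list]:
        # Pad every history to the longest one in this batch
        longest = max(len(elem["history"]) for elem in batch_elem)
        batch: Dict[str, list] = {
            "query": [], "history": [], "history_mask": [], "answer": [],
        }
        for elem in batch_elem:
            turns = list(elem["history"])
            n_pad = longest - len(turns)
            batch["query"].append(elem["query"])
            batch["history"].append(turns + [self.pad_value] * n_pad)
            # True marks a real turn, False marks padding
            batch["history_mask"].append([True] * len(turns) + [False] * n_pad)
            batch["answer"].append(elem["answer"])
        return batch


examples = [
    {"query": "q1", "history": ["a", "b", "c"], "answer": "x"},
    {"query": "q2", "history": ["d"], "answer": "y"},
]
collated = ToyHistoryCollator()(examples)
```

Here the second example's history is padded from one turn to three, and its mask row keeps only the first position valid, so downstream retrieval can ignore the padded entries.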

The evaluation uses SelfRetrievalLM to:

  • Retrieve relevant conversation turns from history based on the current query
  • Integrate retrieved context into the prompt
  • Compute perplexity on the next response
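
The first two steps above can be sketched with a toy retriever. This is only an illustration: the word-overlap score stands in for whichever retrieval method is configured, and the helper names `overlap_score`, `retrieve_turns`, and `build_prompt`, as well as the newline-joined prompt layout, are hypothetical and not taken from SelfRetrievalLM.

```python
from typing import List


def overlap_score(query: str, turn: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and turn."""
    return len(set(query.lower().split()) & set(turn.lower().split()))


def retrieve_turns(query: str, history: List[str], key_num: int = 1) -> List[str]:
    """Return the key_num highest-scoring turns, kept in conversation order."""
    ranked = sorted(
        range(len(history)),
        key=lambda i: overlap_score(query, history[i]),
        reverse=True,
    )
    keep = sorted(ranked[:key_num])  # restore chronological order
    return [history[i] for i in keep]


def build_prompt(query: str, retrieved: List[str]) -> str:
    """Prepend retrieved turns to the current query (assumed prompt layout)."""
    return "\n".join(retrieved + [query])


history = ["I like cats", "The weather is nice", "My cats are cute"]
query = "Tell me about my cats"
retrieved = retrieve_turns(query, history, key_num=2)
prompt = build_prompt(query, retrieved)
```

With `key_num=2`, the unrelated weather turn is dropped and only the two cat-related turns are placed ahead of the query. The language model would then score the next response conditioned on this shortened prompt.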

This measures how much retrieval helps the LM predict responses in long conversations where the full history exceeds the context window. The script accepts several retrieval methods ("dense", "bm25", or "no" for no retrieval) and can apply task-specific instructions for the chat domain.

Usage

Use this to evaluate retrieval-augmented language models on multi-turn dialogue tasks where conversation history must be selectively retrieved rather than fully included.

Code Reference

Source Location

File: research/llm_embedder/evaluation/eval_msc.py

Signature

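from dataclasses import dataclass

@dataclass
class HistoryCollator:
    # Collates a batch: pads histories and builds the history mask
    def __call__(self, batch_elem): ...

def main(): ...  # Entry point with LRLMArgs configuration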

Import

from research.llm_embedder.evaluation.eval_msc import main, HistoryCollator

I/O Contract

Inputs

Name | Type | Required | Description
eval_data | str | Yes | Path to JSON with queries, histories, and answers
model_name_or_path | str | Yes | HuggingFace model name or path
retrieval_method | str | Yes | Retrieval method for history: "dense", "bm25", "no"
key_num | int | No | Number of history turns to retrieve (default: 1)
batch_elem | List[Dict] | Yes | Batch with query, history, answer fields

Outputs

Name | Type | Description
query | np.ndarray | Array of queries
history | np.ndarray | Padded conversation histories
history_mask | torch.BoolTensor | Mask indicating valid history entries
answer | np.ndarray | Ground truth answers
perplexity | float | Perplexity on answer generation
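
For reference, perplexity is conventionally the exponential of the mean negative log-likelihood of the target tokens. The sketch below shows that standard definition on made-up per-token log-probabilities. The function name `perplexity` and the input values are illustrative, and how the script aggregates across tokens and examples is not stated on this page.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over answer tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities for one answer
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
ppl = perplexity(log_probs)
```

A lower value means the model assigns higher probability to the ground-truth answer, which is why better retrieval is expected to reduce the reported perplexity.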

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_msc.py \
    --eval_data llm-embedder:chat/msc/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --retrieval_method dense \
    --query_encoder BAAI/llm-embedder \
    --key_num 3 \
    --add_instruction \
    --log_path data/results/msc/msc.log

# Data format (test.json):
# {"query": "What did I say earlier?",
#  "history": ["I like cats", "They are cute", ...],
#  "answers": ["You said you like cats"]}

# Results: {"perplexity": 8.45}
