Implementation:FlagOpen FlagEmbedding LLM Embedder LM Score
| Knowledge Sources | |
|---|---|
| Domains | Language_Modeling, Knowledge_Distillation, Negative_Likelihood |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Script for computing language model scores on query-passage-answer triplets to generate teacher signals for retrieval model training.
Description
This implementation uses a language model to compute negative log-likelihoods (NLLs) that measure how much each passage helps the model generate the known answer to a query. The resulting values serve as teacher scores for knowledge distillation in retrieval training.
Process:
1. For each query, every candidate passage (positives and negatives) is scored by computing the NLL of generating the answer conditioned on query + passage.
2. Passages that lead to lower perplexity (more natural answer generation) receive higher scores.
3. Scores are collated by query and saved back to the dataset as a "teacher_scores" field.
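The scoring steps above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the per-token log-probabilities are hard-coded stand-ins for what a causal LM would return, the helper names `answer_nll` and `score_passages` are hypothetical, and averaging (rather than summing) over answer tokens is an assumption.

```python
def answer_nll(token_logprobs, answer_mask):
    """Mean negative log-likelihood over the answer tokens only."""
    picked = [lp for lp, keep in zip(token_logprobs, answer_mask) if keep]
    if not picked:
        raise ValueError("answer_mask selects no tokens")
    return -sum(picked) / len(picked)


def score_passages(candidates):
    """Map (token_logprobs, answer_mask) pairs to scores; score = -NLL."""
    return [-answer_nll(logprobs, mask) for logprobs, mask in candidates]


# Hypothetical per-token log-probs: 2 context tokens followed by 2 answer tokens.
candidates = [
    ([-1.0, -2.0, -0.5, -0.5], [False, False, True, True]),  # helpful passage
    ([-1.0, -2.0, -3.0, -2.0], [False, False, True, True]),  # unhelpful passage
]
scores = score_passages(candidates)

# Index of the passage under which the answer is most likely.
best = max(range(len(scores)), key=scores.__getitem__)
```

Negating the NLL keeps the convention from step 2: a passage that makes the answer more likely ends up with the higher score.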
Task-specific formatting:
- QA tasks: "Knowledge: {passage}\n\nQuestion: {query}\n\nAnswer: {answer}"
- Chat tasks: "{history}\nSpeaker 1: {query}\nSpeaker 2: {answer}"
- ICL tasks: "{few_shot_examples}\n{query}\n{answer}"
- LRLM tasks: Uses pre-tokenized inputs for long-range language modeling
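As a rough sketch of how the text templates listed above could be applied, the snippet below fills one template per task. The task keys (`"qa"`, `"chat"`, `"icl"`), the `TEMPLATES` dictionary, and `build_prompt` are hypothetical names chosen for illustration; the LRLM case is omitted because it works on pre-tokenized inputs rather than a text template.

```python
# Templates copied from the list above; "key" stands for the retrieved item
# (passage, chat history, or few-shot examples, depending on the task).
TEMPLATES = {
    "qa": "Knowledge: {key}\n\nQuestion: {query}\n\nAnswer: {answer}",
    "chat": "{key}\nSpeaker 1: {query}\nSpeaker 2: {answer}",
    "icl": "{key}\n{query}\n{answer}",
}


def build_prompt(task, key, query, answer):
    """Fill the template for `task` (hypothetical helper)."""
    try:
        template = TEMPLATES[task]
    except KeyError:
        raise ValueError(f"unsupported task: {task!r}") from None
    return template.format(key=key, query=query, answer=answer)


prompt = build_prompt(
    "qa",
    key="Machine learning is...",
    query="What is machine learning?",
    answer="A field of AI...",
)
```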
The process_lm_scoring() function handles tokenization and label preparation. It masks every token except the answer portion, so only answer tokens contribute to the NLL. The score therefore reflects how likely the answer is given the query and passage, not how likely the query or passage text itself is.
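The masking idea can be illustrated as follows. This is a sketch under assumptions: it works on plain lists of token IDs instead of a real tokenizer, `build_inputs_and_labels` is a hypothetical helper, and the use of -100 as the ignore value is assumed because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed ignore value; PyTorch's cross-entropy skips it by default


def build_inputs_and_labels(context_ids, answer_ids):
    """Concatenate context and answer; only answer positions keep a label."""
    input_ids = list(context_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(answer_ids)
    return input_ids, labels


# Hypothetical token IDs for "query + passage" and for the answer.
context_ids = [11, 12, 13, 14]
answer_ids = [21, 22]
input_ids, labels = build_inputs_and_labels(context_ids, answer_ids)
```

With labels built this way, a loss that ignores `IGNORE_INDEX` is computed over the answer tokens alone.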
Usage
Use this to generate teacher scores from a strong language model for distilling retrieval knowledge into smaller embedding models.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/run_lm_score.py
- Lines: 1-240
Signature
def process_lm_scoring(tokenizer, key_max_length=512)
def collate_scores(eval_data, save_name)
def main() # Entry point with ScoreArgs
Import
from research.llm_embedder.run_lm_score import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to JSON with query, pos, neg, answers fields |
| model_name_or_path | str | Yes | Language model for scoring (e.g., LLaMA, GPT) |
| key_max_length | int | No | Max length for truncating passages (default: 512) |
| lm_batch_size | int | No | Batch size for LM inference (default: 4) |
| save_name | str | No | Name for output file (default: "llama2-7b-chat") |
Outputs
| Name | Type | Description |
|---|---|---|
| scored_file | JSONL | Original data with added "teacher_scores" field for each query |
| query_ids | List | Query IDs |
| scores | List | NLL scores for each passage per query |
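A minimal sketch of how flat `query_ids` and `scores` lists might be collated into a per-query "teacher_scores" field is shown below. The helper names are hypothetical, and two things are assumed for illustration: a query ID is the index of its record, and scores arrive in candidate order.

```python
from collections import defaultdict


def group_scores(query_ids, scores):
    """Group a flat list of scores by query ID, preserving arrival order."""
    grouped = defaultdict(list)
    for qid, score in zip(query_ids, scores):
        grouped[qid].append(score)
    return dict(grouped)


def attach_teacher_scores(records, query_ids, scores):
    """Return copies of `records` with a 'teacher_scores' list added."""
    grouped = group_scores(query_ids, scores)
    return [
        dict(record, teacher_scores=grouped.get(index, []))
        for index, record in enumerate(records)
    ]


records = [
    {"query": "q0", "pos": ["p"], "neg": ["n1", "n2"]},
    {"query": "q1", "pos": ["p"], "neg": ["n1"]},
]
query_ids = [0, 0, 0, 1, 1]
scores = [-2.34, -4.56, -5.12, -1.0, -3.0]
scored = attach_teacher_scores(records, query_ids, scores)
```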
Usage Examples
# Command line usage
python research/llm_embedder/run_lm_score.py \
--eval_data train_data.json \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--key_max_length 512 \
--lm_batch_size 4 \
--save_name llama2-7b-chat \
--lm_dtype bf16
# Input format (train_data.json):
# {"query": "What is machine learning?",
# "answers": ["A field of AI..."],
# "pos": ["Machine learning is..."],
# "neg": ["Deep learning...", "AI is..."]}
# Output format (train_data.scored.llama2-7b-chat.json):
# {"query": "What is machine learning?",
# "answers": ["A field of AI..."],
# "pos": ["Machine learning is..."],
# "neg": ["Deep learning...", "AI is..."],
# "teacher_scores": [-2.34, -4.56, -5.12]} # Higher (less negative) = better
# These scores can then be used for distillation:
# python train.py --train_data train_data.scored.llama2-7b-chat.json
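One common way to consume teacher scores in distillation is to convert them into a soft target distribution with a temperature-scaled softmax, which a student retriever can then be trained to match. The sketch below shows only that conversion. How train.py actually uses "teacher_scores" is not described here, so `soft_targets` and its `temperature` parameter are assumptions for illustration.

```python
import math


def soft_targets(teacher_scores, temperature=1.0):
    """Temperature-scaled softmax over one query's teacher scores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in teacher_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Scores from the example output above: one positive, then two negatives.
targets = soft_targets([-2.34, -4.56, -5.12])
```

A higher temperature flattens the distribution, while a lower one concentrates the probability mass on the best-scoring passage.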