
Implementation:FlagOpen FlagEmbedding Reinforced IR Generate Distill Data

From Leeroopedia


Knowledge Sources
Domains Information Retrieval, Knowledge Distillation, Data Generation
Last Updated 2026-02-09 00:00 GMT

Overview

Generates distillation scores from teacher LLMs for improving retrieval model training data.

Description

This script enhances existing retrieval training data by adding teacher scores from a large language model. It reads training data in BGE format (query, positive passages, negative passages), constructs ranking prompts for each query-passage combination, and uses a teacher LLM to generate relevance scores. These scores are then integrated back into the training data for knowledge distillation during retrieval model fine-tuning.

The pipeline tokenizes passages to ensure they fit within the model's context window, formats them as numbered lists with the query, and prompts the teacher LLM to rank them. The teacher's ranking is parsed and converted to scores that are added to the training data under pos_scores and neg_scores fields. This allows student retrieval models to learn from the teacher's nuanced understanding of relevance beyond binary labels.
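The parsing step described above can be sketched as follows. The `ranking_to_scores` helper and its linear decay scheme are illustrative assumptions; the repository's actual conversion logic lives in `utils.get_distill_data` and may differ.

```python
import re

def ranking_to_scores(ranking_text, num_passages):
    """Parse a teacher ranking like '[2] > [0] > [1]' into per-passage
    scores, where an earlier rank means higher relevance."""
    indices = [int(i) for i in re.findall(r"\[(\d+)\]", ranking_text)]
    # Keep only valid indices, then fall back to original order for any
    # passage the teacher omitted from its ranking.
    seen = []
    for i in indices:
        if i < num_passages and i not in seen:
            seen.append(i)
    seen += [i for i in range(num_passages) if i not in seen]
    scores = [0.0] * num_passages
    for rank, idx in enumerate(seen):
        # Linear decay: top-ranked passage gets 1.0, last gets 1/n.
        scores[idx] = (num_passages - rank) / num_passages
    return scores
```

Scores produced this way are relative to the passage set for one query, which is sufficient for distillation losses that compare teacher and student rankings rather than absolute relevance.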

Usage

Use this script to augment retrieval training data with teacher model scores for knowledge distillation, improving the quality of learned relevance judgments.

Code Reference

Source Location

Signature

def main(opt):
    """Main function to add distillation scores to training data"""

def parse_option():
    """Parse command line arguments"""

Import

import argparse
import json
from transformers import AutoTokenizer
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import rank_prompt
from utils import get_distill_data

I/O Contract

Inputs

Name                 Type   Required  Description
generate_model_path  str    Yes       Path to teacher LLM for scoring
dataset_path         str    Yes       Path to datasets directory
output_dir           str    Yes       Directory containing the train.jsonl files to update
dataset_name         str    No        Specific dataset to process (default: all)
temperature          float  No        LLM generation temperature (default: 0.2)
max_tokens           int    No        Max tokens for LLM response (default: 300)
model_type           str    Yes       Type of LLM (llm, llm_instruct, gpt)
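A minimal `parse_option` consistent with the table above might look like this. Flag names and defaults mirror the I/O contract; the `argv` parameter and help strings are assumptions for illustration.

```python
import argparse

def parse_option(argv=None):
    """Sketch of the CLI contract; pass argv=None to read sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path to teacher LLM for scoring")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets directory")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory containing the train.jsonl files to update")
    parser.add_argument("--dataset_name", type=str, default=None,
                        help="Specific dataset to process (default: all)")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="LLM generation temperature")
    parser.add_argument("--max_tokens", type=int, default=300,
                        help="Max tokens for LLM response")
    parser.add_argument("--model_type", type=str, required=True,
                        choices=["llm", "llm_instruct", "gpt"],
                        help="Type of LLM backend")
    return parser.parse_args(argv)
```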

Outputs

Name         Type   Description
train.jsonl  JSONL  Updated training data with pos_scores and neg_scores fields

Usage Examples

# Command line usage
python generate_retriever_distill_data.py \
    --generate_model_path Meta-Llama-3-70B-Instruct \
    --model_type llm_instruct \
    --dataset_path ./data \
    --output_dir ./synthetic \
    --temperature 0.2 \
    --max_tokens 300

# Input train.jsonl format:
{
    "query": "What is machine learning?",
    "pos": ["ML is a type of AI..."],
    "neg": ["Neural networks...", "Deep learning..."]
}

# Output train.jsonl format (augmented):
{
    "query": "What is machine learning?",
    "pos": ["ML is a type of AI..."],
    "pos_scores": [0.95],
    "neg": ["Neural networks...", "Deep learning..."],
    "neg_scores": [0.65, 0.55]
}

# The rank_prompt formats passages as:
# Given query and N passages, rank them by relevance
# Query: {query}
# Passages:
# [0] {passage_0}
# [1] {passage_1}
# ...
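The prompt layout sketched above could be assembled as follows. `build_rank_prompt` is a hypothetical helper with assumed wording; the actual template is `prompts.rank_prompt` and its exact phrasing may differ.

```python
def build_rank_prompt(query, passages):
    """Format a query and numbered passages into a ranking prompt,
    following the layout described in the comments above."""
    lines = [
        f"Given the query and {len(passages)} passages, "
        "rank the passages by relevance to the query.",
        f"Query: {query}",
        "Passages:",
    ]
    # Number each passage so the teacher's ranking can reference it.
    lines += [f"[{i}] {p}" for i, p in enumerate(passages)]
    return "\n".join(lines)
```

In the real pipeline, each passage would first be truncated with the `AutoTokenizer` so the assembled prompt fits the teacher model's context window.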
