Implementation: FlagOpen FlagEmbedding Reinforced IR Generate Distill Data
| Field | Value |
|---|---|
| Domains | Information Retrieval, Knowledge Distillation, Data Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Generates distillation scores from teacher LLMs for improving retrieval model training data.
Description
This script enhances existing retrieval training data by adding teacher scores from a large language model. It reads training data in BGE format (query, positive passages, negative passages), constructs ranking prompts for each query-passage combination, and uses a teacher LLM to generate relevance scores. These scores are then integrated back into the training data for knowledge distillation during retrieval model fine-tuning.
The pipeline tokenizes passages to ensure they fit within the model's context window, formats them as numbered lists with the query, and prompts the teacher LLM to rank them. The teacher's ranking is parsed and converted to scores that are added to the training data under pos_scores and neg_scores fields. This allows student retrieval models to learn from the teacher's nuanced understanding of relevance beyond binary labels.
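The actual scoring logic lives in the repository's `get_distill_data` helper; as an illustration of the rank-to-score step described above, the sketch below maps a parsed teacher ranking onto per-passage scores with a simple linear decay. The function name `ranking_to_scores` and the scoring formula are illustrative assumptions, not the repository's implementation.

```python
def ranking_to_scores(ranking, num_passages):
    """Convert a teacher ranking (best first) into per-passage scores.

    ranking: list of passage indices as parsed from the teacher output,
             e.g. [2, 0, 1] means passage 2 was ranked most relevant.
    Returns a list of floats aligned with the original passage order,
    where higher means more relevant (linear decay over ranks).
    """
    scores = [0.0] * num_passages
    for rank, idx in enumerate(ranking):
        # Rank 0 gets 1.0; later ranks decay linearly toward 0.
        scores[idx] = 1.0 - rank / num_passages
    return scores

# One positive followed by two negatives, as in BGE-format training data:
all_scores = ranking_to_scores([0, 2, 1], num_passages=3)
pos_scores = all_scores[:1]   # scores for the "pos" passages
neg_scores = all_scores[1:]   # scores for the "neg" passages
```

Splitting the aligned score list at the positive/negative boundary keeps `pos_scores` and `neg_scores` in the same order as the `pos` and `neg` passage lists.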
Usage
Use this script to augment retrieval training data with teacher model scores for knowledge distillation, improving the quality of learned relevance judgments.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/data_generation/generate_retriever_distill_data.py
- Lines: 1-119
Signature
def main(opt):
"""Main function to add distillation scores to training data"""
def parse_option():
"""Parse command line arguments"""
Import
import argparse
import json
from transformers import AutoTokenizer
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import rank_prompt
from utils import get_distill_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to teacher LLM for scoring |
| dataset_path | str | Yes | Path to datasets directory |
| output_dir | str | Yes | Directory containing the train.jsonl files to read and update |
| dataset_name | str | No | Specific dataset to process (default: all) |
| temperature | float | No | LLM generation temperature (default: 0.2) |
| max_tokens | int | No | Max tokens for LLM response (default: 300) |
| model_type | str | Yes | Type of LLM (llm, llm_instruct, gpt) |
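A `parse_option` consistent with the table above could look like the following sketch. The argument names, defaults, and choices mirror the documented flags; the optional `argv` parameter is an illustrative convenience, and the real script may define additional flags.

```python
import argparse

def parse_option(argv=None):
    """Parse command line arguments (sketch mirroring the inputs table)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path to teacher LLM for scoring")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets directory")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory with train.jsonl files")
    parser.add_argument("--dataset_name", type=str, default=None,
                        help="Specific dataset to process (default: all)")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="LLM generation temperature")
    parser.add_argument("--max_tokens", type=int, default=300,
                        help="Max tokens for LLM response")
    parser.add_argument("--model_type", type=str, required=True,
                        choices=["llm", "llm_instruct", "gpt"],
                        help="Type of LLM")
    return parser.parse_args(argv)
```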
Outputs
| Name | Type | Description |
|---|---|---|
| train.jsonl | JSONL | Updated training data with pos_scores and neg_scores fields |
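The update itself is a JSONL round trip. Assuming the BGE record format described above, a minimal sketch might look like the following, with a placeholder `score_passages` standing in for the teacher LLM call; both function names are illustrative, not the script's own.

```python
import json

def score_passages(query, passages):
    """Placeholder for the teacher LLM call; returns one score per passage.

    In the real script, a ranking prompt is sent to the teacher model and
    its ranking is parsed into scores. Here we return dummy values.
    """
    return [0.5] * len(passages)

def add_distill_scores(in_path, out_path):
    """Read BGE-format train.jsonl, attach pos_scores/neg_scores, write out."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            item = json.loads(line)
            item["pos_scores"] = score_passages(item["query"], item["pos"])
            item["neg_scores"] = score_passages(item["query"], item["neg"])
            fout.write(json.dumps(item) + "\n")
```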
Usage Examples
# Command line usage
python generate_retriever_distill_data.py \
--generate_model_path Meta-Llama-3-70B-Instruct \
--model_type llm_instruct \
--dataset_path ./data \
--output_dir ./synthetic \
--temperature 0.2 \
--max_tokens 300
# Input train.jsonl format:
{
"query": "What is machine learning?",
"pos": ["ML is a type of AI..."],
"neg": ["Neural networks...", "Deep learning..."]
}
# Output train.jsonl format (augmented):
{
"query": "What is machine learning?",
"pos": ["ML is a type of AI..."],
"pos_scores": [0.95],
"neg": ["Neural networks...", "Deep learning..."],
"neg_scores": [0.65, 0.55]
}
# The rank_prompt formats passages as:
# Given query and N passages, rank them by relevance
# Query: {query}
# Passages:
# [0] {passage_0}
# [1] {passage_1}
# ...
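The numbered-list formatting and length control described above can be sketched as follows. The real script truncates passages with the model's `AutoTokenizer` so the prompt fits the context window; this illustration substitutes a simple word budget (`max_words`), and the function name `build_rank_prompt` is an assumption, not the repository's `rank_prompt`.

```python
def build_rank_prompt(query, passages, max_words=100):
    """Format a query and candidate passages as a numbered ranking prompt.

    Each passage is truncated to a word budget so the full prompt stays
    within the model's context window (the real script truncates by
    tokenizer tokens, not words).
    """
    lines = [
        f"Given query and {len(passages)} passages, rank them by relevance",
        f"Query: {query}",
        "Passages:",
    ]
    for i, passage in enumerate(passages):
        truncated = " ".join(passage.split()[:max_words])
        lines.append(f"[{i}] {truncated}")
    return "\n".join(lines)
```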