Implementation: FlagOpen FlagEmbedding Reinforced IR Generate Distill Data
| Field | Value |
|---|---|
| Domains | Information Retrieval, Knowledge Distillation, Data Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Generates distillation scores from teacher LLMs for improving retrieval model training data.
Description
This script enhances existing retrieval training data by adding teacher scores from a large language model. It reads training data in BGE format (query, positive passages, negative passages), constructs ranking prompts for each query-passage combination, and uses a teacher LLM to generate relevance scores. These scores are then integrated back into the training data for knowledge distillation during retrieval model fine-tuning.
The pipeline tokenizes passages to ensure they fit within the model's context window, formats them as numbered lists with the query, and prompts the teacher LLM to rank them. The teacher's ranking is parsed and converted to scores that are added to the training data under pos_scores and neg_scores fields. This allows student retrieval models to learn from the teacher's nuanced understanding of relevance beyond binary labels.
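The actual scoring logic lives in the repository's `get_distill_data` helper; as an illustration of the rank-to-score step described above, the sketch below maps a parsed teacher ranking onto per-passage scores with a simple linear decay. The function name `ranking_to_scores` and the scoring formula are illustrative assumptions, not the repository's implementation.

```python
def ranking_to_scores(ranking, num_passages):
    """Convert a teacher ranking (best first) into per-passage scores.

    ranking: list of passage indices as parsed from the teacher output,
             e.g. [2, 0, 1] means passage 2 was ranked most relevant.
    Returns a list of floats aligned with the original passage order,
    where higher means more relevant (linear decay over ranks).
    """
    scores = [0.0] * num_passages
    for rank, idx in enumerate(ranking):
        # Rank 0 gets 1.0; later ranks decay linearly toward 0.
        scores[idx] = 1.0 - rank / num_passages
    return scores

# One positive followed by two negatives, as in BGE-format training data:
all_scores = ranking_to_scores([0, 2, 1], num_passages=3)
pos_scores = all_scores[:1]   # scores for the "pos" passages
neg_scores = all_scores[1:]   # scores for the "neg" passages
```

Splitting the aligned score list at the positive/negative boundary keeps `pos_scores` and `neg_scores` in the same order as the `pos` and `neg` passage lists.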
Usage
Use this script to augment retrieval training data with teacher model scores for knowledge distillation, improving the quality of learned relevance judgments.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/data_generation/generate_retriever_distill_data.py
- Lines: 1-119
Signature
def main(opt):
"""Main function to add distillation scores to training data"""
def parse_option():
"""Parse command line arguments"""
Import
import argparse
import json
from transformers import AutoTokenizer
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import rank_prompt
from utils import get_distill_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to teacher LLM for scoring |
| dataset_path | str | Yes | Path to datasets directory |
| output_dir | str | Yes | Directory containing the train.jsonl files to read and update |
| dataset_name | str | No | Specific dataset to process (default: all) |
| temperature | float | No | LLM generation temperature (default: 0.2) |
| max_tokens | int | No | Max tokens for LLM response (default: 300) |
| model_type | str | Yes | Type of LLM (llm, llm_instruct, gpt) |
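A `parse_option` consistent with the table above could look like the following sketch. The argument names, defaults, and choices mirror the documented flags; the optional `argv` parameter is an illustrative convenience, and the real script may define additional flags.

```python
import argparse

def parse_option(argv=None):
    """Parse command line arguments (sketch mirroring the inputs table)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path to teacher LLM for scoring")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets directory")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory with train.jsonl files")
    parser.add_argument("--dataset_name", type=str, default=None,
                        help="Specific dataset to process (default: all)")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="LLM generation temperature")
    parser.add_argument("--max_tokens", type=int, default=300,
                        help="Max tokens for LLM response")
    parser.add_argument("--model_type", type=str, required=True,
                        choices=["llm", "llm_instruct", "gpt"],
                        help="Type of LLM")
    return parser.parse_args(argv)
```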
Outputs
| Name | Type | Description |
|---|---|---|
| train.jsonl | JSONL | Updated training data with pos_scores and neg_scores fields |
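The update itself is a JSONL round trip. Assuming the BGE record format described above, a minimal sketch might look like the following, with a placeholder `score_passages` standing in for the teacher LLM call; both function names are illustrative, not the script's own.

```python
import json

def score_passages(query, passages):
    """Placeholder for the teacher LLM call; returns one score per passage.

    In the real script, a ranking prompt is sent to the teacher model and
    its ranking is parsed into scores. Here we return dummy values.
    """
    return [0.5] * len(passages)

def add_distill_scores(in_path, out_path):
    """Read BGE-format train.jsonl, attach pos_scores/neg_scores, write out."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            item = json.loads(line)
            item["pos_scores"] = score_passages(item["query"], item["pos"])
            item["neg_scores"] = score_passages(item["query"], item["neg"])
            fout.write(json.dumps(item) + "\n")
```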
Usage Examples
# Command line usage
python generate_retriever_distill_data.py \
--generate_model_path Meta-Llama-3-70B-Instruct \
--model_type llm_instruct \
--dataset_path ./data \
--output_dir ./synthetic \
--temperature 0.2 \
--max_tokens 300
# Input train.jsonl format:
{
"query": "What is machine learning?",
"pos": ["ML is a type of AI..."],
"neg": ["Neural networks...", "Deep learning..."]
}
# Output train.jsonl format (augmented):
{
"query": "What is machine learning?",
"pos": ["ML is a type of AI..."],
"pos_scores": [0.95],
"neg": ["Neural networks...", "Deep learning..."],
"neg_scores": [0.65, 0.55]
}
# The rank_prompt formats passages as:
# Given query and N passages, rank them by relevance
# Query: {query}
# Passages:
# [0] {passage_0}
# [1] {passage_1}
# ...
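The numbered-list formatting and length control described above can be sketched as follows. The real script truncates passages with the model's `AutoTokenizer` so the prompt fits the context window; this illustration substitutes a simple word budget (`max_words`), and the function name `build_rank_prompt` is an assumption, not the repository's `rank_prompt`.

```python
def build_rank_prompt(query, passages, max_words=100):
    """Format a query and candidate passages as a numbered ranking prompt.

    Each passage is truncated to a word budget so the full prompt stays
    within the model's context window (the real script truncates by
    tokenizer tokens, not words).
    """
    lines = [
        f"Given query and {len(passages)} passages, rank them by relevance",
        f"Query: {query}",
        "Passages:",
    ]
    for i, passage in enumerate(passages):
        truncated = " ".join(passage.split()[:max_words])
        lines.append(f"[{i}] {truncated}")
    return "\n".join(lines)
```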