# Implementation: FlagOpen FlagEmbedding / Reinforced IR / Generate Generator Data
| Field | Value |
|---|---|
| Domains | Information Retrieval, Data Generation, Direct Preference Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
Generates DPO training data for query augmentation models in the Reinforced IR pipeline.
## Description
This script generates Direct Preference Optimization (DPO) training data for fine-tuning language models to produce effective query augmentations. It uses a two-stage process: first generating multiple candidate augmentations (answers) for each query using an LLM, then using a retrieval model to rank these augmentations based on how well they retrieve the target passage.
The pipeline loads existing queries from the synthetic data directory, generates N different augmentations per query (controlled by `dpo_num`), and evaluates each augmentation's effectiveness with a retrieval model. It then constructs DPO training pairs in which higher-scoring augmentations are marked as "chosen" and lower-scoring ones as "rejected". The script supports multiple LLM types (local or API-based) and can process multiple datasets with configurable thresholds and rules.
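The chosen/rejected pairing step can be sketched as follows. This is an illustrative sketch, not the repository's actual code: the function name `build_dpo_pairs`, the margin-style use of `threshold`, and the score semantics are all assumptions made for the example.

```python
def build_dpo_pairs(prompt, augmentations, scores, threshold=0.95):
    """Pair augmentations so each "chosen" clearly outscores its "rejected".

    augmentations: candidate answers generated for one query
    scores: retrieval score of the target passage when the query is
            combined with each candidate augmentation
    threshold: keep a pair only when the rejected score falls below
               threshold * chosen score (a clear quality gap)
    """
    # Rank candidates from best-retrieving to worst-retrieving.
    ranked = sorted(zip(augmentations, scores), key=lambda x: x[1], reverse=True)
    pairs = []
    for i, (chosen, s_hi) in enumerate(ranked):
        for rejected, s_lo in ranked[i + 1:]:
            if s_lo < threshold * s_hi:  # enforce the quality gap
                pairs.append({"prompt": prompt,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs


pairs = build_dpo_pairs(
    "Generate additional info for: What is machine learning?",
    ["good answer", "ok answer", "poor answer"],
    [0.92, 0.88, 0.40],
)
```

With these scores, the "ok answer" is too close to the "good answer" (0.88 is above 0.95 × 0.92), so only pairs against the clearly worse candidate survive.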
## Usage
Use this script to create training data for fine-tuning query augmentation models that help improve retrieval performance by generating informative context for queries.
## Code Reference

### Source Location

- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/data_generation/generate_generator_data.py
- Lines: 1-200
### Signature

```python
def main(opt):
    """Main function to generate DPO training data for query augmentation"""

def parse_option():
    """Parse command line arguments"""
```
### Import

```python
import argparse
import json

from FlagEmbedding import FlagModel

from agent import GPTAgent, LLMAgent, LLMInstructAgent
from utils import generate_llm_dpo_train_data
from prompts import get_additional_info_generation_prompt
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to LLM for generating augmentations |
| retrieval_model_name | str | Yes | Retrieval model for evaluating augmentations |
| dataset_path | str | Yes | Path to datasets directory |
| output_dir | str | Yes | Directory containing the existing queries.json files |
| dpo_num | int | Yes | Number of augmentation candidates per query |
| threshold | float | Yes | Score threshold for DPO pair selection |
| temperature | float | No | Generation temperature (default: 0.2) |
| max_tokens | int | No | Max tokens per generation (default: 300) |
| model_type | str | No | Agent type for generation (the usage example passes `llm_instruct`) |
| batch_size | int | No | Batch size for model inference (the usage example passes 1024) |
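The inputs above imply a CLI surface. A minimal sketch of the argument parser, with flag names and defaults taken from the table and everything else (function name `build_parser`, help strings) assumed for illustration:

```python
import argparse


def build_parser():
    """Sketch of the CLI implied by the I/O contract (not the repo's code)."""
    parser = argparse.ArgumentParser("generate_generator_data")
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path to LLM for generating augmentations")
    parser.add_argument("--retrieval_model_name", type=str, required=True,
                        help="Retrieval model for evaluating augmentations")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets directory")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory containing queries.json files")
    parser.add_argument("--dpo_num", type=int, required=True,
                        help="Number of augmentation candidates per query")
    parser.add_argument("--threshold", type=float, required=True,
                        help="Score threshold for DPO pair selection")
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--max_tokens", type=int, default=300)
    return parser


opt = build_parser().parse_args([
    "--generate_model_path", "Meta-Llama-3-8B",
    "--retrieval_model_name", "BAAI/bge-large-en-v1.5",
    "--dataset_path", "./data",
    "--output_dir", "./synthetic",
    "--dpo_num", "10",
    "--threshold", "0.95",
])
```

Optional flags fall back to their documented defaults when omitted, so `opt.temperature` is 0.2 and `opt.max_tokens` is 300 here.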
### Outputs
| Name | Type | Description |
|---|---|---|
| answers.json | JSON | Multiple augmentation candidates per query |
| train.jsonl | JSONL | DPO training data with prompt, chosen, rejected fields |
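The two outputs can be persisted as shown in this sketch (illustrative only; the function name `write_outputs` and the exact record shapes are assumptions, but the file names and formats match the table above):

```python
import json
import os


def write_outputs(out_dir, answers_per_query, dpo_pairs):
    """Write the two artifacts documented in the I/O contract.

    answers.json  - all augmentation candidates per query (a JSON array)
    train.jsonl   - one DPO example (prompt/chosen/rejected) per line
    """
    with open(os.path.join(out_dir, "answers.json"), "w", encoding="utf-8") as f:
        json.dump(answers_per_query, f, ensure_ascii=False, indent=2)
    with open(os.path.join(out_dir, "train.jsonl"), "w", encoding="utf-8") as f:
        for pair in dpo_pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

JSONL (one JSON object per line) is the conventional format for DPO training data because trainers can stream it line by line without loading the whole file.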
## Usage Examples

```shell
# Command line usage
python generate_generator_data.py \
    --generate_model_path Meta-Llama-3-8B \
    --model_type llm_instruct \
    --retrieval_model_name BAAI/bge-large-en-v1.5 \
    --dataset_path ./data \
    --output_dir ./synthetic \
    --dpo_num 10 \
    --threshold 0.95 \
    --temperature 0.2 \
    --max_tokens 300 \
    --batch_size 1024
```
Expected `queries.json` format:

```json
[
  {
    "query": "What is machine learning?",
    "passage": "Machine learning is a subset of AI..."
  }
]
```
Output `train.jsonl` format (one JSON object per line):

```json
{
  "prompt": "Generate additional info for: What is machine learning?",
  "chosen": "ML is a technique that enables computers to learn...",
  "rejected": "Less relevant or lower-scoring augmentation..."
}
```