Implementation: FlagOpen FlagEmbedding Reinforced IR Generate Retriever Data
| Knowledge Sources | Details |
|---|---|
| Domains | Information Retrieval, Data Generation, Embedder Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Generates training data for retrieval models using LLM-generated query augmentations.
Description
This script generates training data for fine-tuning dense retrieval models by leveraging LLM-generated query augmentations. It follows a multi-stage pipeline: first loading corpus data, then using an LLM to generate synthetic queries for each passage, and finally using a retrieval model to mine hard negatives from the corpus. The augmentation step generates additional context that transforms original queries into more informative representations.
The pipeline supports optional data filtering to limit training set size and can handle complex dataset structures like CQADupStack. It uses the FlagModel retrieval system to encode queries and passages, compute similarity scores, and select appropriate negative samples. The script generates training data in BGE format with query, positive passage(s), and hard negative passages, with the augmented queries prefixed with "Generate the topic about this passage:".
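The hard-negative mining step described above can be sketched as follows. This is an illustrative sketch only: `cosine` and the toy vectors stand in for the FlagModel encoder and its similarity scores, and `mine_hard_negatives` is a hypothetical helper, not the script's actual function.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_vec, corpus_vecs, positive_idx, k=2):
    """Rank corpus passages by similarity to the query and keep the
    top-k most similar ones that are not the positive passage."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored if idx != positive_idx][:k]

# Toy 2-d vectors standing in for real passage embeddings.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
negs = mine_hard_negatives([1.0, 0.0], corpus, positive_idx=0, k=2)
```

The highest-scoring non-positive passages are exactly the "hard" negatives: lexically or semantically close to the query, so they give the embedder the most informative contrastive signal.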
Usage
Use this script to create high-quality training data for retrieval models by augmenting queries with LLM-generated context and mining hard negatives from the corpus.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/data_generation/generate_retriever_data.py
- Lines: 1-184
Signature
def main(opt):
"""Main function to generate retrieval training data"""
def parse_option():
"""Parse command line arguments"""
Import
import argparse
import json
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import get_additional_info_generation_prompt
from FlagEmbedding import FlagModel
from utils import generate_bge_train_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to LLM for query augmentation |
| retrieval_model_name | str | Yes | Retrieval model for hard negative mining |
| dataset_path | str | Yes | Path to datasets with corpus.json files |
| output_dir | str | Yes | Directory with queries.json for processing |
| filter_data | bool | No | Whether to filter/limit training data |
| filter_num | int | No | Number of examples to keep per dataset |
| temperature | float | No | LLM generation temperature (default: 0.2) |
| max_tokens | int | No | Max tokens per generation (default: 300) |
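The inputs table maps onto a `parse_option` function roughly like the sketch below. Flag names and defaults are taken from the table and the example invocation; the real script defines additional flags (e.g. `--model_type`, `--batch_size`, `--neg_type`) and may differ in details. Note the string-to-bool conversion: argparse has no native boolean `type`, so `--filter_data False` would otherwise be truthy.

```python
import argparse

def parse_option():
    """Sketch of an argument parser mirroring the inputs table."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--generate_model_path', type=str, required=True)
    parser.add_argument('--retrieval_model_name', type=str, required=True)
    parser.add_argument('--dataset_path', type=str, required=True)
    parser.add_argument('--output_dir', type=str, required=True)
    # argparse has no bool type: parse the literal strings "true"/"True".
    parser.add_argument('--filter_data', type=lambda s: s.lower() == 'true',
                        default=False)
    parser.add_argument('--filter_num', type=int, default=None)
    parser.add_argument('--temperature', type=float, default=0.2)
    parser.add_argument('--max_tokens', type=int, default=300)
    return parser

opt = parse_option().parse_args([
    '--generate_model_path', 'Meta-Llama-3-8B',
    '--retrieval_model_name', 'BAAI/bge-large-en-v1.5',
    '--dataset_path', './data',
    '--output_dir', './synthetic',
])
```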
Outputs
| Name | Type | Description |
|---|---|---|
| answers.json | JSON | Query-augmentation pairs |
| train.jsonl | JSONL | Training data with query, pos, neg fields for BGE |
Usage Examples
# Command line usage
python generate_retriever_data.py \
--generate_model_path Meta-Llama-3-8B \
--model_type llm_instruct \
--retrieval_model_name BAAI/bge-large-en-v1.5 \
--dataset_path ./data \
--output_dir ./synthetic \
--filter_data False \
--temperature 0.2 \
--max_tokens 300 \
--batch_size 1024 \
--neg_type 95neg
# Input queries.json format:
[
{
"query": "What is deep learning?",
"passage": "Deep learning is a subset of ML..."
}
]
# Output train.jsonl format:
{
"query": "Generate the topic about this passage: Deep learning involves neural networks...",
"pos": ["Deep learning is a subset of ML..."],
"neg": ["Negative passage 1", "Negative passage 2", ...]
}