
Implementation:FlagOpen FlagEmbedding Reinforced IR Generate Retriever Data

From Leeroopedia


Knowledge Sources
Domains: Information Retrieval, Data Generation, Embedder Training
Last Updated: 2026-02-09 00:00 GMT

Overview

Generates training data for retrieval models using LLM-generated query augmentations.

Description

This script generates training data for fine-tuning dense retrieval models by leveraging LLM-generated query augmentations. It follows a multi-stage pipeline: first loading corpus data, then using an LLM to generate synthetic queries for each passage, and finally using a retrieval model to mine hard negatives from the corpus. The augmentation step generates additional context that transforms original queries into more informative representations.
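The three-stage flow above can be sketched as follows. All names here (`augment_queries`, `run_pipeline`, `generate_fn`, `mine_negatives_fn`) are hypothetical stand-ins for the script's actual agent and retrieval calls, and the prompt text is an assumption rather than the repo's `get_additional_info_generation_prompt` template:

```python
# Sketch of the three-stage pipeline (hypothetical helper names).

def augment_queries(corpus, generate_fn):
    """Stage 2: ask an LLM to generate a synthetic query/augmentation per passage."""
    pairs = []
    for passage in corpus:
        # Assumed prompt shape; the real template comes from prompts.py.
        prompt = f"Generate a search query and additional context for this passage:\n{passage}"
        pairs.append({"query": generate_fn(prompt), "passage": passage})
    return pairs

def run_pipeline(corpus, generate_fn, mine_negatives_fn):
    """Stage 1 (corpus already loaded) -> Stage 2 (augment) -> Stage 3 (mine negatives)."""
    pairs = augment_queries(corpus, generate_fn)
    train = []
    for p in pairs:
        negs = mine_negatives_fn(p["query"], p["passage"], corpus)
        train.append({"query": p["query"], "pos": [p["passage"]], "neg": negs})
    return train
```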

The pipeline supports optional data filtering to limit training set size and can handle complex dataset structures like CQADupStack. It uses the FlagModel retrieval system to encode queries and passages, compute similarity scores, and select appropriate negative samples. The script generates training data in BGE format with query, positive passage(s), and hard negative passages, with the augmented queries prefixed with "Generate the topic about this passage:".
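Given that description, a single BGE-format record might be assembled like this. The prefix string is quoted from the source; `to_bge_record` itself is an illustrative helper, not the script's actual function:

```python
# The instruction prefix applied to augmented queries, per the description above.
PREFIX = "Generate the topic about this passage: "

def to_bge_record(augmented_query, positive, negatives):
    """Format one BGE-style training example with query, pos, and neg fields."""
    return {
        "query": PREFIX + augmented_query,
        "pos": [positive],
        "neg": list(negatives),
    }
```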

Usage

Use this script to create high-quality training data for retrieval models by augmenting queries with LLM-generated context and mining hard negatives from the corpus.
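Hard negative mining of the kind described can be sketched with plain cosine similarity: rank the corpus against the query and keep the highest-scoring passages that are not the positive. This is a generic top-k illustration; the script's actual selection strategy (e.g. what --neg_type 95neg selects) may differ:

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, corpus, positive_idx, k=5):
    """Return the k corpus passages most similar to the query, excluding
    the known positive. A generic sketch of hard-negative mining."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per passage
    order = np.argsort(-scores)    # highest similarity first
    negs = [corpus[i] for i in order if i != positive_idx]
    return negs[:k]
```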

Code Reference

Source Location

Signature

def main(opt):
    """Main function to generate retrieval training data"""

def parse_option():
    """Parse command line arguments"""

Import

import argparse
import json
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import get_additional_info_generation_prompt
from FlagEmbedding import FlagModel
from utils import generate_bge_train_data

I/O Contract

Inputs

Name | Type | Required | Description
generate_model_path | str | Yes | Path to LLM for query augmentation
retrieval_model_name | str | Yes | Retrieval model for hard negative mining
dataset_path | str | Yes | Path to datasets with corpus.json files
output_dir | str | Yes | Directory with queries.json for processing
filter_data | bool | No | Whether to filter/limit training data
filter_num | int | No | Number of examples to keep per dataset
temperature | float | No | LLM generation temperature (default: 0.2)
max_tokens | int | No | Max tokens per generation (default: 300)
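A parse_option along these lines would accept the inputs listed above. Flag names and defaults follow the table; the `argv` parameter and the `str2bool` helper are additions for illustration, and the repo's exact flag handling may differ:

```python
import argparse

def str2bool(v):
    """Interpret 'True'/'False' strings, since booleans arrive as text on the CLI."""
    return str(v).lower() in ("true", "1", "yes")

def parse_option(argv=None):
    """Parse the command-line arguments listed in the I/O contract above."""
    parser = argparse.ArgumentParser("generate_retriever_data")
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path to LLM for query augmentation")
    parser.add_argument("--retrieval_model_name", type=str, required=True,
                        help="Retrieval model for hard negative mining")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets with corpus.json files")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory with queries.json for processing")
    parser.add_argument("--filter_data", type=str2bool, default=False,
                        help="Whether to filter/limit training data")
    parser.add_argument("--filter_num", type=int, default=None,
                        help="Number of examples to keep per dataset")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="LLM generation temperature")
    parser.add_argument("--max_tokens", type=int, default=300,
                        help="Max tokens per generation")
    return parser.parse_args(argv)
```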

Outputs

Name | Type | Description
answers.json | JSON | Query-augmentation pairs
train.jsonl | JSONL | Training data with query, pos, neg fields for BGE

Usage Examples

# Command line usage
python generate_retriever_data.py \
    --generate_model_path Meta-Llama-3-8B \
    --model_type llm_instruct \
    --retrieval_model_name BAAI/bge-large-en-v1.5 \
    --dataset_path ./data \
    --output_dir ./synthetic \
    --filter_data False \
    --temperature 0.2 \
    --max_tokens 300 \
    --batch_size 1024 \
    --neg_type 95neg

# Input queries.json format:
[
    {
        "query": "What is deep learning?",
        "passage": "Deep learning is a subset of ML..."
    }
]

# Output train.jsonl format:
{
    "query": "Generate the topic about this passage: Deep learning involves neural networks...",
    "pos": ["Deep learning is a subset of ML..."],
    "neg": ["Negative passage 1", "Negative passage 2", ...]
}
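For reference, a minimal writer and validator for the train.jsonl schema shown above; both helper names are illustrative, not part of the script:

```python
import json

def write_train_jsonl(records, fp):
    """Serialize BGE-format records (query/pos/neg), one JSON object per line."""
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")

def validate_record(rec):
    """Check one parsed line of train.jsonl against the schema shown above."""
    return (isinstance(rec.get("query"), str)
            and isinstance(rec.get("pos"), list)
            and isinstance(rec.get("neg"), list))
```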
