Implementation: FlagOpen FlagEmbedding Reinforced IR Generate Universal Query
| Knowledge Sources | Details |
|---|---|
| Domains | Information Retrieval, Data Generation, Query Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Generates synthetic queries from corpus passages with LLMs, applying quality-control filtering to produce Reinforced IR training data.
Description
This script generates high-quality synthetic queries from corpus passages using large language models, forming the foundation of the Reinforced IR training pipeline. It loads corpus passages from datasets, uses task-specific prompts to generate relevant queries for each passage, and applies a quality control filter to ensure generated queries are appropriate and useful.
The pipeline operates in two stages: it first generates queries using task-specific prompts (e.g., for fact verification or entity retrieval), then uses the same or a different LLM to evaluate query quality with a binary accept/reject decision. Only query-passage pairs scored "1" are retained. The script supports multiple dataset structures, including CQADupStack, can limit the number of training examples per dataset, and handles both local LLMs and API-based models.
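The two-stage flow can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `generate_and_filter` and the mock agents are assumptions standing in for the real `agent` classes (GPTAgent, LLMAgent, LLMInstructAgent) and prompt helpers.

```python
def generate_and_filter(passages, generate_fn, judge_fn):
    """Two-stage sketch: generate a query per passage, then keep only
    pairs the quality-control judge scores as "1" (accept)."""
    pairs = []
    for passage in passages:
        query = generate_fn(passage)       # stage 1: query generation
        score = judge_fn(query, passage)   # stage 2: binary quality control
        if score.strip() == "1":
            pairs.append({"query": query, "passage": passage})
    return pairs

# Mock callables standing in for LLM calls (illustrative only)
mock_generate = lambda p: f"What does this passage discuss: {p[:20]}?"
mock_judge = lambda q, p: "1" if len(q) > 10 else "0"

pairs = generate_and_filter(
    ["Passage text about machine learning..."], mock_generate, mock_judge
)
```

In the real script, the judge prompt comes from `get_quality_control_prompt` and the generation prompt from `get_query_generation_prompt`; any pair not scored "1" is simply dropped.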
Usage
Use this script as the first step in the Reinforced IR pipeline to create synthetic query-passage pairs from unlabeled corpus data, with automatic quality filtering to ensure high-quality training data.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/data_generation/generate_universal_query.py
- Lines: 1-129
Signature
def main(opt):
"""Main function to generate and filter queries"""
def parse_option():
"""Parse command line arguments"""
Import
import argparse
import json
import random
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import get_query_generation_prompt, get_quality_control_prompt
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path or name of LLM for query generation |
| dataset_path | str | Yes | Path to datasets with corpus.json files |
| output_dir | str | Yes | Directory to save queries.json files |
| train_num | int | No | Fixed number of training examples per dataset |
| train_ratio | float | No | Ratio of corpus to use for training |
| temperature | float | No | LLM generation temperature (default: 0.2) |
| max_tokens | int | No | Max tokens per generation (default: 300) |
| model_type | str | Yes | Type of LLM (llm, llm_instruct, gpt) |
| api_key | str | No | API key when model_type is gpt |
| gpu_memory_utilization | float | No | GPU memory fraction for local LLM inference |
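The interaction of train_num and train_ratio can be sketched as below. The function name and the fixed seed are assumptions for illustration; the script's actual sampling code may differ.

```python
import random

def select_training_passages(corpus, train_num=None, train_ratio=None, seed=42):
    """Pick the subset of corpus passages used for query generation.
    train_num selects a fixed count; train_ratio selects a fraction;
    with neither set, the whole corpus is used."""
    rng = random.Random(seed)
    if train_num is not None:
        k = min(train_num, len(corpus))
    elif train_ratio is not None:
        k = max(1, int(len(corpus) * train_ratio))
    else:
        return list(corpus)
    return rng.sample(corpus, k)

corpus = [f"passage {i}" for i in range(100)]
subset = select_training_passages(corpus, train_ratio=0.1)  # 10 of 100 passages
```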
Outputs
| Name | Type | Description |
|---|---|---|
| queries.json | JSON | Filtered query-passage pairs with quality score "1" |
Usage Examples
# Command line usage
python generate_universal_query.py \
--generate_model_path gpt-4o-mini \
--model_type gpt \
--api_key YOUR_API_KEY \
--dataset_path ./data \
--output_dir ./synthetic \
--train_ratio 0.1 \
--temperature 0.2 \
--max_tokens 300
# Alternative with local LLM
python generate_universal_query.py \
--generate_model_path Meta-Llama-3-8B \
--model_type llm_instruct \
--dataset_path ./data \
--output_dir ./synthetic \
--train_num 10000 \
--gpu_memory_utilization 0.8
# Input corpus.json format:
[
"Passage text about machine learning...",
"Another passage about neural networks..."
]
# Output queries.json format:
[
{
"query": "What is machine learning?",
"passage": "Passage text about machine learning..."
}
]
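Downstream stages of the Reinforced IR pipeline consume the queries.json file as query-passage pairs. A minimal loading sketch (the file path and tuple layout are assumptions, not part of the script):

```python
import json
import os
import tempfile

# Write a sample file matching the documented queries.json format
sample = [
    {"query": "What is machine learning?",
     "passage": "Passage text about machine learning..."}
]
path = os.path.join(tempfile.gettempdir(), "queries.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

# Load the filtered pairs for downstream training
with open(path, encoding="utf-8") as f:
    pairs = json.load(f)
train_pairs = [(p["query"], p["passage"]) for p in pairs]
```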