
Implementation:FlagOpen FlagEmbedding Reinforced IR Generate Universal Query

From Leeroopedia


Knowledge Sources
Domains Information Retrieval, Data Generation, Query Generation
Last Updated 2026-02-09 00:00 GMT

Overview

Generates synthetic queries from corpus passages using LLMs with quality control for Reinforced IR training data.

Description

This script generates high-quality synthetic queries from corpus passages using large language models, forming the foundation of the Reinforced IR training pipeline. It loads corpus passages from datasets, uses task-specific prompts to generate relevant queries for each passage, and applies a quality control filter to ensure generated queries are appropriate and useful.

The pipeline operates in two stages: it first generates queries using task-specific prompts (e.g., for fact verification or entity retrieval), then uses the same or a different LLM to evaluate query quality with a binary accept/reject decision. Only query-passage pairs with quality score "1" are retained. The script supports multiple dataset structures including CQADupStack, can limit the number of training examples per dataset, and handles both local LLMs and API-based models.
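The two-stage loop can be sketched as follows. This is an illustrative reconstruction, not the script's actual code: `ask` stands in for whichever configured agent (GPTAgent, LLMAgent, or LLMInstructAgent) produces a completion, and `get_gen_prompt` / `get_qc_prompt` stand in for the prompt builders imported from `prompts`.

```python
def generate_and_filter(passages, ask, get_gen_prompt, get_qc_prompt):
    """Generate one query per passage, then keep only the pairs
    that the quality-control stage accepts with a score of "1"."""
    kept = []
    for passage in passages:
        # Stage 1: generate a query from the passage with a task-specific prompt.
        query = ask(get_gen_prompt(passage)).strip()
        # Stage 2: the LLM answers with a binary "1" (accept) / "0" (reject).
        score = ask(get_qc_prompt(query, passage)).strip()
        if score == "1":
            kept.append({"query": query, "passage": passage})
    return kept
```

The binary accept/reject decision keeps the filter cheap: a single short completion per candidate pair, rather than a graded rubric.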

Usage

Use this script as the first step in the Reinforced IR pipeline to create synthetic query-passage pairs from unlabeled corpus data, with automatic quality filtering to ensure high-quality training data.

Code Reference

Source Location

Signature

def main(opt):
    """Main function to generate and filter queries"""

def parse_option():
    """Parse command line arguments"""

Import

import argparse
import json
import random
from agent import GPTAgent, LLMAgent, LLMInstructAgent
from prompts import get_query_generation_prompt, get_quality_control_prompt

I/O Contract

Inputs

Name Type Required Description
generate_model_path str Yes Path or name of LLM for query generation
dataset_path str Yes Path to datasets with corpus.json files
output_dir str Yes Directory to save queries.json files
train_num int No Fixed number of training examples per dataset
train_ratio float No Ratio of corpus to use for training
temperature float No LLM generation temperature (default: 0.2)
max_tokens int No Max tokens per generation (default: 300)
model_type str Yes Type of LLM (llm, llm_instruct, gpt)
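A `parse_option` that mirrors the inputs table above might look like the following. This is a hypothetical reconstruction of the CLI, not the script's actual `parse_option`; it returns the parser rather than calling `parse_args()` so it can be exercised without touching `sys.argv`.

```python
import argparse

def parse_option():
    """Build a parser matching the I/O contract above (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Generate and quality-filter synthetic queries from corpus passages")
    parser.add_argument("--generate_model_path", type=str, required=True,
                        help="Path or name of the LLM used for query generation")
    parser.add_argument("--dataset_path", type=str, required=True,
                        help="Path to datasets containing corpus.json files")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory where queries.json files are written")
    parser.add_argument("--model_type", type=str, required=True,
                        choices=["llm", "llm_instruct", "gpt"],
                        help="Type of LLM backend")
    parser.add_argument("--train_num", type=int, default=None,
                        help="Fixed number of training examples per dataset")
    parser.add_argument("--train_ratio", type=float, default=None,
                        help="Ratio of the corpus to use for training")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="LLM generation temperature")
    parser.add_argument("--max_tokens", type=int, default=300,
                        help="Max tokens per generation")
    return parser
```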

Outputs

Name Type Description
queries.json JSON Filtered query-passage pairs with quality score "1"

Usage Examples

# Command line usage
python generate_universal_query.py \
    --generate_model_path gpt-4o-mini \
    --model_type gpt \
    --api_key YOUR_API_KEY \
    --dataset_path ./data \
    --output_dir ./synthetic \
    --train_ratio 0.1 \
    --temperature 0.2 \
    --max_tokens 300

# Alternative with local LLM
python generate_universal_query.py \
    --generate_model_path Meta-Llama-3-8B \
    --model_type llm_instruct \
    --dataset_path ./data \
    --output_dir ./synthetic \
    --train_num 10000 \
    --gpu_memory_utilization 0.8

# Input corpus.json format:
[
    "Passage text about machine learning...",
    "Another passage about neural networks..."
]

# Output queries.json format:
[
    {
        "query": "What is machine learning?",
        "passage": "Passage text about machine learning..."
    }
]
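The `train_num` / `train_ratio` options above control how much of each `corpus.json` is sampled for generation. A minimal sketch of that selection step, assuming the JSON shapes shown in the examples (the real script's sampling logic may differ):

```python
import json
import random

def select_training_passages(corpus_path, train_num=None, train_ratio=None, seed=42):
    """Load a corpus.json (a JSON list of passage strings) and sample the
    subset of passages to feed into query generation. Illustrative only."""
    with open(corpus_path) as f:
        passages = json.load(f)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    if train_num is not None:
        k = min(train_num, len(passages))   # fixed count, capped at corpus size
    elif train_ratio is not None:
        k = max(1, int(len(passages) * train_ratio))  # fraction of the corpus
    else:
        k = len(passages)                   # neither set: use everything
    return rng.sample(passages, k)
```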
