Principle: Marker Inc Korea AutoRAG Query Generation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Generation, Evaluation Methodology |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Query generation uses large language models to synthesize natural language questions from passages, creating the question component of evaluation QA pairs that test a RAG system's retrieval and generation capabilities.
Description
Query generation is the step in the evaluation data creation pipeline where source passages are transformed into questions that a user might naturally ask. The generated queries become the inputs against which the RAG system is evaluated: the system must retrieve the correct source passages and produce an accurate answer. The quality, diversity, and naturalness of these queries directly determine how meaningful the evaluation results will be.
Several query generation strategies exist, each targeting different question characteristics:
- Factoid queries ask about specific facts stated in the passage (e.g., "What year was the company founded?"). These are the most common type and test basic information retrieval.
- Concept completion queries ask the reader to complete or explain a concept described in the passage (e.g., "Explain the mechanism by which X works."). These test deeper comprehension.
- Two-hop incremental queries require synthesizing information from two separate passages to answer, testing cross-document reasoning.
- Custom queries use user-defined prompts, enabling domain-specific question styles.
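The strategy distinction above can be sketched as a mapping from strategy name to prompt template. The template names and wording here are illustrative assumptions, not AutoRAG's actual prompts:

```python
# Hypothetical strategy-to-template mapping; wording is illustrative only.
QUERY_STRATEGIES = {
    "factoid": (
        "Generate one factoid question answerable only from the text below.\n"
        "Text:\n{context}\nQuestion:"
    ),
    "concept_completion": (
        "Ask the reader to complete or explain a concept from the text below.\n"
        "Text:\n{context}\nQuestion:"
    ),
    "two_hop_incremental": (
        "Generate one question that requires combining facts from BOTH "
        "passages below.\nPassage 1:\n{context_1}\n"
        "Passage 2:\n{context_2}\nQuestion:"
    ),
}

def build_prompt(strategy: str, **passages: str) -> str:
    """Fill the chosen strategy's template with passage content."""
    return QUERY_STRATEGIES[strategy].format(**passages)
```

A custom strategy would simply add another user-defined entry to the mapping.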
All strategies follow a common pattern: the passage content is formatted into a prompt template alongside system instructions, sent to an LLM, and the model's response is extracted as the generated question. The prompt engineering is critical; it must guide the LLM to produce questions that are answerable from the given passage, natural in phrasing, and specific enough to have a clear correct answer.
Usage
Query generation is applied after corpus sampling and before answer generation. It is invoked via the QA class's batch_apply() method with an async query generation function and an LLM instance. The make_retrieval_gt_contents() method must be called first to populate the passage contents that the query generator needs.
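The call order described above can be mirrored with a minimal stand-in. The `QA`, `make_retrieval_gt_contents()`, and `batch_apply()` names come from the text, but this implementation is a sketch, not AutoRAG's actual code:

```python
# Stand-in QA class illustrating the required call order:
# make_retrieval_gt_contents() before batch_apply(). Not AutoRAG's real code.
from dataclasses import dataclass, field

@dataclass
class QA:
    rows: list = field(default_factory=list)     # one dict per QA row
    corpus: dict = field(default_factory=dict)   # doc_id -> passage text

    def make_retrieval_gt_contents(self) -> "QA":
        # Populate the passage contents the query generator needs.
        for row in self.rows:
            row["retrieval_gt_contents"] = [
                self.corpus[doc_id] for doc_id in row["retrieval_gt"]
            ]
        return self

    def batch_apply(self, fn) -> "QA":
        # Apply a query-generation function to every row
        # (synchronous here for simplicity; AutoRAG's is async).
        for row in self.rows:
            fn(row)
        return self

def toy_query_gen(row: dict) -> None:
    # Placeholder for the LLM call.
    row["query"] = f"What does the passage say? ({row['qid']})"

qa = QA(
    rows=[{"qid": "q1", "retrieval_gt": ["d1"]}],
    corpus={"d1": "AutoRAG generates evaluation queries from passages."},
)
qa.make_retrieval_gt_contents().batch_apply(toy_query_gen)
```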
Theoretical Basis
The query generation process follows this pattern:
INPUT:  QA DataFrame with columns (qid, retrieval_gt, retrieval_gt_contents)
        LLM instance L
        Language lang
        Strategy prompt template T
OUTPUT: QA DataFrame with added column (query)

For each row r_i:
    context  = concatenate(r_i.retrieval_gt_contents)
    prompt   = T.format(context=context)
    messages = [SystemMessage(T.system), UserMessage(prompt)]
    response = L.achat(messages)
    r_i.query = response.content
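The pattern above translates directly into Python. The message and LLM classes here are stubs standing in for a real async chat client; only the control flow mirrors the pseudocode:

```python
# Direct translation of the pseudocode, with a stubbed async LLM client.
import asyncio
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

class StubLLM:
    """Stands in for an LLM client exposing an async achat() method."""
    async def achat(self, messages):
        # A real model would generate a question from the prompt;
        # this stub returns a fixed one so the flow is testable.
        return Message("assistant", "What year was the company founded?")

FACTOID_SYSTEM = (
    "You are a helpful assistant that generates a factoid question from the "
    "given text. The question must be answerable using only the provided "
    "text. Generate exactly one question."
)

async def generate_query(row: dict, llm) -> dict:
    context = "\n".join(row["retrieval_gt_contents"])          # concatenate
    prompt = f"Text:\n1. {context}\nGenerated Question from the Text:"
    messages = [Message("system", FACTOID_SYSTEM), Message("user", prompt)]
    response = await llm.achat(messages)
    row["query"] = response.content
    return row

row = {"qid": "q1", "retrieval_gt_contents": ["ACME Corp was founded in 1999."]}
asyncio.run(generate_query(row, StubLLM()))
```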
Factoid query generation specifically instructs the LLM with a system prompt such as:
System: You are a helpful assistant that generates a factoid question
from the given text. The question must be answerable using only
the provided text. Generate exactly one question.
User: Text:
1. [passage content]
Generated Question from the Text:
Quality dimensions for generated queries:
| Dimension | Description |
|---|---|
| Answerability | The question must be answerable from the source passage alone |
| Specificity | The question should have a clear, unambiguous answer |
| Naturalness | The question should resemble what a real user would ask |
| Diversity | Across the dataset, questions should cover varied aspects of the content |
| Independence | Ideally, the question should be understandable without seeing the passage |
Batch processing handles all QA rows asynchronously with a configurable batch size, making efficient use of LLM API rate limits while preventing memory exhaustion on large datasets.
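One way to realize this, sketched below with illustrative names rather than AutoRAG's actual implementation: process rows in slices of `batch_size`, running each slice concurrently so the number of in-flight LLM calls stays bounded.

```python
# Minimal batched-async sketch: slices run concurrently, one slice at a time,
# bounding in-flight requests and memory use. Names are illustrative.
import asyncio

async def fake_query_gen(row: dict) -> dict:
    await asyncio.sleep(0)          # stands in for an LLM API call
    row["query"] = f"Question about {row['qid']}?"
    return row

async def batch_apply(rows, fn, batch_size: int = 2):
    out = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i : i + batch_size]
        # gather() runs one slice concurrently and preserves row order.
        out.extend(await asyncio.gather(*(fn(r) for r in chunk)))
    return out

rows = [{"qid": f"q{i}"} for i in range(5)]
rows = asyncio.run(batch_apply(rows, fake_query_gen))
```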