Implementation:Marker Inc Korea AutoRAG Factoid Query Gen
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Generation, Evaluation Methodology |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool provided by the AutoRAG framework for generating factoid single-hop questions from passages using a large language model.
Description
The factoid_query_gen function generates a factoid question from a single passage by prompting a LlamaIndex-compatible LLM with a predefined factoid single-hop prompt template. It is an async function that operates on individual QA row dictionaries, making it compatible with the QA.batch_apply() method for efficient parallel processing.
Internally, factoid_query_gen delegates to the llama_index_generate_base helper, which concatenates all retrieval ground-truth contents into a numbered context string, combines it with the language-appropriate factoid prompt template, and sends the resulting messages to the LLM via achat(). The LLM's response is stored in the query field of the row dictionary.
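The delegation described above can be sketched as follows. This is an illustrative reimplementation, not AutoRAG's actual helper (which lives in autorag.data.qa.query.llama_gen_query); StubLLM and FakeChatResponse are hypothetical stand-ins for a LlamaIndex BaseLLM and its chat response.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FakeChatResponse:
    # Stand-in for the LlamaIndex chat response; only the text matters here.
    content: str

class StubLLM:
    # Hypothetical stub replacing a real BaseLLM for illustration.
    async def achat(self, messages: List[str]) -> FakeChatResponse:
        return FakeChatResponse(content="When was AutoRAG first released?")

FACTOID_PROMPT = "Generate a factoid question from the passages below."

async def generate_base_sketch(row: Dict, llm: StubLLM, prompt: str) -> Dict:
    # Flatten the nested retrieval_gt_contents and number each passage,
    # mirroring how the real helper builds its context string.
    passages = [p for group in row["retrieval_gt_contents"] for p in group]
    context = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(passages))
    response = await llm.achat([f"{prompt}\n\n{context}"])
    row["query"] = response.content  # result stored under the 'query' key
    return row

row = {"retrieval_gt_contents": [["AutoRAG automates RAG pipeline evaluation."]]}
result = asyncio.run(generate_base_sketch(row, StubLLM(), FACTOID_PROMPT))
print(result["query"])
```

The real helper differs in details (it builds ChatMessage objects and uses the actual prompt templates), but the shape of the data flow is the same: nested passages in, a single query string out.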
The module also provides additional query generation strategies including concept_completion_query_gen, two_hop_incremental, custom_query_gen, and the experimental multiple_queries_gen. All follow the same architectural pattern but use different prompt templates.
Usage
Import and use this function as the transformation function argument to QA.batch_apply(). The make_retrieval_gt_contents() method must be called on the QA instance first to populate the retrieval_gt_contents column that this function reads.
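The ordering requirement matters because the generator reads the retrieval_gt_contents key directly. A row that has been sampled but not yet run through make_retrieval_gt_contents() carries only document IDs, not passage texts (illustrative sketch; the field values are invented):

```python
# A freshly sampled row holds retrieval_gt IDs but not the resolved
# passage texts that factoid_query_gen needs.
row = {"qid": "q1", "retrieval_gt": [["doc-1"]]}

try:
    row["retrieval_gt_contents"]
except KeyError as exc:
    # Calling the generator at this point would fail the same way.
    print(f"missing column: {exc}")
```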
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/data/qa/query/llama_gen_query.py (lines 25-32)
Signature
async def factoid_query_gen(
row: Dict,
llm: BaseLLM,
lang: str = "en",
) -> Dict:
return await llama_index_generate_base(
row, llm, QUERY_GEN_PROMPT["factoid_single_hop"][lang]
)
Import
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| row | Dict | yes | A dictionary representing a single QA row. Must contain the key retrieval_gt_contents (List[List[str]]) with the passage texts. |
| llm | BaseLLM | yes | A LlamaIndex BaseLLM instance (e.g., OpenAI, Anthropic). Used for async chat completion via achat(). |
| lang | str | no | Language code for the prompt template. Supported values: "en", "ko", "ja". Defaults to "en". |
Outputs
| Name | Type | Description |
|---|---|---|
| row | Dict | The input row dictionary with an added query key containing the generated factoid question (str). |
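The contract above can be summarized as a before/after pair. All values here are invented for illustration; only the key names and types follow the tables.

```python
# Input row: retrieval_gt_contents (List[List[str]]) is the required key.
input_row = {
    "qid": "q1",
    "retrieval_gt_contents": [["AutoRAG builds QA datasets automatically."]],
}

# Output row: the same dict with one added key, 'query' (the generated
# factoid question); all pre-existing keys pass through unchanged.
output_row = {
    **input_row,
    "query": "What does AutoRAG build automatically?",
}

assert set(output_row) - set(input_row) == {"query"}
print(sorted(output_row))
```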
Usage Examples
Basic Usage
from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from llama_index.llms.openai import OpenAI
# Set up LLM
llm = OpenAI(model="gpt-4o-mini")
# Build pipeline up to query generation; parsed_df is a DataFrame
# produced by an earlier parsing step
corpus = Raw(parsed_df).chunk("token", chunk_size=512)
qa = (corpus
.sample(random_single_hop, n=100)
.make_retrieval_gt_contents()
.batch_apply(factoid_query_gen, llm=llm, lang="en"))
# qa.data now has columns: qid, retrieval_gt, retrieval_gt_contents, query
print(qa.data[["qid", "query"]].head())
Korean Language Queries
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
qa = (corpus
.sample(random_single_hop, n=50)
.make_retrieval_gt_contents()
.batch_apply(factoid_query_gen, llm=llm, lang="ko"))
With Custom Batch Size
# Process in smaller batches to stay within API rate limits
qa = (corpus
.sample(random_single_hop, n=500)
.make_retrieval_gt_contents()
.batch_apply(factoid_query_gen, batch_size=16, llm=llm, lang="en"))
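batch_size caps how many rows are processed concurrently. The chunked-concurrency pattern behind it can be sketched as follows; this is an illustrative stand-in, not AutoRAG's actual batch_apply implementation, and add_query is a hypothetical substitute for factoid_query_gen.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

async def batch_apply_sketch(
    rows: List[Dict],
    fn: Callable[[Dict], Awaitable[Dict]],
    batch_size: int = 16,
) -> List[Dict]:
    # Await each chunk of batch_size coroutines before starting the next,
    # so at most batch_size LLM calls are in flight at once.
    out: List[Dict] = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        out.extend(await asyncio.gather(*(fn(r) for r in chunk)))
    return out

async def add_query(row: Dict) -> Dict:
    # Stand-in for factoid_query_gen: tags the row instead of calling an LLM.
    row["query"] = f"question-for-{row['qid']}"
    return row

rows = [{"qid": i} for i in range(5)]
done = asyncio.run(batch_apply_sketch(rows, add_query, batch_size=2))
print([r["query"] for r in done])
```

Because asyncio.gather preserves argument order and the chunks are awaited sequentially, the output rows come back in the same order as the input, which is what lets the results map cleanly back onto the QA DataFrame.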