Implementation:Marker Inc Korea AutoRAG QA Batch Apply Factoid Query Gen
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Generation |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Concrete tool for generating factoid questions from passages using LLMs, provided by AutoRAG's QA schema and query modules.
Description
QA.batch_apply applies an async generation function to each row of the QA DataFrame, running the calls in concurrent batches. For query generation, pass it factoid_query_gen, which is available in OpenAI and LlamaIndex variants. The OpenAI variant parses structured output into Pydantic models, while the LlamaIndex variant uses LlamaIndex's chat interface. Both select a prompt template by language.
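The batching pattern described above can be illustrated with a minimal, standard-library sketch. This is not AutoRAG's implementation: `run_in_batches` and `add_length` are hypothetical names, and the row function is a stand-in for an LLM-backed generator such as factoid_query_gen.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_in_batches(
    rows: List[Dict],
    fn: Callable[..., Awaitable[Dict]],
    batch_size: int = 32,
    **kwargs: Any,
) -> List[Dict]:
    """Apply an async function to every row, at most batch_size rows at a time."""
    results: List[Dict] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        # asyncio.gather preserves input order, so results line up with rows.
        results.extend(await asyncio.gather(*(fn(row, **kwargs) for row in batch)))
    return results


async def add_length(row: Dict, key: str) -> Dict:
    """Hypothetical row function: returns a copy of the row with one extra field."""
    await asyncio.sleep(0)  # stand-in for a network call to an LLM
    return {**row, "length": len(row[key])}


rows = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
out = asyncio.run(run_in_batches(rows, add_length, batch_size=2, key="text"))
```

Extra keyword arguments are forwarded unchanged to the row function, which mirrors how client, model_name, and lang reach factoid_query_gen through batch_apply's **kwargs.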
Usage
Call QA.make_retrieval_gt_contents() first to populate passage contents, then pass factoid_query_gen to QA.batch_apply. Choose the OpenAI variant when using GPT models, or the LlamaIndex variant for other LLM providers.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/data/qa/schema.py (QA.batch_apply), autorag/data/qa/query/openai_gen_query.py (factoid_query_gen OpenAI), autorag/data/qa/query/llama_gen_query.py (factoid_query_gen LlamaIndex)
- Lines: schema.py L134-146, openai_gen_query.py L39-47, llama_gen_query.py L25-32
Signature
# QA.batch_apply (schema.py)
def batch_apply(
    self,
    fn: Callable[[Dict, Any], Awaitable[Dict]],
    batch_size: int = 32,
    **kwargs
) -> "QA":
    """
    Apply an async function to each row in batches.

    Args:
        fn: Async function that takes a row dict and returns modified row dict.
        batch_size: Number of concurrent tasks (default 32).
        **kwargs: Additional args passed to fn.
    """

# OpenAI variant (openai_gen_query.py)
async def factoid_query_gen(
    row: Dict,
    client: AsyncClient,
    model_name: str = "gpt-4o-2024-08-06",
    lang: str = "en",
) -> Dict:
    """Generate a factoid question using OpenAI structured output."""

# LlamaIndex variant (llama_gen_query.py)
async def factoid_query_gen(
    row: Dict,
    llm: BaseLLM,
    lang: str = "en",
) -> Dict:
    """Generate a factoid question using LlamaIndex LLM."""
Import
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import factoid_query_gen # OpenAI
# OR
from autorag.data.qa.query.llama_gen_query import factoid_query_gen # LlamaIndex
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| QA instance | QA | Yes | Must have retrieval_gt_contents column (call make_retrieval_gt_contents() first) |
| client | AsyncClient | Yes (OpenAI) | OpenAI async client |
| llm | BaseLLM | Yes (LlamaIndex) | LlamaIndex LLM instance |
| model_name | str | No | Model name (default: gpt-4o-2024-08-06) |
| lang | str | No | Language code: en, ko, or ja (default: en) |
| batch_size | int | No | Concurrent batch size (default: 32) |
Outputs
| Name | Type | Description |
|---|---|---|
| QA instance | QA | Original QA with added "query" column containing generated questions |
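
The row-level contract behind these tables can be sketched with a stub generator that calls no LLM. Everything here is illustrative: `stub_query_gen` is a hypothetical name, the question text is made up, and retrieval_gt_contents is assumed to be a nested list of passage strings, which the tables above do not specify.

```python
import asyncio
from typing import Dict


async def stub_query_gen(row: Dict, lang: str = "en") -> Dict:
    """Hypothetical stand-in for a query generator; no LLM is called."""
    # Assumption: retrieval_gt_contents is a nested list of passage strings.
    first_passage = row["retrieval_gt_contents"][0][0]
    # The row keeps its existing keys and gains a "query" key.
    row["query"] = f"[{lang}] What does this passage state: {first_passage[:30]}?"
    return row


row = {"retrieval_gt_contents": [["Seoul is the capital of South Korea."]]}
result = asyncio.run(stub_query_gen(dict(row), lang="en"))
```

Any async function with this shape (row dict in, row dict out, extra options as keyword arguments) matches the fn parameter in the batch_apply signature.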
Usage Examples
Generate Factoid Queries with OpenAI
from openai import AsyncClient
from autorag.data.qa.query.openai_gen_query import factoid_query_gen

client = AsyncClient()  # reads OPENAI_API_KEY from the environment

# `qa` is an existing QA instance; populate retrieval_gt_contents before generating queries
qa = qa.make_retrieval_gt_contents()

# Generate one factoid question per row
qa = qa.batch_apply(
    factoid_query_gen,
    client=client,
    model_name="gpt-4o-2024-08-06",
    lang="en",
    batch_size=32,
)
print(qa.data["query"].head())