Implementation:Marker Inc Korea AutoRAG Generate Simple QA Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, QA_Generation, NLP |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
A concrete tool that uses the guidance library to generate simple question-answer (QA) pairs from the rows of a corpus DataFrame.
Description
⚠️ LEGACY/DEPRECATED: This module is in the legacy/ directory and is superseded by the modern QA schema pipeline. See Heuristic:Marker_Inc_Korea_AutoRAG_Warning_Deprecated_Legacy_QA_Creation.
This module provides a lightweight, guidance-library-based approach to QA dataset creation. generate_qa_row uses the guidance library's chat template to prompt an LLM with passage content and extract a query and answer via gen() calls in a structured conversation format. generate_simple_qa_dataset iterates over corpus rows sequentially, calls the generation function for each row, and assembles results into a parquet-compatible DataFrame with qid, query, retrieval_gt, generation_gt, and metadata columns.
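The row-by-row assembly described above can be illustrated with a minimal sketch. This is not the module's source: `assemble_qa_rows` and `stub_row_function` are hypothetical names, the stub replaces the guidance-backed generator so the example runs without an LLM, corpus rows are plain dicts rather than DataFrame rows, and the parquet write is left out. The assumption that retrieval_gt wraps the row's doc_id and generation_gt wraps the generated answer is inferred from the output types in the I/O contract.

```python
import uuid


def stub_row_function(llm, corpus_data_row, **kwargs):
    # Hypothetical stand-in for a guidance-backed generator: no LLM is called.
    prefix = kwargs.get("prefix", "What is described in")
    return {
        "query": f"{prefix} {corpus_data_row['doc_id']}?",
        "generation_gt": corpus_data_row["contents"],
    }


def assemble_qa_rows(llm, corpus_rows, generate_row_function, **kwargs):
    # Assumed behavior: one QA row per corpus row, processed sequentially,
    # with extra keyword arguments forwarded to the row function.
    rows = []
    for corpus_data_row in corpus_rows:
        response = generate_row_function(llm, corpus_data_row, **kwargs)
        rows.append(
            {
                "qid": str(uuid.uuid4()),
                "query": response["query"],
                "retrieval_gt": [[corpus_data_row["doc_id"]]],
                "generation_gt": [response["generation_gt"]],
                "metadata": corpus_data_row["metadata"],
            }
        )
    return rows


corpus_rows = [
    {"doc_id": "doc-1", "contents": "First passage.", "metadata": {"source": "a"}},
    {"doc_id": "doc-2", "contents": "Second passage.", "metadata": {"source": "b"}},
]
qa_rows = assemble_qa_rows(None, corpus_rows, stub_row_function, prefix="Which passage is")
```

With pandas, the same loop could take its rows from `corpus_data.to_dict("records")` and wrap the result in `pd.DataFrame(qa_rows)`.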
Usage
Import these functions when using the legacy QA creation pipeline with a guidance-library LLM backend. This is suitable for simple single-document question generation without the complexity of async batch processing. Pass generate_simple_qa_dataset a guidance model and a corpus DataFrame to produce a QA evaluation dataset.
Code Reference
Source Location
- Repository: Marker_Inc_Korea_AutoRAG
- File: autorag/data/legacy/qacreation/simple.py
- Lines: 1-99
Signature
def generate_qa_row(llm, corpus_data_row):
    """
    Generate a QA pair from a single corpus row using a guidance model.

    :param llm: guidance model
    :param corpus_data_row: Row with 'contents' column
    :return: dict with 'query' and 'generation_gt' keys
    """


def generate_simple_qa_dataset(
    llm,
    corpus_data: pd.DataFrame,
    output_filepath: str,
    generate_row_function: Callable,
    **kwargs,
):
    """
    Generate QA dataset from corpus and save to parquet.

    :param llm: guidance.models.Model
    :param corpus_data: DataFrame with corpus data
    :param output_filepath: Output parquet path (directory must exist, file must not)
    :param generate_row_function: Function(llm, corpus_data_row, **kwargs) -> dict
    :param kwargs: Additional args for generate_row_function
    :return: QA dataset as pd.DataFrame
    """
Import
from autorag.data.legacy.qacreation.simple import generate_simple_qa_dataset, generate_qa_row
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| llm | guidance.models.Model | Yes | Guidance library LLM model instance |
| corpus_data | pd.DataFrame | Yes | Corpus DataFrame with doc_id, contents, and metadata columns |
| output_filepath | str | Yes | Output parquet file path (directory must exist, file must not exist) |
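| generate_row_function | Callable | Yes | Function called as (llm, corpus_data_row, **kwargs) that returns a dict with query and generation_gt keys |
| **kwargs | Any | No | Additional keyword arguments forwarded to generate_row_function |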
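The output_filepath precondition (parent directory must exist, target file must not) can be checked before any LLM call is made. The sketch below is an illustration only: `check_output_path` is a hypothetical helper, and the exception types it raises are an assumption rather than something taken from the module.

```python
import os
import tempfile


def check_output_path(output_filepath):
    # Hypothetical helper; the exception types are assumed for illustration.
    parent_dir = os.path.dirname(os.path.abspath(output_filepath))
    if not os.path.isdir(parent_dir):
        raise NotADirectoryError(f"Output directory does not exist: {parent_dir}")
    if os.path.exists(output_filepath):
        raise FileExistsError(f"Output file already exists: {output_filepath}")
    return output_filepath


with tempfile.TemporaryDirectory() as tmp_dir:
    # A fresh path inside an existing directory passes the check.
    fresh_path = os.path.join(tmp_dir, "qa_simple.parquet")
    fresh_ok = check_output_path(fresh_path) == fresh_path

    # Simulate a leftover file from an earlier run.
    open(fresh_path, "wb").close()
    try:
        check_output_path(fresh_path)
        existing_rejected = False
    except FileExistsError:
        existing_rejected = True

    # A path whose parent directory is missing is rejected as well.
    try:
        check_output_path(os.path.join(tmp_dir, "missing", "qa.parquet"))
        missing_dir_rejected = False
    except NotADirectoryError:
        missing_dir_rejected = True
```

Running a check like this up front avoids spending LLM calls on a run whose result cannot be saved.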
Outputs
| Name | Type | Description |
|---|---|---|
| qa_dataset | pd.DataFrame | DataFrame with qid (UUID str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str]), metadata (dict) |
| parquet file | File | Saved to output_filepath |
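The column types listed above can be spot-checked on in-memory rows. The following is a minimal sketch under stated assumptions: `find_schema_problems` is a hypothetical helper that inspects one QA row given as a plain dict, and it is not part of AutoRAG.

```python
import uuid


def find_schema_problems(row):
    # Hypothetical checker: returns a list of human-readable problems,
    # or an empty list when the row matches the documented column types.
    problems = []

    qid = row.get("qid")
    if not isinstance(qid, str):
        problems.append("qid must be a str")
    else:
        try:
            uuid.UUID(qid)
        except ValueError:
            problems.append("qid must be a UUID string")

    if not isinstance(row.get("query"), str):
        problems.append("query must be a str")

    retrieval_gt = row.get("retrieval_gt")
    if not (
        isinstance(retrieval_gt, list)
        and all(
            isinstance(group, list) and all(isinstance(doc_id, str) for doc_id in group)
            for group in retrieval_gt
        )
    ):
        problems.append("retrieval_gt must be a list of lists of str")

    generation_gt = row.get("generation_gt")
    if not (
        isinstance(generation_gt, list)
        and all(isinstance(answer, str) for answer in generation_gt)
    ):
        problems.append("generation_gt must be a list of str")

    if not isinstance(row.get("metadata"), dict):
        problems.append("metadata must be a dict")

    return problems


good_row = {
    "qid": str(uuid.uuid4()),
    "query": "What does the passage describe?",
    "retrieval_gt": [["doc-1"]],
    "generation_gt": ["It describes the first passage."],
    "metadata": {"source": "a"},
}
# A flat list of doc IDs is a common mistake; the nesting is required.
bad_row = dict(good_row, retrieval_gt=["doc-1"])

good_problems = find_schema_problems(good_row)
bad_problems = find_schema_problems(bad_row)
```

This check targets rows before they are saved. After a parquet round trip, list columns may come back as array objects rather than Python lists, so the `isinstance(..., list)` tests would need to be relaxed for that case.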
Usage Examples
Basic Simple QA Generation
import pandas as pd
import guidance

from autorag.data.legacy.qacreation.simple import (
    generate_simple_qa_dataset,
    generate_qa_row,
)

# 1. Load guidance model
llm = guidance.models.OpenAI("gpt-3.5-turbo")

# 2. Load corpus
corpus_df = pd.read_parquet("./corpus.parquet")

# 3. Generate QA dataset
qa_df = generate_simple_qa_dataset(
    llm=llm,
    corpus_data=corpus_df,
    output_filepath="./qa_simple.parquet",
    generate_row_function=generate_qa_row,
)
print(qa_df[["query", "generation_gt"]].head())
Custom Row Generation Function
import guidance
from guidance import gen


def custom_qa_row(llm, corpus_data_row, **kwargs):
    """Custom QA generation with domain-specific prompting."""
    # `+=` rebinds temp_llm to a new model state, so the caller's llm is not modified.
    temp_llm = llm
    with guidance.user():
        temp_llm += f"Generate a technical question about: {corpus_data_row['contents']}"
    with guidance.assistant():
        # Generation stops at the first "?"; by default the stop string
        # is not included in the captured text.
        temp_llm += gen("query", stop="?")
    with guidance.user():
        temp_llm += "Now provide the answer:"
    with guidance.assistant():
        temp_llm += gen("generation_gt")
    return {"query": temp_llm["query"], "generation_gt": temp_llm["generation_gt"]}


# Reuses llm, corpus_df, and generate_simple_qa_dataset from the basic example.
qa_df = generate_simple_qa_dataset(
    llm=llm,
    corpus_data=corpus_df,
    output_filepath="./qa_custom.parquet",
    generate_row_function=custom_qa_row,
)