
Implementation:Marker Inc Korea AutoRAG Generate Simple QA Dataset

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, QA_Generation, NLP
Last Updated 2026-02-08 06:00 GMT

Overview

A concrete tool that uses the guidance library to generate simple question-answer (QA) pairs from the rows of a corpus.

Description

⚠️ LEGACY/DEPRECATED: This module is in the legacy/ directory and is superseded by the modern QA schema pipeline. See Heuristic:Marker_Inc_Korea_AutoRAG_Warning_Deprecated_Legacy_QA_Creation.

This module provides a lightweight, guidance-library-based approach to QA dataset creation. It exposes two functions.

generate_qa_row uses the guidance library's chat template to prompt an LLM with the passage content, then extracts a query and an answer through gen() calls in a structured conversation.

generate_simple_qa_dataset iterates over the corpus rows sequentially, calls the row-generation function once per row, and assembles the results into a parquet-compatible DataFrame with qid, query, retrieval_gt, generation_gt, and metadata columns.
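The assembly step described above can be sketched roughly as follows. This is an illustrative re-creation, not the library source: the name sketch_simple_qa_dataset and the stand-in echo_row are hypothetical, the wrapping of doc_id into retrieval_gt and of the answer into generation_gt is an assumption inferred from the documented column types, and the parquet save step is left out.

```python
import uuid
from typing import Callable

import pandas as pd

QA_COLUMNS = ["qid", "query", "retrieval_gt", "generation_gt", "metadata"]


def sketch_simple_qa_dataset(
    llm, corpus_data: pd.DataFrame, generate_row_function: Callable, **kwargs
) -> pd.DataFrame:
    """Hypothetical sketch of the sequential loop (no parquet write)."""
    rows = []
    for _, corpus_row in corpus_data.iterrows():
        # One generation call per corpus row; no batching, no async.
        response = generate_row_function(llm, corpus_row, **kwargs)
        rows.append(
            {
                "qid": str(uuid.uuid4()),
                "query": response["query"],
                # Assumption: the source passage is the only retrieval ground truth.
                "retrieval_gt": [[corpus_row["doc_id"]]],
                # Assumption: the single answer is wrapped in a list.
                "generation_gt": [response["generation_gt"]],
                "metadata": corpus_row["metadata"],
            }
        )
    return pd.DataFrame(rows, columns=QA_COLUMNS)


def echo_row(llm, corpus_data_row):
    """Stand-in generator so the sketch runs without an LLM."""
    return {
        "query": f"What does {corpus_data_row['doc_id']} say?",
        "generation_gt": corpus_data_row["contents"],
    }


corpus = pd.DataFrame(
    {"doc_id": ["d1", "d2"], "contents": ["alpha", "beta"], "metadata": [{}, {}]}
)
qa = sketch_simple_qa_dataset(None, corpus, echo_row)
```

Because each row is processed in turn, the run time grows linearly with corpus size, which is the trade-off for avoiding async batch handling.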

Usage

Import these functions when using the legacy QA creation pipeline with a guidance-library LLM backend. They suit simple single-document question generation that does not need async batch processing. Pass generate_simple_qa_dataset a guidance model, a corpus DataFrame, an output parquet path, and a row-generation function (such as generate_qa_row) to produce a QA evaluation dataset.

Code Reference

Source Location

Signature

from typing import Callable

import pandas as pd


def generate_qa_row(llm, corpus_data_row):
    """
    Generate a QA pair from a single corpus row using a guidance model.

    :param llm: guidance model
    :param corpus_data_row: Row with 'contents' column
    :return: dict with 'query' and 'generation_gt' keys
    """


def generate_simple_qa_dataset(
    llm,
    corpus_data: pd.DataFrame,
    output_filepath: str,
    generate_row_function: Callable,
    **kwargs,
):
    """
    Generate QA dataset from corpus and save to parquet.

    :param llm: guidance.models.Model
    :param corpus_data: DataFrame with corpus data
    :param output_filepath: Output parquet path (directory must exist, file must not)
    :param generate_row_function: Function(llm, corpus_data_row, **kwargs) -> dict
    :param kwargs: Additional args for generate_row_function
    :return: QA dataset as pd.DataFrame
    """

Import

from autorag.data.legacy.qacreation.simple import generate_simple_qa_dataset, generate_qa_row

I/O Contract

Inputs

Name | Type | Required | Description
llm | guidance.models.Model | Yes | Guidance library LLM model instance
corpus_data | pd.DataFrame | Yes | Corpus DataFrame with doc_id, contents, and metadata columns
output_filepath | str | Yes | Output parquet file path (directory must exist, file must not exist)
generate_row_function | Callable | Yes | Function that takes (llm, corpus_data_row, **kwargs) and returns a dict with query and generation_gt keys
**kwargs | keyword arguments | No | Additional arguments for generate_row_function
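The output_filepath constraint (directory must exist, file must not) can be checked before starting a long generation run. The helper below is a hypothetical pre-flight check using only the standard library; how the library itself reports a violation is not stated here, so the exception types are assumptions.

```python
import os
import tempfile


def check_output_filepath(output_filepath: str) -> None:
    """Hypothetical pre-flight check mirroring the documented path contract."""
    parent = os.path.dirname(os.path.abspath(output_filepath))
    if not os.path.isdir(parent):
        raise NotADirectoryError(f"Output directory does not exist: {parent}")
    if os.path.exists(output_filepath):
        raise FileExistsError(f"Output file already exists: {output_filepath}")


second_call_error = None
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "qa_simple.parquet")
    check_output_filepath(target)  # passes: directory exists, file does not
    open(target, "w").close()  # simulate a previous run's output
    try:
        check_output_filepath(target)
    except FileExistsError as err:
        second_call_error = err
```

Running such a check first avoids spending LLM calls on a run whose result cannot be saved.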

Outputs

Name | Type | Description
qa_dataset | pd.DataFrame | DataFrame with qid (UUID str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str]), metadata (dict)
parquet file | File | Saved to output_filepath
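A quick way to confirm that a generated dataset matches the column types listed above is to validate its rows as plain dictionaries. The function find_schema_problems is a hypothetical helper, not part of AutoRAG, and it assumes list-valued columns are Python lists or tuples (as in an in-memory result, before any parquet round trip).

```python
import uuid
from typing import Any, Iterable, List, Mapping

EXPECTED_COLUMNS = ("qid", "query", "retrieval_gt", "generation_gt", "metadata")


def _is_str_list(value: Any) -> bool:
    """True for a list or tuple whose items are all strings."""
    return isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value)


def find_schema_problems(records: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return one message per violation of the documented output schema."""
    problems = []
    for i, rec in enumerate(records):
        missing = [c for c in EXPECTED_COLUMNS if c not in rec]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        try:
            uuid.UUID(rec["qid"])
        except (ValueError, TypeError, AttributeError):
            problems.append(f"row {i}: qid is not a UUID string")
        if not isinstance(rec["query"], str):
            problems.append(f"row {i}: query is not a str")
        gt = rec["retrieval_gt"]
        if not (isinstance(gt, (list, tuple)) and all(_is_str_list(g) for g in gt)):
            problems.append(f"row {i}: retrieval_gt is not List[List[str]]")
        if not _is_str_list(rec["generation_gt"]):
            problems.append(f"row {i}: generation_gt is not List[str]")
        if not isinstance(rec["metadata"], Mapping):
            problems.append(f"row {i}: metadata is not a dict")
    return problems


good = {
    "qid": str(uuid.uuid4()),
    "query": "What is AutoRAG?",
    "retrieval_gt": [["doc-1"]],
    "generation_gt": ["A RAG evaluation toolkit."],
    "metadata": {},
}
bad = dict(good, qid="not-a-uuid", generation_gt="plain string")
problems = find_schema_problems([good, bad])
```

With a DataFrame in hand, the rows can be passed as qa_df.to_dict("records").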

Usage Examples

Basic Simple QA Generation

import pandas as pd
import guidance

from autorag.data.legacy.qacreation.simple import (
    generate_simple_qa_dataset,
    generate_qa_row,
)

# 1. Load guidance model
llm = guidance.models.OpenAI("gpt-3.5-turbo")

# 2. Load corpus
corpus_df = pd.read_parquet("./corpus.parquet")

# 3. Generate QA dataset
qa_df = generate_simple_qa_dataset(
    llm=llm,
    corpus_data=corpus_df,
    output_filepath="./qa_simple.parquet",
    generate_row_function=generate_qa_row,
)

print(qa_df[["query", "generation_gt"]].head())

Custom Row Generation Function

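import guidance
from guidance import gen


def custom_qa_row(llm, corpus_data_row, **kwargs):
    """Custom QA generation with domain-specific prompting."""
    # Guidance models are extended with +=, so build the conversation on a
    # local name and leave the caller's llm untouched.
    temp_llm = llm
    with guidance.user():
        temp_llm += f"Generate a technical question about: {corpus_data_row['contents']}"
    with guidance.assistant():
        # Capture the question as "query"; generation stops at the first "?".
        temp_llm += gen("query", stop="?")
    with guidance.user():
        temp_llm += "Now provide the answer:"
    with guidance.assistant():
        # Capture the answer as "generation_gt".
        temp_llm += gen("generation_gt")

    # Return the two keys that generate_simple_qa_dataset expects.
    return {"query": temp_llm["query"], "generation_gt": temp_llm["generation_gt"]}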

qa_df = generate_simple_qa_dataset(
    llm=llm,
    corpus_data=corpus_df,
    output_filepath="./qa_custom.parquet",
    generate_row_function=custom_qa_row,
)
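Since extra keyword arguments are documented as being forwarded to generate_row_function, a row function can expose its own options. The sketch below is a hypothetical offline stand-in (stub_qa_row and its max_chars option are invented for illustration) that ignores the model entirely, which can help when dry-running the pipeline without LLM calls.

```python
def stub_qa_row(llm, corpus_data_row, max_chars: int = 80, **kwargs):
    """Offline stand-in for a row function; llm is accepted but never called."""
    # max_chars is a hypothetical option that would arrive through **kwargs forwarding.
    snippet = corpus_data_row["contents"][:max_chars]
    return {
        "query": f"What does the following passage describe: {snippet!r}?",
        "generation_gt": snippet,
    }


# Direct call with a plain dict standing in for a corpus row.
result = stub_qa_row(None, {"contents": "abcdefghij"}, max_chars=4)
```

Under the documented contract, passing generate_row_function=stub_qa_row together with max_chars=200 to generate_simple_qa_dataset would be expected to reach the stub as a keyword argument.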
