Implementation:Marker Inc Korea AutoRAG Make Single Content QA

Knowledge Sources	Marker_Inc_Korea_AutoRAG AutoRAG Legacy Tutorial
Domains	Data_Engineering, QA_Generation, RAG
Last Updated	2026-02-08 06:00 GMT

Overview

Concrete tool, provided by the AutoRAG legacy pipeline, for orchestrating single-content QA dataset creation from corpus DataFrames.

Description

⚠️ LEGACY/DEPRECATED: This module is in the legacy/ directory and is superseded by the modern QA schema pipeline. See Heuristic:Marker_Inc_Korea_AutoRAG_Warning_Deprecated_Legacy_QA_Creation.

The make_single_content_qa function is the core orchestration layer for the legacy QA dataset creation workflow. It samples rows from a corpus DataFrame, runs a pluggable QA creation function in batches with progress tracking and intermediate caching, and assembles results into a DataFrame with qid, query, retrieval_gt, and generation_gt columns. The companion make_qa_with_existing_qa function (deprecated) uses ChromaDB vector retrieval to find relevant passages for existing queries.
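The sample, batch, and assemble flow described above can be sketched with the standard library alone. This is an illustrative stand-in, not the code in base.py: plain dicts replace DataFrame rows, the names sketch_single_content_qa and echo_backend are hypothetical, and both the backend's return shape (one list of QA dicts per passage) and the capping of the sample at the corpus size are assumptions.

```python
import random
import uuid
from typing import Callable, Dict, List


def sketch_single_content_qa(
    corpus: List[Dict[str, str]],
    content_size: int,
    qa_creation_func: Callable[..., List[List[Dict[str, str]]]],
    random_state: int = 42,
    cache_batch: int = 32,
    **kwargs,
) -> List[Dict[str, object]]:
    """Illustrative sample -> batch -> assemble loop (not the library code)."""
    # Sample rows reproducibly; capping at the corpus size is an assumption.
    rng = random.Random(random_state)
    sampled = rng.sample(corpus, min(content_size, len(corpus)))

    rows: List[Dict[str, object]] = []
    for start in range(0, len(sampled), cache_batch):
        batch = sampled[start : start + cache_batch]
        # The backend receives passage texts; assumed to return one list of
        # QA dicts per passage.
        qa_per_passage = qa_creation_func(
            contents=[row["contents"] for row in batch], **kwargs
        )
        for passage, qa_list in zip(batch, qa_per_passage):
            for qa in qa_list:
                rows.append(
                    {
                        "qid": str(uuid.uuid4()),
                        "query": qa["query"],
                        "retrieval_gt": [[passage["doc_id"]]],
                        "generation_gt": [qa["generation_gt"]],
                    }
                )
        # An intermediate save to output_filepath would happen here;
        # it is omitted from this sketch.
    return rows


def echo_backend(contents: List[str], **_) -> List[List[Dict[str, str]]]:
    # Hypothetical backend: one trivial QA pair per passage.
    return [[{"query": f"What does this say? {c}", "generation_gt": c}] for c in contents]


corpus = [{"doc_id": f"doc-{i}", "contents": f"passage {i}"} for i in range(10)]
qa_rows = sketch_single_content_qa(
    corpus, content_size=5, qa_creation_func=echo_backend, cache_batch=2
)
```

Each resulting row carries the four documented columns, with retrieval_gt wrapping the single source doc_id in a nested list.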

Usage

Import this module when you need to generate single-hop QA evaluation datasets from a corpus DataFrame using the legacy pipeline. Pass any QA creation backend (LlamaIndex, RAGAS, guidance-based) as the qa_creation_func parameter. This function was the entry point for data creation before the modern QA schema-based pipeline was introduced, and it remains the entry point for the legacy workflow.
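To make the pluggable-backend idea concrete, here is a minimal sketch of a custom backend with its configuration bound through functools.partial. The function template_backend and its prefix parameter are hypothetical, and the return shape (a list of QA dicts per passage, each with query and generation_gt keys) is an assumption about what the orchestrator expects, not something this page confirms.

```python
from functools import partial
from typing import Dict, List


def template_backend(
    contents: List[str], prefix: str = "Summarize:"
) -> List[List[Dict[str, str]]]:
    """Hypothetical backend: one templated QA pair per passage."""
    return [
        [{"query": f"{prefix} {text}", "generation_gt": text}]
        for text in contents
    ]


# Bind backend-specific configuration up front so that the caller only
# has to supply `contents`.
qa_func = partial(template_backend, prefix="What is stated in:")

result = qa_func(contents=["AutoRAG evaluates RAG pipelines."])
```

A callable prepared this way can be passed as qa_creation_func. Backend options could presumably also be supplied through the **kwargs of make_single_content_qa, since the docstring describes those as additional keyword arguments for qa_creation_func.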

Code Reference

Source Location

Repository: Marker_Inc_Korea_AutoRAG
File: autorag/data/legacy/qacreation/base.py
Lines: 1-239

Signature

def make_single_content_qa(
    corpus_df: pd.DataFrame,
    content_size: int,
    qa_creation_func: Callable,
    output_filepath: Optional[str] = None,
    upsert: bool = False,
    random_state: int = 42,
    cache_batch: int = 32,
    **kwargs,
) -> pd.DataFrame:
    """
    Make single content (single-hop, single-document) QA dataset using given qa_creation_func.

    :param corpus_df: The corpus dataframe to make QA dataset from.
    :param content_size: Number of contents to generate QA for.
    :param qa_creation_func: Function to create QA pairs (e.g. generate_qa_llama_index).
    :param output_filepath: Optional filepath to save parquet. Directory must exist.
    :param upsert: If true, overwrite existing file.
    :param random_state: Random state for sampling.
    :param cache_batch: Batch size for intermediate caching.
    :param kwargs: Additional keyword arguments for qa_creation_func.
    :return: QA dataset DataFrame.
    """

def make_qa_with_existing_qa(
    corpus_df: pd.DataFrame,
    existing_query_df: pd.DataFrame,
    content_size: int,
    answer_creation_func: Optional[Callable] = None,
    exist_gen_gt: Optional[bool] = False,
    output_filepath: Optional[str] = None,
    embedding_model: str = "openai_embed_3_large",
    collection: Optional[chromadb.Collection] = None,
    upsert: bool = False,
    random_state: int = 42,
    cache_batch: int = 32,
    top_k: int = 3,
    **kwargs,
) -> pd.DataFrame:
    """
    Make single-hop QA dataset using existing queries with vector retrieval.
    DEPRECATED.
    """

Import

from autorag.data.legacy.qacreation.base import make_single_content_qa, make_qa_with_existing_qa

I/O Contract

Inputs

Name	Type	Required	Description
corpus_df	pd.DataFrame	Yes	Corpus with doc_id and contents columns
content_size	int	Yes	Number of corpus rows to sample for QA generation
qa_creation_func	Callable	Yes	Backend function that accepts contents list and returns QA pairs
output_filepath	str	No	Path to save intermediate and final parquet file
upsert	bool	No	Whether to overwrite existing output file (default False)
random_state	int	No	Random seed for corpus sampling (default 42)
cache_batch	int	No	Batch size for intermediate saves (default 32)
**kwargs	Any	No	Additional keyword arguments forwarded to qa_creation_func
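The interaction between content_size and cache_batch is easiest to see as arithmetic. The sketch below assumes that batches are contiguous slices of the sampled rows and that one intermediate save follows each batch; batch_ranges is a hypothetical helper, not part of AutoRAG.

```python
import math
from typing import List, Tuple


def batch_ranges(content_size: int, cache_batch: int = 32) -> List[Tuple[int, int]]:
    """Return (start, end) slice bounds for each batch of sampled rows."""
    if content_size <= 0 or cache_batch <= 0:
        raise ValueError("content_size and cache_batch must be positive")
    return [
        (start, min(start + cache_batch, content_size))
        for start in range(0, content_size, cache_batch)
    ]


ranges = batch_ranges(content_size=100, cache_batch=16)
print(len(ranges))  # 7 batches, i.e. ceil(100 / 16)
print(ranges[-1])   # (96, 100): the last batch holds the remaining 4 rows
```

Under these assumptions, a smaller cache_batch means more frequent saves and less lost work if a run is interrupted, at the cost of more backend calls.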

Outputs

Name	Type	Description
qa_data	pd.DataFrame	DataFrame with columns: qid (UUID str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str])
parquet file	File	Saved to output_filepath if provided
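The documented output schema can be checked row by row. The following is a minimal sketch, assuming rows are available as plain dicts (for example via qa_df.to_dict("records")); check_qa_row is a hypothetical helper, not part of AutoRAG.

```python
import uuid
from typing import Any, Dict


def _is_str_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def check_qa_row(row: Dict[str, Any]) -> None:
    """Raise ValueError if a row does not match the documented schema."""
    uuid.UUID(row["qid"])  # raises ValueError on a malformed UUID string
    if not isinstance(row["query"], str):
        raise ValueError("query must be a str")
    retrieval_gt = row["retrieval_gt"]
    if not (isinstance(retrieval_gt, list) and all(_is_str_list(g) for g in retrieval_gt)):
        raise ValueError("retrieval_gt must be List[List[str]]")
    if not _is_str_list(row["generation_gt"]):
        raise ValueError("generation_gt must be List[str]")


# Made-up example row, for illustration only.
example_row = {
    "qid": str(uuid.uuid4()),
    "query": "example question",
    "retrieval_gt": [["doc-1"]],
    "generation_gt": ["example answer"],
}
check_qa_row(example_row)  # passes silently
```

After a parquet round trip, nested list columns may come back as NumPy arrays rather than Python lists, so convert them before applying a strict check like this one.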

Usage Examples

Basic QA Generation with LlamaIndex

from functools import partial

import pandas as pd
from llama_index.llms.openai import OpenAI

from autorag.data.legacy.qacreation.base import make_single_content_qa
from autorag.data.legacy.qacreation.llama_index import generate_qa_llama_index

# 1. Load corpus
corpus_df = pd.read_parquet("./corpus.parquet")

# 2. Create QA generation function with LLM bound
llm = OpenAI(model="gpt-3.5-turbo")
qa_func = partial(generate_qa_llama_index, llm=llm)

# 3. Generate QA dataset
qa_df = make_single_content_qa(
    corpus_df=corpus_df,
    content_size=100,
    qa_creation_func=qa_func,
    output_filepath="./qa.parquet",
    cache_batch=16,
)

print(qa_df.head())
