Implementation:Marker Inc Korea AutoRAG Make Single Content QA
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, QA_Generation, RAG |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Concrete tool, provided by the AutoRAG legacy pipeline, for orchestrating single-content QA dataset creation from a corpus DataFrame.
Description
⚠️ LEGACY/DEPRECATED: This module is in the legacy/ directory and is superseded by the modern QA schema pipeline. See Heuristic:Marker_Inc_Korea_AutoRAG_Warning_Deprecated_Legacy_QA_Creation.
The make_single_content_qa function is the core orchestration layer for the legacy QA dataset creation workflow. It samples rows from a corpus DataFrame, runs a pluggable QA creation function in batches with progress tracking and intermediate caching, and assembles results into a DataFrame with qid, query, retrieval_gt, and generation_gt columns. The companion make_qa_with_existing_qa function (deprecated) uses ChromaDB vector retrieval to find relevant passages for existing queries.
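The sample, batch, and assemble flow described above can be sketched as follows. This is a minimal illustration, not the library's source: plain dict records stand in for DataFrame rows so the example runs without dependencies, `sketch_single_content_qa` and `toy_backend` are hypothetical names, and the backend's return shape (one list of `query`/`generation_gt` dicts per passage) is an assumption.

```python
import random
import uuid
from typing import Callable, Dict, List


def toy_backend(contents: List[str], **kwargs) -> List[List[Dict[str, str]]]:
    """Hypothetical stand-in for an LLM backend: one QA pair per passage."""
    return [
        [{"query": f"What does this passage say? {text[:20]}", "generation_gt": text}]
        for text in contents
    ]


def sketch_single_content_qa(
    corpus: List[Dict[str, str]],
    content_size: int,
    qa_creation_func: Callable,
    random_state: int = 42,
    cache_batch: int = 32,
    **kwargs,
) -> List[Dict]:
    # 1. Sample rows reproducibly (the real function samples a DataFrame).
    sampled = random.Random(random_state).sample(corpus, content_size)
    rows: List[Dict] = []
    # 2. Process the sample in batches of `cache_batch` rows.
    for start in range(0, len(sampled), cache_batch):
        batch = sampled[start : start + cache_batch]
        qa_lists = qa_creation_func(contents=[r["contents"] for r in batch], **kwargs)
        # 3. Attach each QA pair to the doc_id it was generated from.
        for source, qa_list in zip(batch, qa_lists):
            for qa in qa_list:
                rows.append(
                    {
                        "qid": str(uuid.uuid4()),
                        "query": qa["query"],
                        "retrieval_gt": [[source["doc_id"]]],
                        "generation_gt": [qa["generation_gt"]],
                    }
                )
        # 4. An intermediate save of `rows` would go here when a path is given.
    return rows


corpus = [{"doc_id": f"doc-{i}", "contents": f"Passage number {i}."} for i in range(10)]
qa_rows = sketch_single_content_qa(
    corpus, content_size=4, qa_creation_func=toy_backend, cache_batch=2
)
```

Saving after every batch means a failure late in a long LLM run loses at most one batch of generated pairs.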
Usage
Import this module when you need to generate single-hop QA evaluation datasets from a corpus DataFrame using the legacy pipeline. Pass any QA creation backend (LlamaIndex, RAGAS, guidance-based) as the qa_creation_func parameter. This function was the entry point for data creation before the modern QA schema-based pipeline was introduced.
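Because the backend is just a callable, backend-specific settings can either be bound ahead of time or supplied as extra keyword arguments. The sketch below shows that the two routes are equivalent for a simple backend. `template_backend` and its `prefix` parameter are hypothetical, and the return shape is an assumption rather than a documented contract.

```python
from functools import partial
from typing import Dict, List


def template_backend(
    contents: List[str], prefix: str = "Summarize"
) -> List[List[Dict[str, str]]]:
    """Hypothetical backend: builds one templated QA pair per passage."""
    return [[{"query": f"{prefix}: {text}", "generation_gt": text}] for text in contents]


passages = ["AutoRAG evaluates RAG pipelines."]

# Option 1: bind the configuration up front, leaving only `contents` open.
bound_backend = partial(template_backend, prefix="Explain")
via_partial = bound_backend(contents=passages)

# Option 2: leave the backend unbound and pass the setting as a keyword argument,
# which is how extra **kwargs would reach it.
via_kwargs = template_backend(contents=passages, prefix="Explain")
```

Binding with `partial` keeps objects such as an LLM client out of the orchestration call, while keyword arguments are convenient for values that change between runs.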
Code Reference
Source Location
- Repository: Marker_Inc_Korea_AutoRAG
- File: autorag/data/legacy/qacreation/base.py
- Lines: 1-239
Signature
def make_single_content_qa(
    corpus_df: pd.DataFrame,
    content_size: int,
    qa_creation_func: Callable,
    output_filepath: Optional[str] = None,
    upsert: bool = False,
    random_state: int = 42,
    cache_batch: int = 32,
    **kwargs,
) -> pd.DataFrame:
    """
    Make single content (single-hop, single-document) QA dataset using given qa_creation_func.
    :param corpus_df: The corpus dataframe to make QA dataset from.
    :param content_size: Number of contents to generate QA for.
    :param qa_creation_func: Function to create QA pairs (e.g. generate_qa_llama_index).
    :param output_filepath: Optional filepath to save parquet. Directory must exist.
    :param upsert: If true, overwrite existing file.
    :param random_state: Random state for sampling.
    :param cache_batch: Batch size for intermediate caching.
    :param kwargs: Additional keyword arguments for qa_creation_func.
    :return: QA dataset DataFrame.
    """

def make_qa_with_existing_qa(
    corpus_df: pd.DataFrame,
    existing_query_df: pd.DataFrame,
    content_size: int,
    answer_creation_func: Optional[Callable] = None,
    exist_gen_gt: Optional[bool] = False,
    output_filepath: Optional[str] = None,
    embedding_model: str = "openai_embed_3_large",
    collection: Optional[chromadb.Collection] = None,
    upsert: bool = False,
    random_state: int = 42,
    cache_batch: int = 32,
    top_k: int = 3,
    **kwargs,
) -> pd.DataFrame:
    """
    Make single-hop QA dataset using existing queries with vector retrieval.
    DEPRECATED.
    """
Import
from autorag.data.legacy.qacreation.base import make_single_content_qa, make_qa_with_existing_qa
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| corpus_df | pd.DataFrame | Yes | Corpus with doc_id and contents columns |
| content_size | int | Yes | Number of corpus rows to sample for QA generation |
| qa_creation_func | Callable | Yes | Backend function that accepts contents list and returns QA pairs |
| output_filepath | Optional[str] | No | Path of the parquet file used for intermediate and final saves; its directory must already exist (default None) |
| upsert | bool | No | Whether to overwrite existing output file (default False) |
| random_state | int | No | Random seed for corpus sampling (default 42) |
| cache_batch | int | No | Batch size for intermediate saves (default 32) |
| **kwargs | Any | No | Additional keyword arguments forwarded to qa_creation_func |
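Since the output directory must already exist and upsert controls overwriting, it can be worth validating the path before starting an expensive generation run. The following is a caller-side sketch. `check_output_path` is a hypothetical helper, and the exceptions it raises are this example's choice, not a statement of how the library reacts to a bad path.

```python
import os
import tempfile


def check_output_path(output_filepath: str, upsert: bool = False) -> None:
    """Hypothetical pre-flight check to run before a long generation job."""
    parent = os.path.dirname(os.path.abspath(output_filepath))
    if not os.path.isdir(parent):
        raise NotADirectoryError(f"Output directory does not exist: {parent}")
    if os.path.exists(output_filepath) and not upsert:
        raise FileExistsError(f"{output_filepath} exists; pass upsert=True to overwrite")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "qa.parquet")
    check_output_path(target)  # passes: directory exists, file does not
    open(target, "w").close()  # simulate the output of an earlier run
    check_output_path(target, upsert=True)  # passes: overwrite explicitly allowed
    try:
        check_output_path(target)  # existing file without upsert
        blocked = False
    except FileExistsError:
        blocked = True
```

Running the check first avoids discovering a path problem only after the first batch of LLM calls has already been paid for.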
Outputs
| Name | Type | Description |
|---|---|---|
| qa_data | pd.DataFrame | DataFrame with columns: qid (UUID str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str]) |
| parquet file | File | Saved to output_filepath if provided |
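The nesting of retrieval_gt (a list of lists) and generation_gt (a flat list) is easy to get wrong when post-processing the dataset. A small validator for one output row, written against the column types listed above, might look like this. `is_valid_qa_row` is a hypothetical helper and the sample rows are made up for illustration.

```python
import uuid
from typing import Any, Dict


def is_valid_qa_row(row: Dict[str, Any]) -> bool:
    """Check one row against the documented output column types."""
    try:
        uuid.UUID(row["qid"])  # qid must parse as a UUID string
    except (KeyError, ValueError, TypeError, AttributeError):
        return False
    retrieval_gt = row.get("retrieval_gt")
    generation_gt = row.get("generation_gt")
    retrieval_ok = isinstance(retrieval_gt, list) and all(
        isinstance(group, list) and all(isinstance(doc_id, str) for doc_id in group)
        for group in retrieval_gt
    )
    generation_ok = isinstance(generation_gt, list) and all(
        isinstance(answer, str) for answer in generation_gt
    )
    return isinstance(row.get("query"), str) and retrieval_ok and generation_ok


good_row = {
    "qid": str(uuid.uuid4()),
    "query": "What is AutoRAG?",
    "retrieval_gt": [["doc-1"]],  # List[List[str]]
    "generation_gt": ["A RAG evaluation toolkit."],  # List[str]
}
bad_row = dict(good_row, retrieval_gt=["doc-1"])  # flat list: wrong nesting

good_ok = is_valid_qa_row(good_row)
bad_ok = is_valid_qa_row(bad_row)
```

Rows from a DataFrame can be obtained with `qa_df.to_dict("records")`. After a parquet round trip the list columns may come back as NumPy arrays, in which case the `isinstance(..., list)` checks would need to be relaxed.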
Usage Examples
Basic QA Generation with LlamaIndex
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation.base import make_single_content_qa
from autorag.data.legacy.qacreation.llama_index import generate_qa_llama_index
from functools import partial
# 1. Load corpus
corpus_df = pd.read_parquet("./corpus.parquet")
# 2. Create QA generation function with LLM bound
llm = OpenAI(model="gpt-3.5-turbo")
qa_func = partial(generate_qa_llama_index, llm=llm)
# 3. Generate QA dataset
qa_df = make_single_content_qa(
    corpus_df=corpus_df,
    content_size=100,
    qa_creation_func=qa_func,
    output_filepath="./qa.parquet",
    cache_batch=16,
)
print(qa_df.head())