Principle:Marker Inc Korea AutoRAG Legacy QA Dataset Creation

Knowledge Sources	Marker_Inc_Korea_AutoRAG, AutoRAG Legacy Tutorial, RAGAS
Domains	Data_Engineering, QA_Generation, RAG, Evaluation
Last Updated	2026-02-08 06:00 GMT

Overview

Technique for generating synthetic QA evaluation datasets from a text corpus using LLM-driven question-answer pair creation with pluggable generation backends.

Description

Legacy QA Dataset Creation is the original approach in AutoRAG for producing evaluation datasets that measure RAG pipeline quality. The core idea is to take a corpus of text passages, sample a subset, and use a language model to generate question-answer pairs where each question can be answered from the source passage. This produces a ground-truth dataset with query, generation_gt (expected answer), and retrieval_gt (source document IDs) that can be used to evaluate retrieval accuracy and generation quality.

The legacy pipeline differs from the modern QA schema-based pipeline (which uses Corpus.sample() and QA.batch_apply()) in that it operates directly on DataFrames with a simpler orchestration pattern: sample corpus rows, run a QA creation function in batches, and assemble results. The pipeline supports multiple LLM backends (LlamaIndex, RAGAS, guidance) as pluggable strategies.

Usage

Use this principle when you need to generate evaluation datasets for testing RAG pipeline performance using the legacy data creation API. This is appropriate when working with the older AutoRAG data format or when you need the specific generation strategies provided by the legacy backends (e.g., RAGAS evolution types, guidance-based structured generation, or LlamaIndex ratio-based multi-prompt generation).

For new projects, prefer the modern QA schema-based pipeline (Corpus.sample() followed by QA.batch_apply()), which provides a more composable and extensible API.

Theoretical Basis

The core algorithm follows a three-step pattern:

Step 1: Corpus Sampling

Randomly select n passages from the corpus DataFrame
Each passage becomes a source document for QA generation
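
The sampling step can be illustrated with a small pandas sketch. It is not the AutoRAG implementation: the column names doc_id and contents, the helper name sample_passages, and the fixed seed are all assumptions made for this example.

```python
import pandas as pd

# Toy corpus. The column names "doc_id" and "contents" are assumptions
# chosen for illustration; adapt them to the actual corpus schema.
corpus_df = pd.DataFrame(
    {
        "doc_id": [f"doc-{i}" for i in range(10)],
        "contents": [f"Passage number {i}." for i in range(10)],
    }
)


def sample_passages(corpus: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Randomly pick n distinct rows (hypothetical helper).

    A fixed seed keeps the sample reproducible between runs.
    """
    if n > len(corpus):
        raise ValueError("n exceeds the number of passages in the corpus")
    # DataFrame.sample draws without replacement by default.
    return corpus.sample(n=n, random_state=seed).reset_index(drop=True)


sampled = sample_passages(corpus_df, n=3)
```

Sampling without replacement means each selected passage appears once, so every generated QA pair traces back to a distinct source document.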

Step 2: LLM-Driven QA Generation

For each passage, prompt an LLM to generate question-answer pairs
The LLM receives the passage text and produces structured output
Multiple generation strategies exist:
- Single-prompt: One prompt template applied to all passages
- Ratio-based: Multiple prompts distributed by ratio across passages
- Evolution-based: RAGAS creates simple, multi-context, and reasoning questions
- Structured: Guidance library constrains output format
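
To make the ratio-based strategy concrete, here is a minimal sketch of one way to distribute several prompts across passages by ratio. It uses only the standard library, and the function name assign_prompts, the largest-remainder rounding, and the prompt labels are assumptions for illustration rather than the behavior of the LlamaIndex backend.

```python
import random
from collections import Counter


def assign_prompts(passages, prompt_ratios, seed=0):
    """Pair every passage with one prompt so that counts follow the ratios.

    prompt_ratios maps a prompt template (or label) to a relative weight.
    Hypothetical helper, not part of AutoRAG.
    """
    total = sum(prompt_ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    prompts = list(prompt_ratios)
    # Ideal (fractional) number of passages per prompt.
    shares = [len(passages) * prompt_ratios[p] / total for p in prompts]
    # Largest-remainder rounding: floor each share, then give the leftover
    # slots to the prompts with the biggest fractional parts.
    counts = [int(share) for share in shares]
    leftover = len(passages) - sum(counts)
    by_remainder = sorted(
        range(len(prompts)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    assigned = [p for p, c in zip(prompts, counts) for _ in range(c)]
    # Shuffle so prompt types are not clustered by passage order.
    random.Random(seed).shuffle(assigned)
    return list(zip(passages, assigned))


passages = [f"passage-{i}" for i in range(10)]
pairs = assign_prompts(passages, {"factoid": 2, "reasoning": 1})
prompt_counts = Counter(prompt for _, prompt in pairs)
```

With 10 passages and a 2:1 ratio, the ideal shares are about 6.67 and 3.33, so the rounding step assigns 7 passages to the first prompt and 3 to the second.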

Pseudo-code (spanning all three steps):

# Abstract algorithm (NOT real implementation)
sampled_passages = random_sample(corpus, n=content_size)
qa_pairs = []
for batch in chunks(sampled_passages, batch_size):
    generated = llm_generate_qa(batch, prompt_template)
    qa_pairs.extend(generated)
dataset = assemble_dataframe(qa_pairs, retrieval_ground_truth)
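
The batching loop in the pseudo-code can be made runnable with a stand-in generator. This is a sketch under stated assumptions: fake_llm_generate_qa replaces a real LLM call, and the names chunks, generate_in_batches, and the dictionary keys are hypothetical, not AutoRAG APIs.

```python
from typing import Callable, Dict, Iterator, List

Passage = Dict[str, str]
QAPair = Dict[str, str]


def chunks(items: List[Passage], size: int) -> Iterator[List[Passage]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_llm_generate_qa(batch: List[Passage], prompt_template: str) -> List[QAPair]:
    """Stand-in for a real LLM call: returns one canned QA pair per passage."""
    return [
        {
            "doc_id": passage["doc_id"],
            "query": prompt_template.format(text=passage["contents"]),
            "answer": passage["contents"],
        }
        for passage in batch
    ]


def generate_in_batches(
    passages: List[Passage],
    generate: Callable[[List[Passage], str], List[QAPair]],
    prompt_template: str,
    batch_size: int,
) -> List[QAPair]:
    """Run the generator batch by batch and collect every QA pair."""
    qa_pairs: List[QAPair] = []
    for batch in chunks(passages, batch_size):
        qa_pairs.extend(generate(batch, prompt_template))
    return qa_pairs


passages = [{"doc_id": f"doc-{i}", "contents": f"Fact {i}"} for i in range(5)]
qa_pairs = generate_in_batches(
    passages, fake_llm_generate_qa, "What does this say: {text}?", batch_size=2
)
```

Passing the generator as a callable mirrors the pluggable-backend idea: swapping the stub for a real backend would not change the loop.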

Step 3: Dataset Assembly

Assign unique qid (UUID) to each QA pair
Map retrieval_gt from source document IDs
Format generation_gt as list of acceptable answers
Save as parquet for downstream evaluation
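
The assembly step might look like the following sketch. The input keys (query, answer, doc_id), the helper name assemble_dataset, and the nested-list shape used for retrieval_gt are assumptions for illustration; check the target AutoRAG version for the exact schema it expects.

```python
import uuid

import pandas as pd


def assemble_dataset(qa_pairs):
    """Build the evaluation DataFrame from raw QA pairs (hypothetical helper)."""
    rows = []
    for pair in qa_pairs:
        rows.append(
            {
                # A random UUID gives every QA pair a unique identifier.
                "qid": str(uuid.uuid4()),
                "query": pair["query"],
                # Assumed shape: a list of groups of source document IDs.
                "retrieval_gt": [[pair["doc_id"]]],
                # A list, so several acceptable answers can be recorded.
                "generation_gt": [pair["answer"]],
            }
        )
    return pd.DataFrame(
        rows, columns=["qid", "query", "retrieval_gt", "generation_gt"]
    )


qa_df = assemble_dataset(
    [
        {"doc_id": "doc-0", "query": "What is stated in fact 0?", "answer": "Fact 0"},
        {"doc_id": "doc-1", "query": "What is stated in fact 1?", "answer": "Fact 1"},
    ]
)

# Saving requires a parquet engine such as pyarrow; the path is a placeholder.
# qa_df.to_parquet("qa.parquet")
```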
