Principle:Marker Inc Korea AutoRAG Legacy QA Dataset Creation

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, QA_Generation, RAG, Evaluation
Last Updated 2026-02-08 06:00 GMT

Overview

Technique for generating synthetic QA evaluation datasets from a text corpus using LLM-driven question-answer pair creation with pluggable generation backends.

Description

Legacy QA Dataset Creation is the original approach in AutoRAG for producing evaluation datasets that measure RAG pipeline quality. The core idea is to take a corpus of text passages, sample a subset, and use a language model to generate question-answer pairs where each question can be answered from the source passage. This produces a ground-truth dataset with query, generation_gt (expected answer), and retrieval_gt (source document IDs) that can be used to evaluate retrieval accuracy and generation quality.
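As an illustration of the three fields named above, a single ground-truth record might look like the sketch below. The values, the flat list shape of retrieval_gt, and the is_valid_record helper are assumptions made for this example; they are not taken from the AutoRAG codebase.

```python
# One illustrative record; every value here is made up for the sketch.
record = {
    "query": "Which metric does the passage use to score retrieval?",
    "generation_gt": ["Recall at k"],   # expected answer(s)
    "retrieval_gt": ["doc-001"],        # ID(s) of the source passage (shape assumed)
}


def is_valid_record(rec: dict) -> bool:
    """Hypothetical structural check: all three fields present with plausible types."""
    required = {"query", "generation_gt", "retrieval_gt"}
    return (
        required <= rec.keys()
        and isinstance(rec["query"], str)
        and isinstance(rec["generation_gt"], list)
        and isinstance(rec["retrieval_gt"], list)
    )
```

Retrieval metrics would compare a pipeline's retrieved IDs against retrieval_gt, while generation metrics would compare its answer against generation_gt.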

The legacy pipeline differs from the modern QA schema-based pipeline (which uses Corpus.sample() and QA.batch_apply()) in that it operates directly on DataFrames with a simpler orchestration pattern: sample corpus rows, run a QA creation function in batches, and assemble results. The pipeline supports multiple LLM backends (LlamaIndex, RAGAS, guidance) as pluggable strategies.

Usage

Use this principle when you need to generate evaluation datasets for testing RAG pipeline performance using the legacy data creation API. This is appropriate when working with the older AutoRAG data format or when you need the specific generation strategies provided by the legacy backends (e.g., RAGAS evolution types, guidance-based structured generation, or LlamaIndex ratio-based multi-prompt generation).

For new projects, prefer the modern QA schema-based pipeline (Corpus.sample() followed by QA.batch_apply()), which provides a more composable and extensible API.

Theoretical Basis

The core algorithm follows a three-step pattern:

Step 1: Corpus Sampling

  • Randomly select n passages from the corpus DataFrame
  • Each passage becomes a source document for QA generation
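A minimal sketch of this sampling step with pandas is shown here. The toy corpus and the doc_id and contents column names are assumptions for illustration; the legacy API may use different names or a different sampling call.

```python
import pandas as pd

# Toy corpus; the column names doc_id and contents are assumed for this sketch.
corpus_df = pd.DataFrame(
    {
        "doc_id": [f"doc-{i}" for i in range(10)],
        "contents": [f"Passage text number {i}." for i in range(10)],
    }
)

content_size = 3  # n, the number of passages to sample

# DataFrame.sample draws without replacement by default;
# random_state makes the draw reproducible.
sampled_df = corpus_df.sample(n=content_size, random_state=42).reset_index(drop=True)
```

Fixing random_state is a design choice worth considering for evaluation data, since it lets the same passages be regenerated later.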

Step 2: LLM-Driven QA Generation

  • For each passage, prompt an LLM to generate question-answer pairs
  • The LLM receives the passage text and produces structured output
  • Multiple generation strategies exist:
    • Single-prompt: One prompt template applied to all passages
    • Ratio-based: Multiple prompts distributed by ratio across passages
    • Evolution-based: RAGAS creates simple, multi-context, and reasoning questions
    • Structured: Guidance library constrains output format
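The ratio-based strategy can be illustrated with a short sketch. This is not the LlamaIndex backend's actual code: assign_prompts, the prompt names, and the integer weights are all hypothetical, and the rounding rule (floor each share, then hand leftovers to the earliest prompts) is just one reasonable choice.

```python
import random


def assign_prompts(passages: list, prompt_ratios: dict, seed: int = 0) -> list:
    """Pair every passage with a prompt name, splitting passages by weight."""
    names = list(prompt_ratios)
    total = sum(prompt_ratios.values())
    # Floor each prompt's share of the passages.
    counts = [int(len(passages) * prompt_ratios[name] / total) for name in names]
    # Hand any leftover passages to the earliest prompts, one each.
    for i in range(len(passages) - sum(counts)):
        counts[i % len(counts)] += 1
    labels = [name for name, count in zip(names, counts) for _ in range(count)]
    # Shuffle so prompt type does not correlate with passage order.
    random.Random(seed).shuffle(labels)
    return list(zip(passages, labels))
```

For example, calling assign_prompts on ten passages with weights {"factoid": 7, "reasoning": 3} pairs seven passages with the factoid prompt and three with the reasoning prompt.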

Pseudo-code:

# Abstract algorithm (NOT real implementation)
# Step 1: sample passages
sampled_passages = random_sample(corpus, n=content_size)
# Step 2: generate QA pairs batch by batch
qa_pairs = []
for batch in chunks(sampled_passages, batch_size):
    generated = llm_generate_qa(batch, prompt_template)
    qa_pairs.extend(generated)
# Step 3: retrieval_ground_truth holds the source document IDs for each pair
dataset = assemble_dataframe(qa_pairs, retrieval_ground_truth)
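The batching loop can be made concrete with a stand-in for the LLM call. In this sketch, chunks, fake_llm_generate_qa, and generate_qa are hypothetical helpers, and the passage dictionaries use assumed doc_id and contents keys; a real backend would replace the fake generator with an actual model request.

```python
from typing import Callable, Iterator


def chunks(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_llm_generate_qa(batch: list, prompt_template: str) -> list:
    """Stand-in for an LLM call: returns one canned QA pair per passage.

    prompt_template is accepted but ignored, since no model is invoked.
    """
    return [
        {
            "doc_id": passage["doc_id"],
            "query": f"What does passage {passage['doc_id']} say?",
            "answer": passage["contents"],
        }
        for passage in batch
    ]


def generate_qa(passages: list, generate: Callable, prompt_template: str,
                batch_size: int = 2) -> list:
    """Run the generator over the passages batch by batch and collect results."""
    qa_pairs = []
    for batch in chunks(passages, batch_size):
        qa_pairs.extend(generate(batch, prompt_template))
    return qa_pairs
```

Passing the generator as an argument mirrors the pluggable-backend idea: the loop stays the same while the generation strategy changes.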

Step 3: Dataset Assembly

  • Assign unique qid (UUID) to each QA pair
  • Map retrieval_gt from source document IDs
  • Format generation_gt as list of acceptable answers
  • Save as parquet for downstream evaluation
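The assembly step might be sketched as follows. The input keys (doc_id, query, answer), the assemble_dataset helper, and the flat list shape of retrieval_gt are assumptions for illustration; the legacy implementation may structure these differently.

```python
import uuid

import pandas as pd

# Hypothetical raw output of the generation step.
raw_pairs = [
    {"doc_id": "doc-1", "query": "What is sampled first?", "answer": "Corpus passages"},
    {"doc_id": "doc-2", "query": "What format is saved?", "answer": "Parquet"},
]


def assemble_dataset(pairs: list) -> pd.DataFrame:
    """Build the evaluation DataFrame from raw QA pairs."""
    rows = [
        {
            "qid": str(uuid.uuid4()),            # unique ID per QA pair
            "query": pair["query"],
            "generation_gt": [pair["answer"]],   # list of acceptable answers
            "retrieval_gt": [pair["doc_id"]],    # source document ID(s), shape assumed
        }
        for pair in pairs
    ]
    return pd.DataFrame(rows, columns=["qid", "query", "generation_gt", "retrieval_gt"])


qa_df = assemble_dataset(raw_pairs)
# Saving is left commented out: the path is a placeholder, and
# DataFrame.to_parquet needs pyarrow or fastparquet installed.
# qa_df.to_parquet("qa.parquet")
```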
