Workflow:Marker Inc Korea AutoRAG Data Creation Pipeline

Knowledge Sources	AutoRAG, AutoRAG Docs, AutoRAG Paper
Domains	RAG, Data_Engineering, NLP
Last Updated	2026-02-08 06:00 GMT

Overview

End-to-end process for creating evaluation-ready QA and Corpus datasets from raw documents using AutoRAG's parsing, chunking, and synthetic QA generation pipeline.

Description

This workflow covers the complete data preparation pipeline required before running RAG optimization with AutoRAG. It transforms raw documents (PDF, text, markdown, JSON) into two structured evaluation datasets: a Corpus (chunked passages with their metadata) and a QA dataset (question-answer pairs with retrieval ground truth). Parsing and chunking are driven by YAML configuration; QA generation uses a fluent Python API backed by an LLM. The output is two parquet files, qa.parquet and corpus.parquet, which serve as inputs to the RAG optimization pipeline.

Usage

Execute this workflow when you have a collection of raw documents (PDFs, text files, markdown, etc.) and need to create evaluation datasets for RAG pipeline optimization. AutoRAG cannot run an evaluation trial without these datasets, so this is the mandatory first step in the AutoRAG workflow.

Execution Steps

Step 1: Document Parsing

Configure a YAML file specifying which parsing modules to use (e.g., LangChain pdfminer, LlamaParse, Clova OCR, or table hybrid parser). Initialize the Parser class with the glob pattern pointing to your raw document directory and invoke parsing. The parser reads each document, extracts text content, and produces a parsed parquet file containing raw_id and contents columns.

Key considerations:

Choose the parser module appropriate for your document format (PDF, markdown, JSON)
Multiple parser modules can be specified in the YAML for different document types
The output is a parquet file in the project directory
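
The parsing step can be sketched as follows. The YAML keys (module_type: langchain_parse, parse_method: pdfminer) and the Parser(data_path_glob=..., project_dir=...) and start_parsing(...) calls follow the AutoRAG documentation as best recalled here, so check them against your installed version. The helpers write_config and run_parsing, and every path shown, are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

# Assumed parse config: a single module that parses PDFs with LangChain's pdfminer loader.
PARSE_CONFIG = """\
modules:
  - module_type: langchain_parse
    parse_method: pdfminer
"""


def write_config(text: str, path: Path) -> Path:
    """Write a YAML config string to disk, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def run_parsing(data_glob: str, config_path: Path, project_dir: str) -> None:
    """Parse every file matching data_glob (requires AutoRAG to be installed)."""
    # Imported lazily so the rest of this sketch runs without AutoRAG.
    from autorag.parser import Parser

    parser = Parser(data_path_glob=data_glob, project_dir=project_dir)
    parser.start_parsing(str(config_path))


# Write the config to a scratch directory; the location is a placeholder.
config_path = write_config(PARSE_CONFIG, Path(tempfile.mkdtemp()) / "parse_config.yaml")

# With AutoRAG installed and real documents available, parsing would be started with:
# run_parsing("raw_docs/*.pdf", config_path, "parse_project_dir")
```

Keeping the config in a file rather than in code matches the YAML-driven design described above: swapping the parser module becomes a config change, not a code change.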

Step 2: Document Chunking

Configure a YAML file specifying the chunking strategy (LangChain or LlamaIndex backends) with parameters such as chunk size, overlap, and method (Token, Sentence, etc.). Initialize the Chunker from the parsed parquet output and run chunking. This splits parsed documents into smaller passages, each with a unique doc_id, path, and start/end index metadata.

Key considerations:

Multiple chunk configurations can be evaluated to find the optimal chunking strategy
Each chunk method produces a different corpus with its own doc_id values, so retrieval ground truth must be built against the specific corpus it will be evaluated on
The add_file_name parameter can prepend the source filename to each chunk for context
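
To make chunk size, overlap, and the start/end index metadata concrete, here is a minimal character-window chunker in plain Python. It is a stand-in for illustration only: AutoRAG delegates the actual splitting to the LangChain or LlamaIndex splitter named in the YAML. The function chunk_text, its half-open (start, end) index convention, and the "file_name: ..." prefix format are assumptions of this sketch.

```python
import uuid
from pathlib import Path
from typing import Dict, List


def chunk_text(
    text: str,
    path: str,
    chunk_size: int = 200,
    overlap: int = 20,
    add_file_name: bool = False,
) -> List[Dict[str, object]]:
    """Split text into character windows of chunk_size sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[Dict[str, object]] = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        contents = text[start:end]
        if add_file_name:
            # Assumed prefix format; only the stored contents change, not the indices.
            contents = f"file_name: {Path(path).name}\n contents: {contents}"
        chunks.append(
            {
                "doc_id": str(uuid.uuid4()),  # unique id per chunk
                "contents": contents,
                "path": path,  # source document the chunk came from
                "start_end_idx": (start, end),  # half-open span into the source text
            }
        )
        if end == len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("0123456789", "docs/a.txt", chunk_size=4, overlap=1)
```

Because doc_id values are generated fresh on every run, two chunking configurations never share identifiers, which is why each configuration yields its own corpus.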

Step 3: Corpus Instantiation

Load the chunked parquet file into a Corpus object, linked to the corresponding Raw instance. The Corpus object holds the chunked passages and maintains the relationship back to the original parsed documents. This linkage is essential for QA ground truth generation.

Key considerations:

The Corpus must be linked to its Raw instance for retrieval ground truth mapping
Multiple Corpus instances can be created from the same Raw with different chunk settings
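
The linkage between a Corpus and its Raw instance can be pictured with small stand-in classes. These are not AutoRAG's own Raw and Corpus classes: RawStandIn, Chunk, CorpusStandIn, and source_text are hypothetical names, and the sketch assumes each chunk records its source path and a half-open (start, end) span into the parsed text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RawStandIn:
    """Parsed documents keyed by source path."""

    texts: Dict[str, str]


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    contents: str
    path: str
    start_end_idx: Tuple[int, int]


@dataclass(frozen=True)
class CorpusStandIn:
    """Chunked passages plus a reference to the raw documents they came from."""

    chunks: List[Chunk]
    linked_raw: RawStandIn

    def source_text(self, doc_id: str) -> str:
        """Recover a chunk's text from the raw document it was cut from."""
        chunk = next(c for c in self.chunks if c.doc_id == doc_id)
        start, end = chunk.start_end_idx
        return self.linked_raw.texts[chunk.path][start:end]


raw = RawStandIn(texts={"docs/a.txt": "AutoRAG builds QA data from chunks."})

# Two corpora built from the same raw documents with different chunk settings.
coarse = CorpusStandIn(
    chunks=[Chunk("c0", "AutoRAG builds QA data from chunks.", "docs/a.txt", (0, 35))],
    linked_raw=raw,
)
fine = CorpusStandIn(
    chunks=[
        Chunk("f0", "AutoRAG builds QA", "docs/a.txt", (0, 17)),
        Chunk("f1", "QA data from chunks.", "docs/a.txt", (15, 35)),
    ],
    linked_raw=raw,
)
```

Both corpora point at the same raw object, so any chunk in either one can be traced back to the same span of the original document.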

Step 4: Passage Sampling

Sample passages from the Corpus to determine which passages will serve as the basis for QA pair generation. The sampling function (e.g., random_single_hop) selects passages and creates the initial QA structure with retrieval ground truth (retrieval_gt) that maps questions to their source passages.

Key considerations:

Single-hop sampling selects one passage per question for simple QA
Multi-hop sampling selects multiple passages for complex reasoning questions
The n parameter controls how many QA pairs to generate
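
As a rough illustration of single-hop sampling, the plain-Python sketch below picks n distinct passages and starts one QA row per passage. It is not AutoRAG's random_single_hop implementation: sample_single_hop, the seed argument, and the nested-list layout of retrieval_gt are assumptions made for this example.

```python
import random
import uuid
from typing import Dict, List


def sample_single_hop(
    corpus: List[Dict[str, str]], n: int, seed: int = 42
) -> List[Dict[str, object]]:
    """Pick n distinct passages and start one QA row per passage."""
    if not 0 <= n <= len(corpus):
        raise ValueError("n must be between 0 and the number of passages")
    rng = random.Random(seed)  # local RNG so sampling is reproducible
    picked = rng.sample(corpus, n)
    return [
        {
            "qid": str(uuid.uuid4()),
            # Assumed nested-list layout; single-hop sampling yields exactly
            # one group containing one doc_id.
            "retrieval_gt": [[passage["doc_id"]]],
        }
        for passage in picked
    ]


corpus = [{"doc_id": f"doc-{i}", "contents": f"passage {i}"} for i in range(10)]
qa_rows = sample_single_hop(corpus, n=3)
```

At this point each row has a question id and its source passage but no query or answer yet; those columns are filled in by the following steps.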

Step 5: Query Generation

Use an LLM to generate questions from the sampled passages. The fluent API's batch_apply method sends each passage to the query generation function (e.g., factoid_query_gen), which prompts the LLM to create natural questions that the passage can answer. This produces the query column in the QA dataset.

Key considerations:

Multiple query types are available (factoid, two-hop, concept completion, etc.)
The LLM choice affects question quality and diversity
batch_apply sends requests in concurrent batches; reduce the batch size if you run into your LLM provider's rate limits
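
The batching idea behind this step can be sketched with asyncio. This is only a mimic of the pattern, not AutoRAG's batch_apply: batch_apply_sketch, stub_query_gen, and the "passage" field are hypothetical, and the stub returns a templated question instead of calling an LLM.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Row = Dict[str, str]


async def stub_query_gen(row: Row) -> Row:
    """Hypothetical stand-in for an LLM call that writes a question for a passage."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    topic = row["passage"].split()[0]
    return {**row, "query": f"What does the passage state about {topic}?"}


async def batch_apply_sketch(
    rows: List[Row],
    fn: Callable[[Row], Awaitable[Row]],
    batch_size: int = 2,
) -> List[Row]:
    """Run fn over rows, with at most batch_size calls in flight at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        # gather runs the batch concurrently and returns results in input order.
        results.extend(await asyncio.gather(*(fn(row) for row in batch)))
    return results


rows = [
    {"qid": "q1", "passage": "Chunking splits documents into passages."},
    {"qid": "q2", "passage": "Parsing extracts text from raw files."},
    {"qid": "q3", "passage": "Filtering removes low-quality QA pairs."},
]
qa_rows = asyncio.run(batch_apply_sketch(rows, stub_query_gen, batch_size=2))
```

A smaller batch_size lowers the number of simultaneous requests at the cost of throughput, which is the knob to turn when a provider starts rejecting calls.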

Step 6: Answer Generation

Generate ground truth answers for each query-passage pair using an LLM. Multiple answer generation strategies can be chained (e.g., basic and concise answers) to create multiple ground truth answers per question. This produces the generation_gt column containing reference answers.

Key considerations:

Multiple answer styles can be generated and stored as a list of ground truths
Both basic and concise answer generators are commonly chained together
The quality of answers directly affects evaluation metric reliability
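
Chaining answer generators so that each one appends to generation_gt, rather than overwriting it, might look like the sketch below. The names add_gen_gt, basic_answer, concise_answer, stub_llm, and the prompt wording are all hypothetical; a canned function stands in for the LLM so the example runs offline.

```python
from typing import Callable, Dict

Row = Dict[str, object]
LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def add_gen_gt(row: Row, answer: str) -> Row:
    """Return a copy of row with answer appended to its generation_gt list."""
    existing = list(row.get("generation_gt", []))
    return {**row, "generation_gt": existing + [answer]}


def basic_answer(row: Row, llm: LLM) -> Row:
    """Ask for a full-sentence answer grounded in the passage."""
    prompt = (
        "Answer the question using the passage.\n"
        f"Passage: {row['passage']}\nQuestion: {row['query']}"
    )
    return add_gen_gt(row, llm(prompt))


def concise_answer(row: Row, llm: LLM) -> Row:
    """Ask for the shortest possible answer grounded in the passage."""
    prompt = (
        "Answer in as few words as possible.\n"
        f"Passage: {row['passage']}\nQuestion: {row['query']}"
    )
    return add_gen_gt(row, llm(prompt))


def stub_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Answer in as few words"):
        return "Parquet"
    return "The pipeline exports its datasets as parquet files."


row: Row = {
    "qid": "q1",
    "query": "Which file format does the pipeline export?",
    "passage": "The pipeline exports qa.parquet and corpus.parquet.",
}
# Chain both generators; each one appends to generation_gt instead of overwriting it.
for generate in (basic_answer, concise_answer):
    row = generate(row, stub_llm)
```

Storing both styles gives the evaluation metrics more than one acceptable reference, so a correct but differently phrased answer is less likely to be penalized.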

Step 7: Quality Filtering and Export

Apply filters to remove low-quality QA pairs (e.g., "don't know" answers, passage-dependent questions). Then export the final QA and Corpus datasets as parquet files. These files contain the columns required by AutoRAG's evaluation pipeline: qid, query, retrieval_gt, and generation_gt for QA; doc_id, contents, and metadata for Corpus.

Key considerations:

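The dontknow filter removes QA pairs whose generated answer says the LLM could not answer (e.g., "I don't know")
The passage_dependency filter removes questions that only make sense with the source passage in view (e.g., "What does this passage conclude?"), since a retriever cannot resolve them from the query alone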
Output parquet files must follow AutoRAG's expected schema
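
Steps 3 through 7 can be chained roughly as shown below, followed by a small schema check on the exported files. The AutoRAG import paths and method names follow the AutoRAG data creation tutorial as best recalled here and may differ between versions, so treat them as assumptions to verify. The helpers missing_columns and build_qa_dataset, the choice of OpenAI as the LLM, and all file paths are illustrative.

```python
from typing import Dict, Iterable, List

# Columns named in this section as required by the evaluation pipeline.
REQUIRED_COLUMNS: Dict[str, List[str]] = {
    "qa": ["qid", "query", "retrieval_gt", "generation_gt"],
    "corpus": ["doc_id", "contents", "metadata"],
}


def missing_columns(kind: str, columns: Iterable[str]) -> List[str]:
    """Return the required columns for `kind` that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS[kind] if name not in present]


def build_qa_dataset(
    parsed_path: str, chunked_path: str, qa_out: str, corpus_out: str, n: int = 3
) -> None:
    """Steps 3 to 7 as one fluent chain (needs autorag, pandas, llama-index, an API key)."""
    # Imported lazily so the schema helper above works without these packages.
    import pandas as pd
    from llama_index.llms.openai import OpenAI
    from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
    from autorag.data.qa.generation_gt.llama_index_gen_gt import (
        make_basic_gen_gt,
        make_concise_gen_gt,
    )
    from autorag.data.qa.query.llama_gen_query import factoid_query_gen
    from autorag.data.qa.sample import random_single_hop
    from autorag.data.qa.schema import Corpus, Raw

    llm = OpenAI()  # assumed LLM choice for this sketch
    raw = Raw(pd.read_parquet(parsed_path))
    corpus = Corpus(pd.read_parquet(chunked_path), raw)  # Step 3

    qa = (
        corpus.sample(random_single_hop, n=n)  # Step 4
        .map(lambda df: df.reset_index(drop=True))
        .make_retrieval_gt_contents()
        .batch_apply(factoid_query_gen, llm=llm)  # Step 5
        .batch_apply(make_basic_gen_gt, llm=llm)  # Step 6
        .batch_apply(make_concise_gen_gt, llm=llm)
        .filter(dontknow_filter_rule_based, lang="en")  # Step 7
    )
    qa.to_parquet(qa_out, corpus_out)

    # Sanity-check the exported files against the expected schema.
    for kind, path in (("qa", qa_out), ("corpus", corpus_out)):
        missing = missing_columns(kind, pd.read_parquet(path).columns)
        if missing:
            raise ValueError(f"{path} is missing columns: {missing}")
```

Running the schema check immediately after export catches a missing column before an optimization trial is started on the files.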
