Workflow: Marker Inc Korea AutoRAG Evaluation Data Creation
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Engineering, NLP |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
End-to-end process for creating evaluation datasets (QA and Corpus) from raw documents using AutoRAG's data creation pipeline.
Description
This workflow covers the complete data preparation lifecycle required before running RAG pipeline optimization. It starts with raw documents (PDFs, text files, etc.) and produces two parquet files: a corpus dataset of chunked passages and a QA dataset of question-answer pairs with retrieval ground truth. The process uses a fluent API with method chaining through three schema classes (Raw, Corpus, QA) that track provenance across transformations. Parsing is YAML-driven with support for multiple backends (LangChain, LlamaParse, Clova OCR); chunking supports both LlamaIndex and LangChain splitters; and QA generation uses LLM-based query generation, answer generation, and quality filtering.
Usage
Execute this workflow when you have a collection of raw documents (PDFs, text files, markdown, JSON, etc.) and need to produce an evaluation dataset suitable for AutoRAG pipeline optimization. The output QA and corpus parquet files are prerequisites for running the RAG optimization workflow. This workflow is also used when you want to create multiple corpus variants (via different chunking strategies) and remap QA ground truth accordingly.
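The steps below compose through the fluent schema API. As a rough pseudocode sketch of the whole chain (method and function names follow the workflow description; exact signatures and import paths should be checked against the AutoRAG documentation):

```
raw    = Raw(parsed_df)                                  # Step 3
corpus = Corpus(chunked_df, raw)                         # Step 3
qa = (corpus.sample(random_single_hop, n=50)             # Step 4: retrieval_gt populated
        .batch_apply(factoid_query_gen, llm=llm)         # Step 5
        .batch_apply(make_basic_gen_gt, llm=llm)         # Step 6
        .batch_apply(make_concise_gen_gt, llm=llm)       # Step 6
        .filter(dontknow_filter_rule_based, lang="en"))  # Step 7
qa.to_parquet("qa.parquet", "corpus.parquet")            # Step 8
```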
Execution Steps
Step 1: Document Parsing
Ingest raw documents from a specified directory and convert them into structured text. The Parser class takes a glob path to source documents and a YAML configuration specifying one or more parsing modules (e.g., pdfminer via LangChain, LlamaParse for cloud-based parsing, Clova OCR for scanned documents, or table_hybrid_parse for mixed content). Each module processes the documents and produces a parsed parquet file containing raw_id, contents, and file metadata.
Key considerations:
- Multiple parsing modules can be configured in the YAML to handle different file formats
- Each parsing module produces its own output parquet file
- The table_hybrid_parse module can route pages with tables to a specialized parser
- LlamaParse supports multimodal parsing for complex documents
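A minimal parse YAML might look like the following (module and method names follow AutoRAG's naming conventions; treat the exact keys and values as assumptions to verify against the current docs):

```yaml
modules:
  - module_type: langchain_parse
    parse_method: pdfminer
  - module_type: llama_parse        # cloud parsing; requires a LlamaParse API key
    result_type: markdown
```

The Parser class is then pointed at the source documents (e.g., a glob like `raw_docs/*.pdf`) and this YAML, and writes one parsed parquet file per configured module.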
Step 2: Document Chunking
Split parsed documents into smaller passages suitable for retrieval. The Chunker class loads a parsed parquet file and applies YAML-configured chunking strategies. Supported methods include LlamaIndex chunkers (Token, Sentence, SentenceWindow, Semantic) and LangChain chunkers (RecursiveCharacter, Character, Konlpy for Korean). Each chunk receives a unique doc_id, content text, and start/end index mapping back to the source document.
Key considerations:
- Chunk size and overlap are critical parameters that affect retrieval quality
- Multiple chunking strategies can be evaluated to find the optimal approach
- The add_file_name parameter prepends the source filename to each chunk for context
- Each chunking configuration produces a separate corpus parquet file
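To make the chunk-size/overlap trade-off and the start/end index mapping concrete, here is a simplified sliding-window chunker in plain Python (not AutoRAG's implementation) that records the character offsets each chunk maps back to in its source document:

```python
import uuid


def chunk_text(text: str, source_id: str, chunk_size: int = 200, overlap: int = 50):
    """Split `text` into overlapping windows, keeping source offsets."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "doc_id": str(uuid.uuid4()),
            "contents": text[start:end],
            # offsets into the parsed source document enable later remapping
            "metadata": {"raw_id": source_id, "start": start, "end": end},
        })
        if end == len(text):
            break
        start += step
    return chunks
```

Because every chunk carries `(raw_id, start, end)`, any chunk's text can be recovered by slicing the source document, which is exactly the property corpus remapping relies on in Step 8.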
Step 3: Corpus and Raw Instance Construction
Load the parsed and chunked parquet files into the fluent API schema objects. Create a Raw instance from the parsed data and a Corpus instance from the chunked data, linking the Corpus to its source Raw. This establishes the provenance chain needed for QA generation and corpus remapping.
What happens:
- Raw DataFrame is wrapped in a Raw schema object
- Corpus DataFrame is wrapped in a Corpus schema object with a linked_raw reference
- The linked_raw enables later corpus remapping via start/end index tracking
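A stripped-down analogue of the schema linkage (AutoRAG's real Raw/Corpus classes wrap pandas DataFrames; this sketch uses plain lists of dicts to show only the provenance chain):

```python
class Raw:
    def __init__(self, rows):
        # rows: list of {"raw_id", "contents", ...} from the parsed parquet
        self.data = rows


class Corpus:
    def __init__(self, rows, linked_raw):
        # rows: list of {"doc_id", "contents", "metadata": {"raw_id", "start", "end"}}
        self.data = rows
        self.linked_raw = linked_raw  # enables later remapping via offsets


raw = Raw([{"raw_id": "r1", "contents": "full parsed document text"}])
corpus = Corpus(
    [{"doc_id": "c1", "contents": "full parsed",
      "metadata": {"raw_id": "r1", "start": 0, "end": 11}}],
    linked_raw=raw,
)
```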
Step 4: Corpus Sampling
Select a subset of corpus passages for QA pair generation. The Corpus.sample() method applies a sampling function (e.g., random_single_hop for single-passage questions, range_single_hop for deterministic sampling) to select passages that will serve as the basis for question generation. This returns a QA instance with retrieval_gt (ground truth) already populated.
Key considerations:
- random_single_hop selects n random passages for single-hop QA
- The number of samples determines the size of the evaluation dataset
- For multi-hop QA, multiple passages per question can be sampled
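A plain-Python analogue of random_single_hop (the real function operates on the Corpus schema object; the output column names follow the workflow description):

```python
import random


def random_single_hop(corpus_rows, n, seed=42):
    """Sample n passages and build QA rows with retrieval ground truth."""
    rng = random.Random(seed)
    sampled = rng.sample(corpus_rows, n)
    return [
        {
            "qid": f"q{i}",
            # single-hop: one ground-truth group containing one gold passage id
            "retrieval_gt": [[row["doc_id"]]],
        }
        for i, row in enumerate(sampled)
    ]
```

For multi-hop sampling, each inner group would instead hold several passage ids that a question must jointly retrieve.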
Step 5: Query Generation
Generate questions from the sampled passages using an LLM. The QA.batch_apply() method processes each sampled passage through a query generation function (e.g., factoid_query_gen for factual questions, concept_completion_query_gen for concept-based questions, or two_hop_incremental for multi-hop reasoning questions). The LLM produces natural language questions that can be answered from the passage content.
Key considerations:
- Different query generation strategies produce different question types
- The LLM (e.g., OpenAI GPT) is passed as a parameter
- batch_apply processes questions in parallel batches for efficiency
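The batching behavior can be sketched as parallel row processing with asyncio (a simplification: AutoRAG's batch_apply works on the QA DataFrame with an LLM client, and the dummy query generator below is purely illustrative, standing in for an LLM call such as factoid_query_gen):

```python
import asyncio


async def apply_in_batches(rows, fn, batch_size=8):
    """Apply async `fn` to each row, running `batch_size` rows concurrently."""
    out = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        out.extend(await asyncio.gather(*(fn(r) for r in batch)))
    return out


async def dummy_query_gen(row):
    # stand-in for an LLM call that writes a question answerable from the passage
    return {**row, "query": f"What does passage {row['doc_id']} say?"}
```

Batching keeps many LLM requests in flight at once while capping concurrency, which is why batch_apply dominates wall-clock time far less than sequential generation would.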
Step 6: Answer Generation
Generate ground truth answers for each question-passage pair. Multiple answer generation passes can be chained (e.g., make_basic_gen_gt for detailed answers, make_concise_gen_gt for brief answers) to create diverse generation ground truth. Each pass appends to the generation_gt column.
Key considerations:
- Multiple answer styles provide richer evaluation signals
- Answers are generated from the passage content, not the LLM's parametric knowledge
- Both LlamaIndex and OpenAI backends are supported
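Chained answer passes each append to the generation_gt column; a minimal illustration with stub generators in place of the LLM-backed make_basic_gen_gt / make_concise_gen_gt:

```python
def append_gen_gt(rows, generate):
    """One answer-generation pass: append generate(row) to each row's generation_gt."""
    for row in rows:
        row.setdefault("generation_gt", []).append(generate(row))
    return rows


def basic_answer(row):    # stub for make_basic_gen_gt (an LLM call in practice)
    return f"Detailed answer grounded in: {row['contents']}"


def concise_answer(row):  # stub for make_concise_gen_gt
    return row["contents"][:20]


rows = [{"contents": "Chunked passage text about AutoRAG."}]
rows = append_gen_gt(rows, basic_answer)   # pass 1: detailed style
rows = append_gen_gt(rows, concise_answer) # pass 2: concise style
```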
Step 7: Quality Filtering
Filter out low-quality QA pairs that would degrade evaluation accuracy. Apply filters such as dontknow_filter_rule_based (removes questions the model cannot answer from the passage), passage_dependency_filter (removes questions answerable without the passage), and LLM-based variants of these filters for higher accuracy.
Key considerations:
- Rule-based filters are faster but less accurate than LLM-based filters
- Filtering is crucial for evaluation dataset quality
- Multiple filters can be chained sequentially
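A rule-based don't-know filter reduces to substring matching against the generated answers. In this sketch the phrase list is an assumption for illustration; AutoRAG's dontknow_filter_rule_based ships its own per-language phrase lists:

```python
DONT_KNOW_PHRASES = ["i don't know", "i do not know"]  # illustrative, not the real list


def dontknow_filter(rows):
    """Drop QA rows where any ground-truth answer admits ignorance."""
    def is_dontknow(row):
        return any(
            phrase in gt.lower()
            for gt in row["generation_gt"]
            for phrase in DONT_KNOW_PHRASES
        )
    return [row for row in rows if not is_dontknow(row)]
```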
Step 8: Export and Optional Corpus Remapping
Save the final QA and corpus datasets to parquet files using QA.to_parquet(). If multiple chunking strategies were used, remap the QA ground truth to each alternative corpus using QA.update_corpus(), which matches passages by start/end index overlap with the original raw document.
Key considerations:
- The to_parquet method saves both qa.parquet and corpus.parquet
- Corpus remapping enables chunking optimization in the RAG pipeline
- Each corpus variant needs its own QA dataset with correctly mapped retrieval_gt
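The index-overlap remapping behind update_corpus can be sketched as follows: for each gold passage in the old corpus, collect the new-corpus chunks that come from the same raw document and whose character ranges overlap it (a simplified plain-Python version, not AutoRAG's implementation):

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def remap_retrieval_gt(qa_rows, old_corpus, new_corpus):
    """Remap retrieval_gt doc_ids from old_corpus to new_corpus by offset overlap."""
    old_by_id = {c["doc_id"]: c for c in old_corpus}
    for row in qa_rows:
        new_gt = []
        for group in row["retrieval_gt"]:
            new_group = []
            for doc_id in group:
                old = old_by_id[doc_id]["metadata"]
                new_group.extend(
                    c["doc_id"] for c in new_corpus
                    if c["metadata"]["raw_id"] == old["raw_id"]
                    and overlaps(old["start"], old["end"],
                                 c["metadata"]["start"], c["metadata"]["end"])
                )
            new_gt.append(new_group)
        row["retrieval_gt"] = new_gt
    return qa_rows
```

This is why the linked_raw provenance from Step 3 matters: without per-chunk offsets into the source document, ground truth could not be transferred between corpora chunked with different strategies.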