Workflow:Marker Inc Korea AutoRAG Data Creation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Engineering, NLP |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
End-to-end process for creating evaluation-ready QA and Corpus datasets from raw documents using AutoRAG's parsing, chunking, and synthetic QA generation pipeline.
Description
This workflow covers the complete data preparation pipeline required before running RAG optimization with AutoRAG. It transforms raw documents (PDF, text, markdown, JSON) into structured evaluation datasets consisting of a Corpus (chunked passages with their metadata) and a QA dataset (question-answer pairs with retrieval ground truth). Parsing and chunking are driven by YAML configuration; QA generation uses a fluent Python API backed by LLMs. The output is two parquet files (qa.parquet and corpus.parquet) that serve as inputs to the RAG optimization pipeline.
Usage
Execute this workflow when you have a collection of raw documents (PDFs, text files, markdown, etc.) and need evaluation datasets for RAG pipeline optimization. AutoRAG cannot run an evaluation trial without a QA dataset and a Corpus, so unless you already have both in the expected schema, this is the first step in the AutoRAG workflow.
Execution Steps
Step 1: Document Parsing
Configure a YAML file specifying which parsing modules to use (e.g., LangChain pdfminer, LlamaParse, Clova OCR, or table hybrid parser). Initialize the Parser class with a glob pattern that matches your raw documents and a project directory, then invoke parsing with the YAML file. The parser reads each document, extracts its text content, and writes a parsed parquet file whose main columns are texts (the extracted content) and path (the source file).
Key considerations:
- Choose the parser module appropriate for your document format (PDF, markdown, JSON)
- Multiple parser modules can be specified in the YAML for different document types
- The output is a parquet file in the project directory
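The parsing step can be sketched as follows. This is a minimal sketch, not a definitive setup: the YAML keys (`module_type`, `parse_method`) and the `Parser(data_path_glob=..., project_dir=...)` / `start_parsing(...)` calls are assumed from AutoRAG's documentation and may differ between versions (some versions also expect a `file_type` key per module). The helper names `write_parse_config` and `run_parsing` are hypothetical.

```python
from pathlib import Path

# Assumed layout of an AutoRAG parse config; verify the module and method
# names against the documentation for your installed version.
PARSE_CONFIG = """\
modules:
  - module_type: langchain_parse
    parse_method: pdfminer
"""


def write_parse_config(directory: str) -> Path:
    """Write the parse config into `directory` and return its path."""
    config_path = Path(directory) / "parse_config.yaml"
    config_path.write_text(PARSE_CONFIG, encoding="utf-8")
    return config_path


def run_parsing(data_path_glob: str, project_dir: str, config_path: str) -> None:
    """Parse every file matched by the glob and store the result in project_dir."""
    # Imported lazily so the helper above works without AutoRAG installed.
    from autorag.parser import Parser

    parser = Parser(data_path_glob=data_path_glob, project_dir=project_dir)
    parser.start_parsing(config_path)
```

A call such as `run_parsing("raw_docs/*.pdf", "parse_project", str(write_parse_config(".")))` would then parse all PDFs under a (hypothetical) `raw_docs` directory.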
Step 2: Document Chunking
Configure a YAML file specifying chunking strategy (LangChain or LlamaIndex backends) with parameters like chunk size, overlap, and method (Token, Sentence, etc.). Initialize the Chunker from the parsed parquet output and run chunking. This splits parsed documents into smaller passages, each with a unique doc_id, path, and start/end index metadata.
Key considerations:
- Multiple chunk configurations can be evaluated to find the optimal chunking strategy
- Each chunking configuration produces a different corpus with its own doc_ids, so retrieval ground truth built against one corpus does not carry over unchanged to another
- The add_file_name parameter can prepend the source filename to each chunk for context
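To make the chunk size, overlap, and start/end index metadata concrete, here is a small character-based sketch. It is not AutoRAG's chunker (which delegates to LangChain or LlamaIndex and usually splits by tokens or sentences); the function name `chunk_text`, the half-open index convention, and the file-name prefix format are assumptions made for illustration.

```python
import uuid
from pathlib import Path
from typing import Dict, List


def chunk_text(text: str, path: str, chunk_size: int = 200, overlap: int = 50,
               add_file_name: bool = False) -> List[Dict]:
    """Split text into overlapping character windows with chunk metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[Dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        contents = text[start:end]
        if add_file_name:
            # Hypothetical prefix format; it only illustrates the idea.
            contents = f"file_name: {Path(path).name}\ncontents: {contents}"
        chunks.append({
            "doc_id": str(uuid.uuid4()),     # unique id per chunk
            "contents": contents,
            "path": path,                    # source document
            "start_end_idx": (start, end),   # half-open range into the source text
        })
        if end == len(text):
            break
        start += step
    return chunks
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` yields the windows `abcd`, `defg`, and `ghij`: each chunk repeats the last character of the previous one.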
Step 3: Corpus Instantiation
Load the chunked parquet file into a Corpus object, linked to the corresponding Raw instance. The Corpus object holds the chunked passages and maintains the relationship back to the original parsed documents. This linkage is essential for QA ground truth generation.
Key considerations:
- The Corpus must be linked to its Raw instance for retrieval ground truth mapping
- Multiple Corpus instances can be created from the same Raw with different chunk settings
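A minimal sketch of the instantiation step, assuming the `Raw` and `Corpus` classes are importable from `autorag.data.qa.schema` and that `Corpus` accepts the chunked DataFrame followed by the `Raw` instance, as in AutoRAG's documentation. The helper name `load_corpus` and the up-front file check are additions for illustration.

```python
from pathlib import Path


def load_corpus(parsed_path: str, chunked_path: str):
    """Build a Corpus linked to the Raw instance it was chunked from."""
    for p in (parsed_path, chunked_path):
        if not Path(p).is_file():
            raise FileNotFoundError(f"parquet file not found: {p}")

    # Imported lazily so the path check works without pandas/AutoRAG installed.
    import pandas as pd
    from autorag.data.qa.schema import Corpus, Raw

    raw = Raw(pd.read_parquet(parsed_path))             # parsed documents
    return Corpus(pd.read_parquet(chunked_path), raw)   # chunks linked to raw
```

Calling `load_corpus` once per chunked parquet file, with the same parsed file each time, gives one Corpus per chunk setting while the underlying parsed documents stay the same.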
Step 4: Passage Sampling
Sample passages from the Corpus to serve as the basis for QA pair generation. The sampling function (e.g., random_single_hop) selects passages and creates the initial QA structure, in which the retrieval ground truth (retrieval_gt) maps each future question to its source passage or passages.
Key considerations:
- Single-hop sampling selects one passage per question for simple QA
- Multi-hop sampling selects multiple passages for complex reasoning questions
- The n parameter sets how many passages are sampled, and therefore how many QA pairs are generated before filtering
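The idea behind single-hop sampling can be sketched in plain Python. This is an illustration rather than AutoRAG's `random_single_hop` itself: the function name `sample_single_hop`, the `seed` parameter, and the nested-list shape of `retrieval_gt` are assumptions based on the schema described in this workflow.

```python
import random
import uuid
from typing import Dict, List, Optional


def sample_single_hop(corpus: List[Dict], n: int,
                      seed: Optional[int] = None) -> List[Dict]:
    """Pick n distinct passages and start one QA row per passage."""
    if n > len(corpus):
        raise ValueError("n cannot exceed the number of passages")
    rng = random.Random(seed)
    picked = rng.sample(corpus, n)  # sampling without replacement
    return [
        {
            "qid": str(uuid.uuid4()),
            # One source passage per question, stored as a nested list.
            "retrieval_gt": [[passage["doc_id"]]],
        }
        for passage in picked
    ]
```

Each row has no `query` yet; the later generation steps fill in the question and the reference answers for the passage referenced in `retrieval_gt`.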
Step 5: Query Generation
Use an LLM to generate questions from the sampled passages. The fluent API's batch_apply method sends each passage to the query generation function (e.g., factoid_query_gen) which prompts the LLM to create natural questions that the passage can answer. This produces the query column in the QA dataset.
Key considerations:
- Multiple query types are available (factoid, two-hop, concept completion, etc.)
- The LLM choice affects question quality and diversity
- Batch processing issues the LLM calls concurrently in groups; a smaller batch size helps stay within provider rate limits
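The batching idea can be illustrated with a small asyncio sketch. It is a simplified stand-in, not AutoRAG's `batch_apply`: the function here operates on a plain list of dicts, and `fake_query_gen` is a hypothetical placeholder for a real LLM-backed generator such as factoid_query_gen.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

RowFn = Callable[[Dict], Awaitable[Dict]]


async def fake_query_gen(row: Dict) -> Dict:
    """Placeholder for an LLM call that writes a question for a passage."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {**row, "query": f"Question about: {row['contents']}"}


async def _run_in_batches(rows: List[Dict], fn: RowFn, batch_size: int) -> List[Dict]:
    results: List[Dict] = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Calls within a batch run concurrently; batches run one after another.
        results.extend(await asyncio.gather(*(fn(row) for row in batch)))
    return results


def batch_apply(rows: List[Dict], fn: RowFn, batch_size: int = 32) -> List[Dict]:
    """Apply an async row function to every row, batch_size rows at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return asyncio.run(_run_in_batches(rows, fn, batch_size))
```

Because at most `batch_size` requests are in flight at once, lowering that value reduces the request rate at the cost of longer total run time.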
Step 6: Answer Generation
Generate ground truth answers for each query-passage pair using an LLM. Multiple answer generation strategies can be chained (e.g., basic and concise answers) to create multiple ground truth answers per question. This produces the generation_gt column containing reference answers.
Key considerations:
- Multiple answer styles can be generated and stored as a list of ground truths
- Both basic and concise answer generators are commonly chained together
- The quality of answers directly affects evaluation metric reliability
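Chaining answer generators amounts to appending each new answer to the same generation_gt list. The sketch below shows that accumulation with two hypothetical, rule-based stand-ins (`basic_answer`, `concise_answer`) in place of LLM calls; the row keys `query` and `passage` are also assumptions for illustration.

```python
from typing import Callable, Dict

AnswerFn = Callable[[str, str], str]


def add_answer(row: Dict, answer_fn: AnswerFn) -> Dict:
    """Return a copy of the row with one more reference answer appended."""
    answer = answer_fn(row["query"], row["passage"])
    return {**row, "generation_gt": [*row.get("generation_gt", []), answer]}


def basic_answer(query: str, passage: str) -> str:
    """Stand-in for a full-sentence LLM answer."""
    return f"According to the passage, {passage}"


def concise_answer(query: str, passage: str) -> str:
    """Stand-in for a short LLM answer: keep only the first sentence."""
    return passage.split(".")[0]


row = {"query": "Where is the office?",
       "passage": "The office is in Seoul. It opened in 2020."}
row = add_answer(add_answer(row, basic_answer), concise_answer)
```

After both calls, `row["generation_gt"]` holds two reference answers for the same question, so an evaluation metric can compare a generated answer against either style.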
Step 7: Quality Filtering and Export
Apply filters to remove low-quality QA pairs (e.g., "don't know" answers, passage-dependent questions). Then export the final QA and Corpus datasets as parquet files. These files contain the columns required by AutoRAG's evaluation pipeline: qid, query, retrieval_gt, and generation_gt for QA; doc_id, contents, and metadata for Corpus.
Key considerations:
- The dontknow filter removes QA pairs whose generated answer states that the LLM could not derive an answer from the passage
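- The passage_dependency filter removes questions that only make sense with the source passage in view (e.g., "What does this passage conclude?"), because such questions cannot stand alone as retrieval queries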
- Output parquet files must follow AutoRAG's expected schema
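A rule-based "don't know" filter and a schema check before export can be sketched as follows. The phrase list and helper names are hypothetical and deliberately small; AutoRAG's own filters may use different rules (or an LLM), and only the column names are taken from the schema listed above.

```python
from typing import Dict, Iterable, List, Set

QA_COLUMNS = {"qid", "query", "retrieval_gt", "generation_gt"}
CORPUS_COLUMNS = {"doc_id", "contents", "metadata"}

# Hypothetical phrase list; extend it for your language and LLM.
DONT_KNOW_PHRASES = ("i don't know", "i do not know", "cannot be determined")


def is_dont_know(answer: str) -> bool:
    """True if the answer contains one of the refusal phrases."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in DONT_KNOW_PHRASES)


def drop_dont_know(rows: List[Dict]) -> List[Dict]:
    """Keep only rows where none of the reference answers is a refusal."""
    return [
        row for row in rows
        if not any(is_dont_know(answer) for answer in row["generation_gt"])
    ]


def missing_columns(columns: Iterable[str], required: Set[str]) -> List[str]:
    """List the required columns that are absent, in sorted order."""
    return sorted(required - set(columns))
```

Running `missing_columns` on the QA and Corpus column names just before writing the parquet files is a cheap way to catch a schema mismatch before an evaluation trial fails on it.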