Workflow: Marker Inc Korea AutoRAG Evaluation Data Creation
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Engineering, NLP |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
End-to-end process for creating evaluation datasets (QA and Corpus) from raw documents using AutoRAG's data creation pipeline.
Description
This workflow covers the complete data preparation lifecycle required before running RAG pipeline optimization. It starts with raw documents (PDFs, text files, etc.) and produces two parquet files: a corpus dataset of chunked passages and a QA dataset of question-answer pairs with retrieval ground truth. The process uses a fluent API with method chaining through three schema classes (Raw, Corpus, QA) that track provenance across transformations. Parsing is YAML-driven with support for multiple backends (LangChain, LlamaParse, Clova OCR); chunking supports both LlamaIndex and LangChain splitters; and QA generation uses LLM-based query generation, answer generation, and quality filtering.
Usage
Execute this workflow when you have a collection of raw documents (PDFs, text files, markdown, JSON, etc.) and need to produce an evaluation dataset suitable for AutoRAG pipeline optimization. The output QA and corpus parquet files are prerequisites for running the RAG optimization workflow. This workflow is also used when you want to create multiple corpus variants (via different chunking strategies) and remap QA ground truth accordingly.
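The steps below compose through the fluent schema API. As a rough pseudocode sketch of the whole chain (method and function names follow the workflow description; exact signatures and import paths should be checked against the AutoRAG documentation):

```
raw    = Raw(parsed_df)                                  # Step 3
corpus = Corpus(chunked_df, raw)                         # Step 3
qa = (corpus.sample(random_single_hop, n=50)             # Step 4: retrieval_gt populated
        .batch_apply(factoid_query_gen, llm=llm)         # Step 5
        .batch_apply(make_basic_gen_gt, llm=llm)         # Step 6
        .batch_apply(make_concise_gen_gt, llm=llm)       # Step 6
        .filter(dontknow_filter_rule_based, lang="en"))  # Step 7
qa.to_parquet("qa.parquet", "corpus.parquet")            # Step 8
```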
Execution Steps
Step 1: Document Parsing
Ingest raw documents from a specified directory and convert them into structured text. The Parser class takes a glob path to source documents and a YAML configuration specifying one or more parsing modules (e.g., pdfminer via LangChain, LlamaParse for cloud-based parsing, Clova OCR for scanned documents, or table_hybrid_parse for mixed content). Each module processes the documents and produces a parsed parquet file containing raw_id, contents, and file metadata.
Key considerations:
- Multiple parsing modules can be configured in the YAML to handle different file formats
- Each parsing module produces its own output parquet file
- The table_hybrid_parse module can route pages with tables to a specialized parser
- LlamaParse supports multimodal parsing for complex documents
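A minimal parse YAML might look like the following (module and method names follow AutoRAG's naming conventions; treat the exact keys and values as assumptions to verify against the current docs):

```yaml
modules:
  - module_type: langchain_parse
    parse_method: pdfminer
  - module_type: llama_parse        # cloud parsing; requires a LlamaParse API key
    result_type: markdown
```

The Parser class is then pointed at the source documents (e.g., a glob like `raw_docs/*.pdf`) and this YAML, and writes one parsed parquet file per configured module.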
Step 2: Document Chunking
Split parsed documents into smaller passages suitable for retrieval. The Chunker class loads a parsed parquet file and applies YAML-configured chunking strategies. Supported methods include LlamaIndex chunkers (Token, Sentence, SentenceWindow, Semantic) and LangChain chunkers (RecursiveCharacter, Character, Konlpy for Korean). Each chunk receives a unique doc_id, content text, and start/end index mapping back to the source document.
Key considerations:
- Chunk size and overlap are critical parameters that affect retrieval quality
- Multiple chunking strategies can be evaluated to find the optimal approach
- The add_file_name parameter prepends the source filename to each chunk for context
- Each chunking configuration produces a separate corpus parquet file
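To make the chunk-size/overlap trade-off and the start/end index mapping concrete, here is a simplified sliding-window chunker in plain Python (not AutoRAG's implementation) that records the character offsets each chunk maps back to in its source document:

```python
import uuid


def chunk_text(text: str, source_id: str, chunk_size: int = 200, overlap: int = 50):
    """Split `text` into overlapping windows, keeping source offsets."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "doc_id": str(uuid.uuid4()),
            "contents": text[start:end],
            # offsets into the parsed source document enable later remapping
            "metadata": {"raw_id": source_id, "start": start, "end": end},
        })
        if end == len(text):
            break
        start += step
    return chunks
```

Because every chunk carries `(raw_id, start, end)`, any chunk's text can be recovered by slicing the source document, which is exactly the property corpus remapping relies on in Step 8.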
Step 3: Corpus and Raw Instance Construction
Load the parsed and chunked parquet files into the fluent API schema objects. Create a Raw instance from the parsed data and a Corpus instance from the chunked data, linking the Corpus to its source Raw. This establishes the provenance chain needed for QA generation and corpus remapping.
What happens:
- Raw DataFrame is wrapped in a Raw schema object
- Corpus DataFrame is wrapped in a Corpus schema object with a linked_raw reference
- The linked_raw enables later corpus remapping via start/end index tracking
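A stripped-down analogue of the schema linkage (AutoRAG's real Raw/Corpus classes wrap pandas DataFrames; this sketch uses plain lists of dicts to show only the provenance chain):

```python
class Raw:
    def __init__(self, rows):
        # rows: list of {"raw_id", "contents", ...} from the parsed parquet
        self.data = rows


class Corpus:
    def __init__(self, rows, linked_raw):
        # rows: list of {"doc_id", "contents", "metadata": {"raw_id", "start", "end"}}
        self.data = rows
        self.linked_raw = linked_raw  # enables later remapping via offsets


raw = Raw([{"raw_id": "r1", "contents": "full parsed document text"}])
corpus = Corpus(
    [{"doc_id": "c1", "contents": "full parsed",
      "metadata": {"raw_id": "r1", "start": 0, "end": 11}}],
    linked_raw=raw,
)
```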
Step 4: Corpus Sampling
Select a subset of corpus passages for QA pair generation. The Corpus.sample() method applies a sampling function (e.g., random_single_hop for single-passage questions, range_single_hop for deterministic sampling) to select passages that will serve as the basis for question generation. This returns a QA instance with retrieval_gt (ground truth) already populated.
Key considerations:
- random_single_hop selects n random passages for single-hop QA
- The number of samples determines the size of the evaluation dataset
- For multi-hop QA, multiple passages per question can be sampled
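A plain-Python analogue of random_single_hop (the real function operates on the Corpus schema object; the output column names follow the workflow description):

```python
import random


def random_single_hop(corpus_rows, n, seed=42):
    """Sample n passages and build QA rows with retrieval ground truth."""
    rng = random.Random(seed)
    sampled = rng.sample(corpus_rows, n)
    return [
        {
            "qid": f"q{i}",
            # single-hop: one ground-truth group containing one gold passage id
            "retrieval_gt": [[row["doc_id"]]],
        }
        for i, row in enumerate(sampled)
    ]
```

For multi-hop sampling, each inner group would instead hold several passage ids that a question must jointly retrieve.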
Step 5: Query Generation
Generate questions from the sampled passages using an LLM. The QA.batch_apply() method processes each sampled passage through a query generation function (e.g., factoid_query_gen for factual questions, concept_completion_query_gen for concept-based questions, or two_hop_incremental for multi-hop reasoning questions). The LLM produces natural language questions that can be answered from the passage content.
Key considerations:
- Different query generation strategies produce different question types
- The LLM (e.g., OpenAI GPT) is passed as a parameter
- batch_apply processes questions in parallel batches for efficiency
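The batching behavior can be sketched as parallel row processing with asyncio (a simplification: AutoRAG's batch_apply works on the QA DataFrame with an LLM client, and the dummy query generator below is purely illustrative, standing in for an LLM call such as factoid_query_gen):

```python
import asyncio


async def apply_in_batches(rows, fn, batch_size=8):
    """Apply async `fn` to each row, running `batch_size` rows concurrently."""
    out = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        out.extend(await asyncio.gather(*(fn(r) for r in batch)))
    return out


async def dummy_query_gen(row):
    # stand-in for an LLM call that writes a question answerable from the passage
    return {**row, "query": f"What does passage {row['doc_id']} say?"}
```

Batching keeps many LLM requests in flight at once while capping concurrency, which is why batch_apply dominates wall-clock time far less than sequential generation would.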
Step 6: Answer Generation
Generate ground truth answers for each question-passage pair. Multiple answer generation passes can be chained (e.g., make_basic_gen_gt for detailed answers, make_concise_gen_gt for brief answers) to create diverse generation ground truth. Each pass appends to the generation_gt column.
Key considerations:
- Multiple answer styles provide richer evaluation signals
- Answers are generated from the passage content, not the LLM's parametric knowledge
- Both LlamaIndex and OpenAI backends are supported
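Chained answer passes each append to the generation_gt column; a minimal illustration with stub generators in place of the LLM-backed make_basic_gen_gt / make_concise_gen_gt:

```python
def append_gen_gt(rows, generate):
    """One answer-generation pass: append generate(row) to each row's generation_gt."""
    for row in rows:
        row.setdefault("generation_gt", []).append(generate(row))
    return rows


def basic_answer(row):    # stub for make_basic_gen_gt (an LLM call in practice)
    return f"Detailed answer grounded in: {row['contents']}"


def concise_answer(row):  # stub for make_concise_gen_gt
    return row["contents"][:20]


rows = [{"contents": "Chunked passage text about AutoRAG."}]
rows = append_gen_gt(rows, basic_answer)   # pass 1: detailed style
rows = append_gen_gt(rows, concise_answer) # pass 2: concise style
```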
Step 7: Quality Filtering
Filter out low-quality QA pairs that would degrade evaluation accuracy. Apply filters such as dontknow_filter_rule_based (removes questions the model cannot answer from the passage), passage_dependency_filter (removes questions answerable without the passage), and LLM-based variants of these filters for higher accuracy.
Key considerations:
- Rule-based filters are faster but less accurate than LLM-based filters
- Filtering is crucial for evaluation dataset quality
- Multiple filters can be chained sequentially
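A rule-based don't-know filter reduces to substring matching against the generated answers. In this sketch the phrase list is an assumption for illustration; AutoRAG's dontknow_filter_rule_based ships its own per-language phrase lists:

```python
DONT_KNOW_PHRASES = ["i don't know", "i do not know"]  # illustrative, not the real list


def dontknow_filter(rows):
    """Drop QA rows where any ground-truth answer admits ignorance."""
    def is_dontknow(row):
        return any(
            phrase in gt.lower()
            for gt in row["generation_gt"]
            for phrase in DONT_KNOW_PHRASES
        )
    return [row for row in rows if not is_dontknow(row)]
```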
Step 8: Export and Optional Corpus Remapping
Save the final QA and corpus datasets to parquet files using QA.to_parquet(). If multiple chunking strategies were used, remap the QA ground truth to each alternative corpus using QA.update_corpus(), which matches passages by start/end index overlap with the original raw document.
Key considerations:
- The to_parquet method saves both qa.parquet and corpus.parquet
- Corpus remapping enables chunking optimization in the RAG pipeline
- Each corpus variant needs its own QA dataset with correctly mapped retrieval_gt
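The index-overlap remapping behind update_corpus can be sketched as follows: for each gold passage in the old corpus, collect the new-corpus chunks that come from the same raw document and whose character ranges overlap it (a simplified plain-Python version, not AutoRAG's implementation):

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def remap_retrieval_gt(qa_rows, old_corpus, new_corpus):
    """Remap retrieval_gt doc_ids from old_corpus to new_corpus by offset overlap."""
    old_by_id = {c["doc_id"]: c for c in old_corpus}
    for row in qa_rows:
        new_gt = []
        for group in row["retrieval_gt"]:
            new_group = []
            for doc_id in group:
                old = old_by_id[doc_id]["metadata"]
                new_group.extend(
                    c["doc_id"] for c in new_corpus
                    if c["metadata"]["raw_id"] == old["raw_id"]
                    and overlaps(old["start"], old["end"],
                                 c["metadata"]["start"], c["metadata"]["end"])
                )
            new_gt.append(new_group)
        row["retrieval_gt"] = new_gt
    return qa_rows
```

This is why the linked_raw provenance from Step 3 matters: without per-chunk offsets into the source document, ground truth could not be transferred between corpora chunked with different strategies.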