Implementation:Marker Inc Korea AutoRAG QA To Parquet

Knowledge Sources	AutoRAG
Domains	Data Serialization, Information Retrieval, Evaluation Methodology
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, for exporting a QA dataset and its linked corpus to AutoRAG-compatible parquet files and for remapping retrieval ground truths to a new corpus.

Description

The QA.to_parquet() method serializes the evaluation QA dataset and its linked corpus into two Apache Parquet files. The QA file contains only the four columns required by the AutoRAG evaluation engine: qid, query, retrieval_gt, and generation_gt. The corpus file is saved via Corpus.to_parquet(). Both save paths must end with ".parquet"; otherwise a ValueError is raised.
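The path rule and the column selection described above can be illustrated without AutoRAG itself. This is a minimal sketch, not the library's code: check_parquet_path and project_qa_row are hypothetical helper names, and a row is modeled as a plain dictionary rather than a pandas DataFrame row.

```python
from typing import Any, Dict

# The four columns the QA parquet file is described as containing.
QA_COLUMNS = ("qid", "query", "retrieval_gt", "generation_gt")


def check_parquet_path(path: str) -> str:
    """Hypothetical helper: reject save paths that do not end with '.parquet'."""
    if not path.endswith(".parquet"):
        raise ValueError(f"Save path must end with '.parquet', got: {path!r}")
    return path


def project_qa_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the four exported QA columns."""
    return {name: row[name] for name in QA_COLUMNS}


row = {
    "qid": "q-1",
    "query": "What is chunking?",
    "retrieval_gt": [["doc-a"]],
    "generation_gt": ["Splitting documents into passages."],
    "notes": "extra working column",  # illustrative extra column, not exported
}
exported = project_qa_row(row)
```

Validating the path before any write means a bad file name fails early, before either file is produced.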

The QA.update_corpus() method performs corpus remapping. Given a new Corpus instance (created by re-chunking the same raw documents with different parameters), it remaps every retrieval ground truth entry from the old corpus to the new one. The remapping algorithm works by:

1. Extracting the evidence path, page, and start/end character indices from each old ground-truth passage.
2. Building a lookup dictionary that groups the new corpus passages by source file path.
3. For each evidence entry, finding all passages in the new corpus that share the same path (and optionally the same page) and whose character index ranges overlap with the original evidence.
4. Collecting the matching new document IDs as the updated retrieval ground truth.
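The remapping procedure listed above can be sketched in plain Python. This is an illustration under stated assumptions, not AutoRAG's implementation: Passage, ranges_match, and remap are hypothetical names, the page is modeled as an optional field, the original evidence is treated as the "target" range and each new passage as the "destination" range, and a single flat group of evidence passages is handled rather than the nested retrieval_gt structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Passage:
    """Hypothetical stand-in for one corpus row."""
    doc_id: str
    path: str
    start_end_idx: Tuple[int, int]
    page: Optional[int] = None  # None means "no page information"


def ranges_match(target: Tuple[int, int], dst: Tuple[int, int]) -> bool:
    # True when either endpoint of the target range lies inside the
    # destination range (inclusive bounds).
    t_start, t_end = target
    d_start, d_end = dst
    return d_start <= t_start <= d_end or d_start <= t_end <= d_end


def remap(evidence: List[Passage], new_corpus: List[Passage]) -> List[str]:
    # Group the new passages by source file path.
    by_path: Dict[str, List[Passage]] = defaultdict(list)
    for passage in new_corpus:
        by_path[passage.path].append(passage)

    new_ids: List[str] = []
    for old in evidence:
        for cand in by_path.get(old.path, []):
            # Compare pages only when the old evidence carries one.
            if old.page is not None and cand.page != old.page:
                continue
            if ranges_match(old.start_end_idx, cand.start_end_idx):
                new_ids.append(cand.doc_id)
    return new_ids


new_corpus = [
    Passage("big-0", "a.pdf", (0, 999)),
    Passage("big-1", "a.pdf", (900, 1899)),
    Passage("big-x", "b.pdf", (0, 999)),
]
evidence = [Passage("small-3", "a.pdf", (950, 1200))]
remapped_ids = remap(evidence, new_corpus)
```

Grouping by path first keeps the inner loop limited to passages from the same source file, so unrelated documents are never compared.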

The overlap check uses an index matching function that returns True if either endpoint of the target range falls within the destination range: (dst_start <= target_start <= dst_end) or (dst_start <= target_end <= dst_end).
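To make the predicate concrete, the sketch below transcribes the expression into a small function and evaluates it on four range pairs. The function name match_index and the sample ranges are illustrative assumptions; only the boolean expression comes from the description above.

```python
from typing import Tuple


def match_index(target: Tuple[int, int], dst: Tuple[int, int]) -> bool:
    """Endpoint-in-range test, transcribed from the expression in the text."""
    target_start, target_end = target
    dst_start, dst_end = dst
    return (dst_start <= target_start <= dst_end) or (
        dst_start <= target_end <= dst_end
    )


# Target starts inside the destination range.
partial = match_index((90, 150), (0, 100))
# Target lies entirely inside the destination range.
inside = match_index((10, 20), (0, 100))
# The two ranges are disjoint.
disjoint = match_index((200, 300), (0, 100))
# Target strictly contains the destination: neither endpoint is inside it.
contains = match_index((0, 100), (10, 20))
```

Taken literally, the expression does not report a match when the target range strictly contains the destination range (the last case). That may matter when remapping from larger chunks to smaller ones; whether the library compensates for this elsewhere is not stated here.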

Usage

Use to_parquet() as the final step of the evaluation data creation pipeline to produce the files consumed by the AutoRAG evaluation engine. Use update_corpus() when you want to evaluate the same QA pairs against a differently chunked version of the same documents.

Code Reference

Source Location

Repository: AutoRAG
File: autorag/data/qa/schema.py (lines 175-252)

Signature

class QA:
    def to_parquet(self, qa_save_path: str, corpus_save_path: str):
        ...

    def update_corpus(self, new_corpus: Corpus) -> "QA":
        ...

Import

from autorag.data.qa.schema import QA

I/O Contract

Inputs

Name	Type	Required	Description
qa_save_path	str	yes (to_parquet)	File path for saving the QA parquet file. Must end with ".parquet".
corpus_save_path	str	yes (to_parquet)	File path for saving the corpus parquet file. Must end with ".parquet".
new_corpus	Corpus	yes (update_corpus)	A new Corpus instance created from the same Raw data with different chunking parameters. Must have valid linked_raw and columns doc_id, path, start_end_idx, metadata.

Outputs

Name	Type	Description
QA parquet file	File (parquet)	Parquet file containing columns: qid (str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str])
Corpus parquet file	File (parquet)	Parquet file containing columns: doc_id (str), contents (str), path (str), start_end_idx (tuple), metadata (dict)
Remapped QA instance	QA (update_corpus only)	A new QA instance with retrieval_gt updated to reference doc_ids in the new corpus, linked to the new corpus
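The nested shape of the QA columns is easy to get wrong, so here is a small sketch of one row and a shape check. QARow and is_valid_qa_row are hypothetical names introduced for illustration; the column names and types are taken from the outputs table above.

```python
from typing import List, TypedDict


class QARow(TypedDict):
    """Hypothetical type mirroring the four QA parquet columns."""
    qid: str
    query: str
    retrieval_gt: List[List[str]]  # list of groups, each a list of doc_ids
    generation_gt: List[str]       # one or more reference answers


def is_valid_qa_row(row: dict) -> bool:
    """Hypothetical check that a row matches the documented column types."""
    return (
        isinstance(row.get("qid"), str)
        and isinstance(row.get("query"), str)
        and isinstance(row.get("retrieval_gt"), list)
        and all(
            isinstance(group, list) and all(isinstance(d, str) for d in group)
            for group in row["retrieval_gt"]
        )
        and isinstance(row.get("generation_gt"), list)
        and all(isinstance(answer, str) for answer in row["generation_gt"])
    )


example: QARow = {
    "qid": "q-1",
    "query": "What is chunking?",
    "retrieval_gt": [["doc-a", "doc-b"]],
    "generation_gt": ["Splitting documents into passages."],
}
```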

Usage Examples

Basic Export

from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be the pandas DataFrame produced by the parsing step.
# chunk() takes a chunking module name first; the method is passed as chunk_method.
qa = (Raw(parsed_df)
      .chunk("llama_index_chunk", chunk_method="token", chunk_size=512)
      .sample(random_single_hop, n=100)
      .make_retrieval_gt_contents()
      .batch_apply(factoid_query_gen, llm=llm)
      .batch_apply(make_basic_gen_gt, llm=llm)
      .filter(dontknow_filter_rule_based, lang="en"))

# Export to parquet files (both paths must end with ".parquet")
qa.to_parquet(
    qa_save_path="./output/qa.parquet",
    corpus_save_path="./output/corpus.parquet"
)

Corpus Remapping

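from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be the pandas DataFrame produced by the parsing step.
raw = Raw(parsed_df)

# Original corpus with small chunks
corpus_small = raw.chunk("llama_index_chunk", chunk_method="token",
                         chunk_size=256, chunk_overlap=32)

# Build QA from original corpus
qa_original = (corpus_small
    .sample(random_single_hop, n=100)
    .make_retrieval_gt_contents()
    .batch_apply(factoid_query_gen, llm=llm)
    .batch_apply(make_basic_gen_gt, llm=llm)
    .filter(dontknow_filter_rule_based, lang="en"))

# Create new corpus with larger chunks from the SAME raw data
corpus_large = raw.chunk("llama_index_chunk", chunk_method="token",
                         chunk_size=1024, chunk_overlap=128)

# Remap QA retrieval ground truths to the new corpus
qa_remapped = qa_original.update_corpus(corpus_large)

# Export the remapped version
qa_remapped.to_parquet(
    qa_save_path="./output/qa_large_chunks.parquet",
    corpus_save_path="./output/corpus_large_chunks.parquet"
)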

Inspecting Data Before Export

# Inspect the data that will be exported
print("QA columns:", qa.data.columns.tolist())
print("QA shape:", qa.data.shape)
print("Corpus shape:", qa.linked_corpus.data.shape)

# Preview first few rows
print(qa.data[["qid", "query", "retrieval_gt", "generation_gt"]].head())

Related Pages

Implements Principle

Principle:Marker_Inc_Korea_AutoRAG_Export_And_Corpus_Remapping
