Implementation:Marker Inc Korea AutoRAG QA To Parquet
| Knowledge Sources | |
|---|---|
| Domains | Data Serialization, Information Retrieval, Evaluation Methodology |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Concrete tool, provided by the AutoRAG framework, for exporting a QA dataset and its linked corpus to AutoRAG-compatible parquet files, and for remapping retrieval ground truths onto a new corpus.
Description
The QA.to_parquet() method serializes the evaluation QA dataset and its linked corpus into two Apache Parquet files. The QA file contains only the four columns required by the AutoRAG evaluation engine: qid, query, retrieval_gt, and generation_gt. The corpus file is saved via Corpus.to_parquet(). Both save paths must end with ".parquet"; otherwise a ValueError is raised.
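The two checks described above (the ".parquet" suffix and the four-column projection) can be sketched without any file I/O. This is a minimal illustration under stated assumptions, not AutoRAG's implementation: `check_parquet_path` and `project_qa_rows` are hypothetical helpers, rows are modeled as plain dicts rather than a DataFrame, and the extra `retrieval_gt_contents` column is only an example of an intermediate column.

```python
from typing import Any, Dict, List

# The four columns the text says the QA parquet file keeps.
QA_COLUMNS = ["qid", "query", "retrieval_gt", "generation_gt"]


def check_parquet_path(path: str) -> str:
    """Hypothetical helper: reject save paths that do not end with '.parquet'."""
    if not path.endswith(".parquet"):
        raise ValueError(f"Save path must end with '.parquet', got: {path!r}")
    return path


def project_qa_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only the exported QA columns, in order."""
    return [{col: row[col] for col in QA_COLUMNS} for row in rows]


rows = [
    {
        "qid": "q1",
        "query": "example question",
        "retrieval_gt": [["doc-1"]],
        "generation_gt": ["example answer"],
        "retrieval_gt_contents": [["example passage"]],  # not exported
    }
]

check_parquet_path("./output/qa.parquet")  # passes silently
exported = project_qa_rows(rows)
print(list(exported[0].keys()))
```

Validating both paths before writing either file avoids leaving a QA file on disk without its matching corpus file; whether the library orders its checks this way is not stated in this page.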
The QA.update_corpus() method performs corpus remapping. Given a new Corpus instance (created by re-chunking the same raw documents with different parameters), it remaps every retrieval ground truth entry from the old corpus to the new one. The remapping algorithm works by:
- Extracting the evidence path, page, and start/end character indices from each old ground-truth passage.
- Building a lookup dictionary that groups the new corpus passages by source file path.
- For each evidence entry, finding all passages in the new corpus that share the same path (and optionally the same page) and whose character index ranges overlap with the original evidence.
- Collecting the matching new document IDs as the updated retrieval ground truth.
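The steps above can be sketched in plain Python. This is an illustration of the described procedure rather than the library's code: passages are modeled as dicts, the function names are hypothetical, a missing page is represented here as `None` (an assumption), and the overlap test uses the endpoint rule described in this section.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_path_lookup(new_corpus: List[dict]) -> Dict[str, List[dict]]:
    """Group the new corpus passages by their source file path."""
    lookup: Dict[str, List[dict]] = defaultdict(list)
    for passage in new_corpus:
        lookup[passage["path"]].append(passage)
    return lookup


def _overlaps(target: Tuple[int, int], dst: Tuple[int, int]) -> bool:
    """Endpoint rule: True if either endpoint of target lies inside dst."""
    t_start, t_end = target
    d_start, d_end = dst
    return d_start <= t_start <= d_end or d_start <= t_end <= d_end


def remap_evidence(evidence: dict, lookup: Dict[str, List[dict]]) -> List[str]:
    """Return the new doc_ids that match one old evidence passage."""
    matches = []
    for passage in lookup.get(evidence["path"], []):
        # Page is optional: compare only when the evidence records one.
        page = evidence.get("page")
        if page is not None and passage.get("page") != page:
            continue
        if _overlaps(evidence["start_end_idx"], passage["start_end_idx"]):
            matches.append(passage["doc_id"])
    return matches


new_corpus = [
    {"doc_id": "new-0", "path": "a.pdf", "start_end_idx": (0, 1023)},
    {"doc_id": "new-1", "path": "a.pdf", "start_end_idx": (1024, 2047)},
    {"doc_id": "new-2", "path": "b.pdf", "start_end_idx": (0, 1023)},
]
lookup = build_path_lookup(new_corpus)

inside = {"path": "a.pdf", "page": None, "start_end_idx": (300, 555)}
straddling = {"path": "a.pdf", "page": None, "start_end_idx": (1000, 1100)}
print(remap_evidence(inside, lookup))      # ['new-0']
print(remap_evidence(straddling, lookup))  # ['new-0', 'new-1']
```

Grouping by path first means each evidence entry is compared only against passages from the same source file instead of the whole corpus.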
The overlap check uses an index matching function that returns True if either endpoint of the target range falls within the destination range: (dst_start <= target_start <= dst_end) or (dst_start <= target_end <= dst_end).
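To make the behavior of this rule concrete, the sketch below implements the expression exactly as quoted and compares it with the standard closed-interval intersection test. Both function names are illustrative, and `intervals_intersect` is shown only for comparison; it is not claimed to be what AutoRAG uses.

```python
from typing import Tuple

Range = Tuple[int, int]


def match_index(target: Range, dst: Range) -> bool:
    """Endpoint rule as quoted: an endpoint of target falls within dst."""
    target_start, target_end = target
    dst_start, dst_end = dst
    return (dst_start <= target_start <= dst_end) or (
        dst_start <= target_end <= dst_end
    )


def intervals_intersect(a: Range, b: Range) -> bool:
    """Standard closed-interval intersection test (comparison only)."""
    return a[0] <= b[1] and b[0] <= a[1]


cases = {
    "target inside dst": ((10, 20), (0, 100)),
    "partial overlap": ((90, 150), (0, 100)),
    "disjoint": ((200, 300), (0, 100)),
    "target contains dst": ((0, 100), (10, 20)),
}
results = {
    name: (match_index(target, dst), intervals_intersect(target, dst))
    for name, (target, dst) in cases.items()
}
for name, (endpoint_rule, intersection) in results.items():
    print(f"{name}: endpoint rule={endpoint_rule}, intersection={intersection}")
```

The two tests agree except when the target range strictly contains the destination range: neither target endpoint lies inside the destination, so the endpoint rule returns False. If the implementation follows the quoted expression exactly, a new passage that sits entirely inside an old evidence span would not be collected, which is worth keeping in mind when remapping from larger chunks to smaller ones.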
Usage
Use to_parquet() as the final step of the evaluation data creation pipeline to produce the files consumed by the AutoRAG evaluation engine. Use update_corpus() when you want to evaluate the same QA pairs against a differently chunked version of the same documents.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/data/qa/schema.py (lines 175-252)
Signature
class QA:
    def to_parquet(self, qa_save_path: str, corpus_save_path: str):
        ...

    def update_corpus(self, new_corpus: Corpus) -> "QA":
        ...
Import
from autorag.data.qa.schema import QA
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qa_save_path | str | yes (to_parquet) | File path for saving the QA parquet file. Must end with ".parquet". |
| corpus_save_path | str | yes (to_parquet) | File path for saving the corpus parquet file. Must end with ".parquet". |
| new_corpus | Corpus | yes (update_corpus) | A new Corpus instance created from the same Raw data with different chunking parameters. Must have valid linked_raw and columns doc_id, path, start_end_idx, metadata. |
Outputs
| Name | Type | Description |
|---|---|---|
| QA parquet file | File (parquet) | Parquet file containing columns: qid (str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str]) |
| Corpus parquet file | File (parquet) | Parquet file containing columns: doc_id (str), contents (str), path (str), start_end_idx (tuple), metadata (dict) |
| Remapped QA instance | QA (update_corpus only) | A new QA instance with retrieval_gt updated to reference doc_ids in the new corpus, linked to the new corpus |
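The column types listed for the QA parquet file can be checked on in-memory rows before export. The following is a hedged sketch: `check_qa_row` is a hypothetical helper, not part of AutoRAG, and it assumes rows are plain dicts holding Python lists and strings.

```python
from typing import Any, Dict, List


def _is_str_list(value: Any) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def check_qa_row(row: Dict[str, Any]) -> List[str]:
    """Return the problems found in one QA row (empty list if none)."""
    problems = []
    for col in ("qid", "query"):
        if not isinstance(row.get(col), str):
            problems.append(f"{col} must be a str")
    gt = row.get("retrieval_gt")
    if not (isinstance(gt, list) and all(_is_str_list(group) for group in gt)):
        problems.append("retrieval_gt must be a list of lists of str")
    if not _is_str_list(row.get("generation_gt")):
        problems.append("generation_gt must be a list of str")
    return problems


good_row = {
    "qid": "q1",
    "query": "example question",
    "retrieval_gt": [["doc-1", "doc-2"], ["doc-3"]],
    "generation_gt": ["example answer"],
}
bad_row = {
    "qid": "q2",
    "query": "example question",
    "retrieval_gt": ["doc-1"],  # flat list instead of a list of lists
    "generation_gt": "example answer",  # bare string instead of a list
}
print(check_qa_row(good_row))
print(check_qa_row(bad_row))
```

This check targets data held in memory before export. Values read back from a parquet file may arrive as array types rather than Python lists, which this helper would reject.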
Usage Examples
Basic Export
from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be a pandas DataFrame of parsed documents, loaded beforehand.
# Raw.chunk() takes the chunking module name first; the splitter is chosen via chunk_method.
qa = (
    Raw(parsed_df)
    .chunk("llama_index_chunk", chunk_method="token", chunk_size=512)
    .sample(random_single_hop, n=100)
    .make_retrieval_gt_contents()
    .batch_apply(factoid_query_gen, llm=llm)
    .batch_apply(make_basic_gen_gt, llm=llm)
    .filter(dontknow_filter_rule_based, lang="en")
)

# Export to parquet files (both paths must end with ".parquet")
qa.to_parquet(
    qa_save_path="./output/qa.parquet",
    corpus_save_path="./output/corpus.parquet",
)
Corpus Remapping
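from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be a pandas DataFrame of parsed documents, loaded beforehand.
raw = Raw(parsed_df)

# Original corpus with small chunks
corpus_small = raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=256, chunk_overlap=32
)

# Build QA from the original corpus
qa_original = (
    corpus_small
    .sample(random_single_hop, n=100)
    .make_retrieval_gt_contents()
    .batch_apply(factoid_query_gen, llm=llm)
    .batch_apply(make_basic_gen_gt, llm=llm)
    .filter(dontknow_filter_rule_based, lang="en")
)

# Create a new corpus with larger chunks from the SAME raw data
corpus_large = raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=1024, chunk_overlap=128
)

# Remap QA retrieval ground truths to the new corpus
qa_remapped = qa_original.update_corpus(corpus_large)

# Export the remapped version
qa_remapped.to_parquet(
    qa_save_path="./output/qa_large_chunks.parquet",
    corpus_save_path="./output/corpus_large_chunks.parquet",
)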
Accessing Raw Data Before Export
# Inspect the data that will be exported
print("QA columns:", qa.data.columns.tolist())
print("QA shape:", qa.data.shape)
print("Corpus shape:", qa.linked_corpus.data.shape)
# Preview first few rows
print(qa.data[["qid", "query", "retrieval_gt", "generation_gt"]].head())