Implementation:Marker Inc Korea AutoRAG QA To Parquet

From Leeroopedia
Knowledge Sources
Domains: Data Serialization, Information Retrieval, Evaluation Methodology
Last Updated: 2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, for exporting a QA dataset and its corpus to AutoRAG-compatible parquet files and for remapping retrieval ground truths to a new corpus.

Description

The QA.to_parquet() method serializes the evaluation QA dataset and its linked corpus into two Apache Parquet files. The QA file contains only the four columns required by the AutoRAG evaluation engine: qid, query, retrieval_gt, and generation_gt. The corpus file is saved via Corpus.to_parquet(). Both save paths must end with ".parquet"; otherwise a ValueError is raised.
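The path check and column selection described above can be illustrated with a small plain-Python sketch. This is not the code in schema.py: the helper names check_parquet_path and select_export_columns, the error message text, and the use of a dictionary in place of a DataFrame row are all assumptions made for illustration.

```python
# The four columns the text says the QA parquet file contains.
QA_EXPORT_COLUMNS = ["qid", "query", "retrieval_gt", "generation_gt"]


def check_parquet_path(path: str) -> str:
    """Raise ValueError unless the path ends with '.parquet' (hypothetical helper)."""
    if not path.endswith(".parquet"):
        raise ValueError(f"Save path must end with .parquet, got: {path}")
    return path


def select_export_columns(row: dict) -> dict:
    """Keep only the four export columns; any extra keys are dropped."""
    return {name: row[name] for name in QA_EXPORT_COLUMNS}


# Illustrative row with one extra working column that would not be exported.
row = {
    "qid": "q1",
    "query": "What is chunking?",
    "retrieval_gt": [["doc-1"]],
    "generation_gt": ["Splitting documents into passages."],
    "retrieval_gt_contents": [["..."]],
}
print(sorted(select_export_columns(row)))
check_parquet_path("./output/qa.parquet")  # passes silently

try:
    check_parquet_path("./output/qa.csv")
except ValueError as err:
    print("rejected:", err)
```

Validating both paths before writing anything means a typo in the second path cannot leave a half-finished export on disk, which is one plausible reason to check up front.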

The QA.update_corpus() method performs corpus remapping. Given a new Corpus instance (created by re-chunking the same raw documents with different parameters), it remaps every retrieval ground truth entry from the old corpus to the new one. The remapping algorithm works by:

  1. Extracting the evidence path, page, and start/end character indices from each old ground-truth passage.
  2. Building a lookup dictionary that groups the new corpus passages by source file path.
  3. For each evidence entry, finding all passages in the new corpus that share the same path (and optionally the same page) and whose character index ranges overlap with the original evidence.
  4. Collecting the matching new document IDs as the updated retrieval ground truth.
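The four steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation in schema.py: the function name remap_retrieval_gt and the dictionary fields (doc_id, path, page, start, end) are hypothetical, and the range test is inlined as a simple inclusive endpoint check with the old evidence treated as the range being looked up.

```python
from collections import defaultdict


def remap_retrieval_gt(evidence, new_passages):
    """Return, per evidence entry, the doc_ids of overlapping new passages.

    Hypothetical record layout (not AutoRAG's real schema): every record is a
    dict with "path", "page" (or None), "start" and "end"; new passages also
    carry a "doc_id".
    """
    # Step 2: group the new corpus passages by source file path.
    by_path = defaultdict(list)
    for passage in new_passages:
        by_path[passage["path"]].append(passage)

    remapped = []
    # Step 1 is assumed done: each entry already holds path, page and indices.
    for ev in evidence:
        matches = []
        for passage in by_path.get(ev["path"], []):
            # Step 3a: when the evidence has a page, the passage must share it.
            if ev["page"] is not None and passage["page"] != ev["page"]:
                continue
            # Step 3b: keep passages that contain either endpoint of the evidence.
            start_inside = passage["start"] <= ev["start"] <= passage["end"]
            end_inside = passage["start"] <= ev["end"] <= passage["end"]
            if start_inside or end_inside:
                matches.append(passage["doc_id"])
        # Step 4: the matching doc_ids become the updated ground truth.
        remapped.append(matches)
    return remapped


old_evidence = [{"path": "a.pdf", "page": None, "start": 100, "end": 180}]
new_corpus = [
    {"doc_id": "n1", "path": "a.pdf", "page": None, "start": 0, "end": 149},
    {"doc_id": "n2", "path": "a.pdf", "page": None, "start": 150, "end": 299},
    {"doc_id": "n3", "path": "b.pdf", "page": None, "start": 100, "end": 180},
]
print(remap_retrieval_gt(old_evidence, new_corpus))  # [['n1', 'n2']]
```

Grouping by path first keeps the inner loop small: each evidence entry is compared only with passages from its own source file rather than with the whole new corpus.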

The overlap check uses an index matching function that returns True if either endpoint of the target range falls within the destination range: (dst_start <= target_start <= dst_end) or (dst_start <= target_end <= dst_end).
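To make the behavior of this condition concrete, here is a minimal sketch of the predicate exactly as the formula states it, followed by a few edge cases. The function name match_index and the tuple-based arguments are assumptions for illustration; only the boolean expression is taken from the text.

```python
def match_index(target, dst):
    """Endpoint-in-range check as written in the formula above (hypothetical name)."""
    target_start, target_end = target
    dst_start, dst_end = dst
    return (dst_start <= target_start <= dst_end) or (dst_start <= target_end <= dst_end)


# Partial overlap: the target's end falls inside the destination.
print(match_index((100, 180), (150, 299)))  # True
# Touching ranges count, because the comparisons are inclusive.
print(match_index((100, 150), (150, 299)))  # True
# Disjoint ranges do not match.
print(match_index((100, 140), (150, 299)))  # False
# The target strictly contains the destination: neither endpoint lies inside it.
print(match_index((100, 400), (150, 299)))  # False
```

The last case follows directly from the formula: when the target range strictly contains the destination range, neither target endpoint lies inside the destination, so the check returns False. This may matter when remapping from large chunks to much smaller ones; whether the library handles that situation elsewhere is not covered here.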

Usage

Use to_parquet() as the final step of the evaluation data creation pipeline to produce the files consumed by the AutoRAG evaluation engine. Use update_corpus() when you want to evaluate the same QA pairs against a differently chunked version of the same documents.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/data/qa/schema.py (lines 175-252)

Signature

class QA:
    def to_parquet(self, qa_save_path: str, corpus_save_path: str):
        ...

    def update_corpus(self, new_corpus: Corpus) -> "QA":
        ...

Import

from autorag.data.qa.schema import QA

I/O Contract

Inputs

Name | Type | Required | Description
qa_save_path | str | yes (to_parquet) | File path for saving the QA parquet file. Must end with ".parquet".
corpus_save_path | str | yes (to_parquet) | File path for saving the corpus parquet file. Must end with ".parquet".
new_corpus | Corpus | yes (update_corpus) | A new Corpus instance created from the same Raw data with different chunking parameters. Must have a valid linked_raw and the columns doc_id, path, start_end_idx, and metadata.

Outputs

Name | Type | Description
QA parquet file | File (parquet) | Parquet file containing columns: qid (str), query (str), retrieval_gt (List[List[str]]), generation_gt (List[str])
Corpus parquet file | File (parquet) | Parquet file containing columns: doc_id (str), contents (str), path (str), start_end_idx (tuple), metadata (dict)
Remapped QA instance | QA (update_corpus only) | A new QA instance, linked to the new corpus, with retrieval_gt updated to reference doc_ids in the new corpus
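To make the column types of the QA output concrete, the following sketch builds one illustrative row as a plain dictionary and checks it against the types listed in the table. The values and the helper name looks_like_qa_row are made up for illustration; AutoRAG itself holds these rows in a DataFrame rather than in dictionaries.

```python
def looks_like_qa_row(row: dict) -> bool:
    """Check a row against the QA column types listed in the outputs table."""
    return (
        isinstance(row.get("qid"), str)
        and isinstance(row.get("query"), str)
        # retrieval_gt is a list of lists of doc_id strings.
        and isinstance(row.get("retrieval_gt"), list)
        and all(
            isinstance(group, list) and all(isinstance(d, str) for d in group)
            for group in row["retrieval_gt"]
        )
        # generation_gt is a flat list of answer strings.
        and isinstance(row.get("generation_gt"), list)
        and all(isinstance(answer, str) for answer in row["generation_gt"])
    )


example_row = {
    "qid": "q-001",
    "query": "Which file format does the export use?",
    "retrieval_gt": [["doc-17", "doc-18"]],
    "generation_gt": ["Apache Parquet"],
}
print(looks_like_qa_row(example_row))  # True

# A flat list of doc_ids does not satisfy List[List[str]].
flat_row = dict(example_row, retrieval_gt=["doc-17", "doc-18"])
print(looks_like_qa_row(flat_row))  # False
```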

Usage Examples

Basic Export

from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be a pandas DataFrame of parsed documents, loaded beforehand.
qa = (Raw(parsed_df)
      # Raw.chunk takes the chunk module name first, then that module's parameters
      .chunk("llama_index_chunk", chunk_method="token", chunk_size=512)
      .sample(random_single_hop, n=100)
      .make_retrieval_gt_contents()
      .batch_apply(factoid_query_gen, llm=llm)
      .batch_apply(make_basic_gen_gt, llm=llm)
      .filter(dontknow_filter_rule_based, lang="en"))

# Export to parquet files; both paths must end with ".parquet"
qa.to_parquet(
    qa_save_path="./output/qa.parquet",
    corpus_save_path="./output/corpus.parquet"
)

Corpus Remapping

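from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
from autorag.data.qa.generation_gt.llama_index_gen_gt import make_basic_gen_gt
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# parsed_df is assumed to be a pandas DataFrame of parsed documents, loaded beforehand.
raw = Raw(parsed_df)

# Original corpus with small chunks
corpus_small = raw.chunk("llama_index_chunk", chunk_method="token",
                         chunk_size=256, chunk_overlap=32)

# Build QA from original corpus
qa_original = (corpus_small
    .sample(random_single_hop, n=100)
    .make_retrieval_gt_contents()
    .batch_apply(factoid_query_gen, llm=llm)
    .batch_apply(make_basic_gen_gt, llm=llm)
    .filter(dontknow_filter_rule_based, lang="en"))

# Create new corpus with larger chunks from the SAME raw data
corpus_large = raw.chunk("llama_index_chunk", chunk_method="token",
                         chunk_size=1024, chunk_overlap=128)

# Remap QA retrieval ground truths to the new corpus
qa_remapped = qa_original.update_corpus(corpus_large)

# Export the remapped version
qa_remapped.to_parquet(
    qa_save_path="./output/qa_large_chunks.parquet",
    corpus_save_path="./output/corpus_large_chunks.parquet"
)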

Inspecting the Data Before Export

# Inspect the data that will be exported
print("QA columns:", qa.data.columns.tolist())
print("QA shape:", qa.data.shape)
print("Corpus shape:", qa.linked_corpus.data.shape)

# Preview first few rows
print(qa.data[["qid", "query", "retrieval_gt", "generation_gt"]].head())

Related Pages

Implements Principle
