Heuristic:Marker Inc Korea AutoRAG Data Format Requirements

From Leeroopedia

Knowledge Sources
Domains: RAG, Debugging
Last Updated: 2026-02-08 06:00 GMT

Overview

Data format validation rules and preprocessing steps required for QA and corpus datasets to work correctly with AutoRAG pipelines.

Description

AutoRAG enforces strict format requirements for both QA and corpus datasets. QA data must be a parquet file with the columns `qid`, `query`, `retrieval_gt`, and `generation_gt`, whose types are enforced at load time. Corpus data requires `doc_id`, `contents`, and `metadata` columns and is enriched automatically (datetime injection, prev/next passage linking, removal of empty contents, Unicode normalization). Data that does not meet these requirements fails at runtime. Resetting the DataFrame index is also critical: an index that does not run from 0 to n-1 causes result length mismatches.

Usage

Apply this heuristic when preparing data for AutoRAG, debugging data-related errors, or creating custom data pipelines. Always validate data format before running optimization or deployment.
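
As a rough illustration of such a pre-flight check, the sketch below inspects an in-memory QA DataFrame before it is handed to AutoRAG. `check_qa_frame`, `_is_str_seq`, and `REQUIRED_QA_COLUMNS` are hypothetical names introduced here, not part of AutoRAG; the column names and expected types are the ones listed on this page. List-like cells are accepted as `list`, `tuple`, or NumPy array, on the assumption that frames read back from parquet may hold arrays rather than Python lists.

```python
import numpy as np
import pandas as pd

# Column names as listed on this page; the helper itself is hypothetical.
REQUIRED_QA_COLUMNS = ["qid", "query", "retrieval_gt", "generation_gt"]


def _is_str_seq(value) -> bool:
    """True if value is list-like and every item in it is a str."""
    return isinstance(value, (list, tuple, np.ndarray)) and all(
        isinstance(item, str) for item in value
    )


def check_qa_frame(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    missing = [c for c in REQUIRED_QA_COLUMNS if c not in df.columns]
    if missing:
        # The per-row checks below need every column, so stop early.
        return [f"missing columns: {missing}"]

    problems = []
    # reset_index(drop=True) yields labels 0..n-1; anything else is suspicious.
    if list(df.index) != list(range(len(df))):
        problems.append("index is not 0..n-1; call df.reset_index(drop=True)")

    for row in df.itertuples(index=False):
        if not isinstance(row.qid, str) or not isinstance(row.query, str):
            problems.append(f"qid and query must be str (qid={row.qid!r})")
        gt = row.retrieval_gt
        if not (
            isinstance(gt, (list, tuple, np.ndarray))
            and all(_is_str_seq(group) for group in gt)
        ):
            problems.append(f"retrieval_gt must be List[List[str]] (qid={row.qid!r})")
        if not _is_str_seq(row.generation_gt):
            problems.append(f"generation_gt must be List[str] (qid={row.qid!r})")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to fix a dataset in a single pass; callers that prefer a hard failure can raise when the list is non-empty.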

The Insight (Rule of Thumb)

  • QA Dataset Requirements:
    • Format: Parquet file
    • Required columns: `qid` (str), `query` (str), `retrieval_gt` (List[List[str]]), `generation_gt` (List[str])
    • `query` and every `generation_gt` entry are normalized automatically at load time (lowercased, punctuation removed, extra whitespace stripped)
    • Index must be reset: `df.reset_index(drop=True)`
  • Corpus Dataset Requirements:
    • Format: Parquet file
    • Required columns: `doc_id` (str), `contents` (str), `metadata` (dict)
    • Rows whose `contents` is `None` or whitespace-only are dropped automatically
    • `last_modified_datetime` is auto-injected into metadata if missing
    • `prev_id` and `next_id` fields are auto-added for passage augmentation support
    • Unicode in metadata strings is normalized (NFKD form)
  • Critical: Always call `df.reset_index(drop=True)` before passing DataFrames to AutoRAG. An index that does not run from 0 to n-1 causes silent result length mismatches.
  • Project Directory Isolation: Use a separate project directory for each corpus/QA dataset pair. Reusing a directory with changed data causes `doc_id not found` errors.
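
To see what the NFKD form mentioned in the list above does to metadata strings, the sketch below applies it with the standard library's `unicodedata` module. It assumes a flat metadata dict and NFKD as stated on this page; `normalize_metadata_strings` is a hypothetical helper, not AutoRAG's implementation.

```python
import unicodedata


def normalize_metadata_strings(metadata: dict, form: str = "NFKD") -> dict:
    """Normalize every str value of a flat dict; other values pass through."""
    return {
        key: unicodedata.normalize(form, value) if isinstance(value, str) else value
        for key, value in metadata.items()
    }


# "\ufb01" is the single-character "fi" ligature; NFKD expands it to "f" + "i".
raw = {"title": "\ufb01le", "pages": 3}
normalized = normalize_metadata_strings(raw)

# NFKD also decomposes accents: "\u00e9" becomes "e" plus a combining accent,
# so the string grows by one code point.
decomposed = unicodedata.normalize("NFKD", "caf\u00e9")
```

One practical consequence is that string lengths and exact-match comparisons on metadata values can differ before and after normalization, so filters written against raw values may need the same normalization applied.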

Reasoning

The strict data format enforcement exists because AutoRAG's evaluation pipeline chains many operations that depend on consistent column access and data alignment. The index reset requirement is documented in the troubleshooting guide as a common source of confusing errors where result DataFrames have different lengths than expected. The automatic datetime injection supports the recency filter module, which requires every document to have a timestamp. The prev/next ID fields support the passage augmentation module.
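
The length mismatch described above can be reproduced with plain pandas, without AutoRAG. The sketch below assumes that results are produced as a fresh Series with a default 0..n-1 index and are then joined column-wise to a filtered QA frame; all variable names are illustrative.

```python
import pandas as pd

qa = pd.DataFrame({"qid": ["q1", "q2", "q3", "q4"], "query": ["a", "b", "c", "d"]})

# Filtering keeps the original labels, so the index is [1, 3] rather than [0, 1].
subset = qa[qa["qid"].isin(["q2", "q4"])]

# Results computed elsewhere typically carry a fresh 0..n-1 index.
scores = pd.Series([0.9, 0.4], name="score")

# Column-wise concat aligns on index labels, not on row position,
# so the union of labels {0, 1, 3} produces three rows with NaN gaps.
misaligned = pd.concat([subset, scores], axis=1)

# After resetting the index, labels match and the rows line up one to one.
aligned = pd.concat([subset.reset_index(drop=True), scores], axis=1)
```

Here `subset` has two rows, `misaligned` has three, and `aligned` has two again, which is the kind of unexpected length change the index reset is meant to prevent.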

From troubleshooting: "It might be you changed your corpus data, but don't use the new project directory. In AutoRAG, the project directory must be separated for each new corpus data or QA data."
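
One way to follow this advice is to derive the project directory name from the data itself, so that changed data cannot end up in an old directory. The sketch below is an assumption about how that could be done, not an AutoRAG feature: `project_dir_for`, the `projects` root, and the `autorag-` prefix are all hypothetical.

```python
import hashlib
from pathlib import Path


def project_dir_for(qa_path: str, corpus_path: str, root: str = "projects") -> Path:
    """Build a directory path whose name depends on the bytes of both data files."""
    digest = hashlib.sha256()
    for path in (qa_path, corpus_path):
        # Any change to either file changes the digest, and so the directory name.
        digest.update(Path(path).read_bytes())
    return Path(root) / f"autorag-{digest.hexdigest()[:12]}"
```

The returned path can be created with `mkdir(parents=True, exist_ok=True)` and used as the project directory for that dataset pair. Reading whole files into memory keeps the sketch short; very large datasets would call for chunked hashing instead.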

Code Evidence

QA dataset validation in `autorag/utils/preprocess.py:9-13`:

def validate_qa_dataset(df: pd.DataFrame):
    columns = ["qid", "query", "retrieval_gt", "generation_gt"]
    assert set(columns).issubset(df.columns), (
        f"df must have columns {columns}, but got {df.columns}"
    )

Corpus auto-enrichment in `autorag/utils/preprocess.py:70-128`:

def cast_corpus_dataset(df: pd.DataFrame):
    df = df.reset_index(drop=True)
    validate_corpus_dataset(df)

    # Drop empty contents
    df = df[~df["contents"].apply(lambda x: x is None or x.isspace())]

    # Auto-add datetime to metadata if missing
    def make_datetime_metadata(x):
        if x is None or x == {}:
            return {"last_modified_datetime": datetime.now()}
        elif x.get("last_modified_datetime") is None:
            return {**x, "last_modified_datetime": datetime.now()}
        return x

    # (excerpt; the remainder of the function is omitted here)
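
Because `str.isspace()` returns `False` for an empty string, the filter in the excerpt above does not by itself remove rows whose `contents` is `""`. A cautious preprocessing step can drop those rows before the corpus reaches AutoRAG. The following is a minimal sketch with made-up sample data; `is_blank` is a hypothetical helper.

```python
import pandas as pd

corpus = pd.DataFrame(
    {
        "doc_id": ["d1", "d2", "d3", "d4"],
        "contents": ["First passage.", "", "   \n", None],
        "metadata": [{}, {}, {}, {}],
    }
)


def is_blank(text) -> bool:
    """True for missing values, empty strings, and whitespace-only strings."""
    # Anything that is not a str (None, NaN) counts as missing.
    # strip() handles "" as well, which str.isspace() does not.
    return not isinstance(text, str) or text.strip() == ""


# Drop blank rows, then reset the index so labels run from 0 to n-1 again.
cleaned = corpus[~corpus["contents"].apply(is_blank)].reset_index(drop=True)
```

Only the `d1` row survives in this sample, and the final `reset_index(drop=True)` keeps the cleaned frame consistent with the index rule described earlier on this page.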

Text normalization at load time in `autorag/utils/preprocess.py:63-66`:

df["query"] = df["query"].apply(preprocess_text)
df["generation_gt"] = df["generation_gt"].apply(
    lambda x: list(map(preprocess_text, x))
)

Validator file checks in `autorag/validator.py:29-37`:

if not os.path.exists(qa_data_path):
    raise ValueError(f"QA data path {qa_data_path} does not exist.")
if not qa_data_path.endswith(".parquet"):
    raise ValueError(f"QA data path {qa_data_path} is not a parquet file.")
