Heuristic:Marker Inc Korea AutoRAG One Dataset Per Project Directory

From Leeroopedia
Knowledge Sources
Domains: Pipeline_Design, RAG, Debugging
Last Updated: 2026-02-12 00:00 GMT

Overview

Each AutoRAG project directory must be used with exactly one QA and corpus dataset pair; reusing a directory with different data causes validation errors and corrupted results.

Description

AutoRAG expects a strict one-to-one mapping between project directories and datasets. When `Evaluator.__init__` runs, it copies the QA and corpus parquet files into a `data/` subdirectory within the project folder, and subsequent trials use these cached copies. If you change your dataset but reuse the same project directory, trials run against the stale cached data, which leads to `doc_id not found` validation errors and incorrect evaluation results. Separately, passage augmenters generate new document IDs that do not exist in the original corpus, so the pre-trial validation step has to be disabled with `skip_validation=True` whenever an augmenter is configured.
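One way to notice a mismatched dataset before any trial starts is to record which dataset a project directory was first used with and refuse to reuse it for anything else. The following is a minimal sketch of that idea and is not part of AutoRAG: `dataset_fingerprint`, `claim_project_dir`, and the `.dataset_fingerprint` marker file are hypothetical names introduced only for illustration.

```python
import hashlib
from pathlib import Path

# Hypothetical marker file; AutoRAG itself does not create or read it.
MARKER_NAME = ".dataset_fingerprint"


def dataset_fingerprint(qa_path, corpus_path) -> str:
    """Combine the SHA-256 digests of both dataset files into one hex digest."""
    combined = hashlib.sha256()
    for path in (qa_path, corpus_path):
        # Reads each file fully into memory, which is fine for a sketch.
        combined.update(hashlib.sha256(Path(path).read_bytes()).digest())
    return combined.hexdigest()


def claim_project_dir(project_dir, qa_path, corpus_path) -> None:
    """Bind a project directory to one dataset; raise if it is bound to another."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    marker = project_dir / MARKER_NAME
    current = dataset_fingerprint(qa_path, corpus_path)
    if marker.exists():
        if marker.read_text().strip() != current:
            raise ValueError(
                f"{project_dir} was created for a different dataset; "
                "use a new project directory"
            )
        return  # Same dataset as before, so reuse is safe.
    marker.write_text(current)
```

Calling `claim_project_dir(project_dir, qa_path, corpus_path)` just before constructing the evaluator turns a silent stale-cache problem into an immediate, explicit error. The fingerprint is computed from the source files rather than from the cached copies because the cached parquet files may not be byte-identical to the inputs.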

Usage

Apply this heuristic when managing AutoRAG project directories. Always create a new project directory when changing QA or corpus data. When using passage augmenters, disable validation or expect `doc_id not found` errors.

The Insight (Rule of Thumb)

  • Action 1: Use a separate project directory for each dataset (QA + corpus pair).
  • Action 2: Delete or rename the old project directory before re-running with new data.
  • Action 3: When using passage augmenters, set `skip_validation=True` to disable the pre-trial validation step.
  • Value: One project directory = one dataset = one set of trial results.
  • Trade-off: More disk usage from multiple project directories, but prevents data corruption and validation errors.
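Action 2 in the list above can be scripted so that old trial results are kept rather than deleted. The sketch below renames an existing project directory with a timestamp suffix and leaves an empty directory in its place. `archive_project_dir` and the suffix format are assumptions made for this example, not an AutoRAG feature.

```python
from datetime import datetime
from pathlib import Path


def archive_project_dir(project_dir, now=None):
    """Rename an existing project directory so a fresh one can take its place.

    Returns the archive path, or None if there was nothing to archive.
    """
    project_dir = Path(project_dir)
    if not project_dir.exists():
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    archived = project_dir.with_name(f"{project_dir.name}.{stamp}")
    if archived.exists():
        raise FileExistsError(f"archive target already exists: {archived}")
    project_dir.rename(archived)
    # Recreate an empty directory for the new dataset's trials.
    project_dir.mkdir(parents=True)
    return archived
```

Renaming instead of deleting keeps the cached data, indexes, and trial results of the old dataset together, at the cost of the extra disk usage noted in the trade-off.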

Reasoning

The project directory serves as a persistent workspace: datasets are cached, BM25 indexes are pickled, vector database indexes are stored, and trial results are saved. Reusing a directory with different data creates mismatches between the stored indexes and the data, which surface as cryptic `doc_id not found in corpus_data` errors. The passage augmenter issue is a known limitation: augmenters create synthetic passages with new IDs that are not present in the original corpus, so the standard validation check, which verifies that all QA ground-truth IDs exist in the corpus, fails.
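The validation check described above is essentially a set-membership test. The following sketch illustrates it and is not AutoRAG's actual implementation. It assumes each question's ground truth is a nested list of document-ID groups, and `check_retrieval_gt` is a hypothetical name.

```python
from typing import Iterable, List


def check_retrieval_gt(
    retrieval_gt_rows: Iterable[List[List[str]]],
    corpus_doc_ids: Iterable[str],
) -> None:
    """Raise ValueError for the first ground-truth ID missing from the corpus."""
    known = set(corpus_doc_ids)
    for row in retrieval_gt_rows:      # one row per question
        for group in row:              # assumed: groups of acceptable IDs
            for doc_id in group:
                if doc_id not in known:
                    raise ValueError(
                        f"doc_id: {doc_id} not found in corpus_data."
                    )
```

Any ID that is absent from the corpus being checked, whether it comes from a different dataset or was newly generated, makes a check of this kind fail.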

Code Evidence

Troubleshooting documentation from `docs/source/troubleshooting.md:42-48`:

Delete the project directory or use another project directory.
...
In AutoRAG, the project directory must be separated for each new corpus data or QA data.
Which means one dataset per one project directory is needed.

Passage augmenter validation issue from `docs/source/troubleshooting.md:26-40`:

When you face error like `ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b
not found in corpus_data.`
...
Check there is a passage augmenter on your YAML file.
The passage augmenter is not supporting a validation process now.

Jupyter event loop fix from `docs/source/troubleshooting.md:14-19`:

# If you face event loop-related issue while using Jupyter notebook:
import nest_asyncio
nest_asyncio.apply()
