Principle:Marker Inc Korea AutoRAG Evaluator Initialization
| Knowledge Sources | |
|---|---|
| Domains | Pipeline Orchestration, RAG Pipeline Optimization |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Evaluator initialization sets up the optimization trial environment by loading datasets, creating the project directory structure, copying data files, and preparing for corpus ingestion.
Description
Before any RAG pipeline optimization can take place, the system must establish a well-defined project workspace. Evaluator initialization is the bootstrapping step that transforms raw dataset paths and a target directory into a fully prepared environment ready to execute optimization trials.
The initialization process performs several critical tasks. It loads and validates both the QA dataset and the corpus dataset from parquet files, applying schema casting to ensure consistent column types. It creates the project directory if it does not already exist, sets up the data/ subdirectory within the project, and copies the QA and corpus parquet files into that subdirectory. This local copy ensures that the trial operates on a stable snapshot of the data, even if the original files are modified during a long-running trial.
A key validation step during initialization is the cross-dataset consistency check. The system verifies that every document ID referenced in the QA dataset's ground-truth retrieval column actually exists in the corpus dataset. This prevents silent failures during evaluation where retrieval metrics would be computed against missing documents.
Usage
Evaluator initialization is used whenever a new optimization trial is started or when a validator needs to create a temporary evaluation environment. It is the first step in the optimization workflow and must complete successfully before any node line evaluation can begin.
Theoretical Basis
The initialization follows a defensive setup pattern:
Step 1 -- Path validation: Verify that both the QA data path and corpus data path exist and have the .parquet extension. Raise descriptive errors immediately for any invalid paths.
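This fail-fast check can be sketched as follows (the helper name and error messages are illustrative, not AutoRAG's actual API):

```python
import os


def validate_dataset_path(path: str, name: str) -> None:
    """Raise a descriptive error for a missing or non-parquet dataset path."""
    if not os.path.exists(path):
        raise ValueError(f"{name} dataset path does not exist: {path}")
    if not path.endswith(".parquet"):
        raise ValueError(f"{name} dataset path must be a .parquet file: {path}")
```

Raising immediately here, before any data is read, keeps the error close to its cause instead of surfacing as an obscure read failure mid-trial.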
Step 2 -- Data loading: Read both parquet files using the PyArrow engine. Apply cast_qa_dataset and cast_corpus_dataset to normalize column types (e.g., ensuring list columns contain the expected nested types).
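A minimal sketch of the loading-and-casting step. The cast function below is a simplified stand-in for AutoRAG's `cast_qa_dataset`, handling only one normalization (wrapping flat ground-truth lists into the nested list-of-lists form); the `retrieval_gt` column name follows AutoRAG's QA schema convention:

```python
import pandas as pd


def cast_qa_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Simplified cast: ensure every retrieval_gt row is a list of lists
    of doc IDs, wrapping flat lists like ["d1"] into [["d1"]]."""
    df = df.copy()
    df["retrieval_gt"] = df["retrieval_gt"].apply(
        lambda v: [v] if len(v) > 0 and not isinstance(v[0], list) else v
    )
    return df


def load_qa_dataset(path: str) -> pd.DataFrame:
    # Read with the PyArrow engine, then normalize column types.
    return cast_qa_dataset(pd.read_parquet(path, engine="pyarrow"))
```

Normalizing nested types up front means every downstream metric can assume one canonical shape for the ground truth.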
Step 3 -- Cross-validation: Run validate_qa_from_corpus_dataset to confirm that all document IDs referenced in the QA ground truth are present in the corpus. This catches data preparation errors early.
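The consistency check amounts to a set-membership test over doc IDs. The sketch below illustrates what `validate_qa_from_corpus_dataset` verifies, assuming AutoRAG's conventional column names (`retrieval_gt` in the QA frame, `doc_id` in the corpus frame); it is not the library's actual implementation:

```python
import pandas as pd


def validate_qa_from_corpus(qa_df: pd.DataFrame, corpus_df: pd.DataFrame) -> None:
    """Raise if any ground-truth doc ID in the QA dataset is absent
    from the corpus dataset."""
    corpus_ids = set(corpus_df["doc_id"])
    for gt in qa_df["retrieval_gt"]:
        # retrieval_gt is a list of lists of doc IDs
        for group in gt:
            missing = [doc_id for doc_id in group if doc_id not in corpus_ids]
            if missing:
                raise ValueError(
                    f"QA ground truth references doc IDs missing from corpus: {missing}"
                )
```

Without this check, a missing document would silently score as a retrieval miss, corrupting every metric computed over that query.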
Step 4 -- Project structure creation: Create the project directory and its data/ subdirectory. Copy the QA and corpus datasets into the project if they do not already exist there, using idempotent writes to support restartability.
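The idempotent-copy behavior can be sketched like this (function name is illustrative; the skip-if-present check is what makes re-running a trial safe):

```python
import os
import shutil


def prepare_project_dir(project_dir: str, qa_path: str, corpus_path: str) -> None:
    """Create project_dir/data and snapshot the datasets into it.
    Existing snapshots are left untouched, so restarts are safe."""
    data_dir = os.path.join(project_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    for src, name in [(qa_path, "qa.parquet"), (corpus_path, "corpus.parquet")]:
        dst = os.path.join(data_dir, name)
        if not os.path.exists(dst):  # idempotent: keep the original snapshot
            shutil.copy2(src, dst)
```

Because the copy is skipped when the destination exists, a restarted trial keeps operating on the snapshot it began with, even if the source files have since changed.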
The resulting project directory structure is:
project_dir/
├── data/
│   ├── qa.parquet
│   └── corpus.parquet
├── resources/
│   └── (BM25 indexes, vectordb config)
├── 0/  (one directory per trial: 0/, 1/, ...)
│   ├── config.yaml
│   ├── summary.csv
│   └── node_line_1/
│       ├── node_type_1/
│       └── node_type_2/
└── trial.json
This standardized structure enables downstream components (node line runners, the dashboard, and the deployment runner) to locate data and results by convention rather than configuration.