
Principle:Marker Inc Korea AutoRAG Evaluator Initialization

From Leeroopedia
Knowledge Sources
Domains Pipeline Orchestration, RAG Pipeline Optimization
Last Updated 2026-02-12 00:00 GMT

Overview

Evaluator initialization sets up the optimization trial environment by loading datasets, creating the project directory structure, copying data files, and preparing for corpus ingestion.

Description

Before any RAG pipeline optimization can take place, the system must establish a well-defined project workspace. Evaluator initialization is the bootstrapping step that transforms raw dataset paths and a target directory into a fully prepared environment ready to execute optimization trials.

The initialization process performs several critical tasks. It loads and validates both the QA dataset and the corpus dataset from parquet files, applying schema casting to ensure consistent column types. It creates the project directory if it does not already exist, sets up the data/ subdirectory within the project, and copies the QA and corpus parquet files into that subdirectory. This local copy ensures that the trial operates on a stable snapshot of the data, even if the original files are modified during a long-running trial.

A key validation step during initialization is the cross-dataset consistency check. The system verifies that every document ID referenced in the QA dataset's ground-truth retrieval column actually exists in the corpus dataset. This prevents silent failures during evaluation where retrieval metrics would be computed against missing documents.
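A minimal sketch of such a consistency check is shown below. The function name, the `retrieval_gt` column of nested ID lists, and the `doc_id` corpus column are illustrative assumptions about the schema, not necessarily the exact API:

```python
import pandas as pd

def validate_qa_against_corpus(qa_df: pd.DataFrame, corpus_df: pd.DataFrame) -> None:
    """Raise if the QA ground truth references a doc ID absent from the corpus.

    Assumes 'retrieval_gt' holds nested lists of document IDs per row and the
    corpus identifies documents by a 'doc_id' column (illustrative schema).
    """
    corpus_ids = set(corpus_df["doc_id"])
    missing = {
        doc_id
        for gt in qa_df["retrieval_gt"]   # e.g. [["d1"], ["d2", "d3"]] per row
        for group in gt
        for doc_id in group
        if doc_id not in corpus_ids
    }
    if missing:
        raise ValueError(
            f"QA ground truth references doc IDs missing from corpus: {sorted(missing)}"
        )
```

Running this check at initialization time turns a subtle metric-corruption bug into an immediate, descriptive error.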

Usage

Evaluator initialization is used whenever a new optimization trial is started or when a validator needs to create a temporary evaluation environment. It is the first step in the optimization workflow and must complete successfully before any node line evaluation can begin.

Theoretical Basis

The initialization follows a defensive setup pattern:

Step 1 -- Path validation: Verify that both the QA data path and corpus data path exist and have the .parquet extension. Raise descriptive errors immediately for any invalid paths.

Step 2 -- Data loading: Read both parquet files using the PyArrow engine. Apply cast_qa_dataset and cast_corpus_dataset to normalize column types (e.g., ensuring list columns contain the expected nested types).

Step 3 -- Cross-validation: Run validate_qa_from_corpus_dataset to confirm that all document IDs referenced in the QA ground truth are present in the corpus. This catches data preparation errors early.

Step 4 -- Project structure creation: Create the project directory and its data/ subdirectory. Copy the QA and corpus datasets into the project if they do not already exist there, using idempotent writes to support restartability.

The resulting project directory structure is:

project_dir/
    data/
        qa.parquet
        corpus.parquet
    resources/
        (BM25 indexes, vectordb config)
    0/  (trial directories)
        config.yaml
        summary.csv
        node_line_1/
            node_type_1/
            node_type_2/
    trial.json

This standardized structure enables downstream components (node line runners, the dashboard, and the deployment runner) to locate data and results by convention rather than configuration.
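Convention-based lookup can be as simple as joining well-known names onto the project root. The helper below is a hypothetical sketch of that idea; the file and directory names follow the layout shown above:

```python
from pathlib import Path

def conventional_paths(project_dir: str) -> dict[str, Path]:
    """Resolve well-known project locations purely by naming convention.

    Downstream components need only the project root; no path configuration
    is passed around. (Illustrative helper, not part of the library API.)
    """
    root = Path(project_dir)
    return {
        "qa": root / "data" / "qa.parquet",
        "corpus": root / "data" / "corpus.parquet",
        "resources": root / "resources",
        "trial_log": root / "trial.json",
    }
```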

Related Pages

Implemented By

Uses Heuristic
