Implementation: Marker Inc Korea AutoRAG Evaluator Start Trial
| Knowledge Sources | |
|---|---|
| Domains | RAG Pipeline Evaluation, Model Validation |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool from the AutoRAG framework for evaluating an optimized RAG pipeline against a test dataset by reusing the core Evaluator trial infrastructure.
Description
The Evaluator.start_trial method orchestrates a complete pipeline evaluation. When used for test dataset evaluation, it is called with the best.yaml configuration (the output of extract_best_config) rather than a multi-candidate search config. Because the best config has exactly one module per node, the Evaluator performs a single forward pass rather than a combinatorial search.
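For orientation, a best.yaml might look like the sketch below. The node line, node, and module names are hypothetical and the schema is abbreviated; the essential property is a single modules entry per node, which is what turns the trial into a single forward pass:
# Hypothetical, abbreviated best.yaml sketch -- not generated output.
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        top_k: 10
        modules:
          - module_type: bm25   # exactly one winning module for this node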
The method proceeds through several phases. First, it creates a resources directory and optionally validates the config YAML via the Validator class. Then it generates a new trial name (an incrementing integer), creates the trial directory, and copies the config YAML into it. Next, it handles corpus ingestion: BM25 indices are built if any node uses BM25 retrieval, and vector database embeddings are ingested if any node uses vectordb retrieval. The ingestion respects the full_ingest parameter: when True, the entire corpus is checked against the vector store; when False, only documents referenced in the retrieval ground truth are considered.
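The trial-naming step is simple enough to sketch. The helper below is hypothetical (not AutoRAG source) and assumes trial directories are the only numerically named entries in the project directory:
import os

def next_trial_name(project_dir: str) -> str:
    # Hypothetical sketch: trial names are incrementing integers ("0", "1", ...),
    # so the next name is one past the largest existing numeric directory.
    existing = [int(d) for d in os.listdir(project_dir) if d.isdigit()]
    return str(max(existing) + 1) if existing else "0"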
The core execution iterates over each node line in the config, calling run_node_line to process all nodes within that line. The first node line receives the QA data as its initial input. Results flow forward from each node line to the next. After all node lines complete, the per-node-line summaries are aggregated into a trial-level summary.csv containing columns for node line name, node type, best module filename, best module name, best module params, and best execution time.
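The execution loop can be pictured with a short Python sketch. This is illustrative rather than the AutoRAG source: run_node_line is the function named above and is assumed in scope with this call shape, and the per-node-line summary.csv location is an assumption.
import os
import pandas as pd

def run_trial(node_lines: dict, trial_dir: str, qa_df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative sketch: results flow forward from one node line to the
    # next, with the QA data seeding the first node line.
    previous_result = qa_df
    summaries = []
    for node_line_name, nodes in node_lines.items():
        node_line_dir = os.path.join(trial_dir, node_line_name)
        os.makedirs(node_line_dir, exist_ok=True)
        previous_result = run_node_line(nodes, node_line_dir, previous_result)
        # Assumed: each node line writes its own summary.csv into its directory.
        summaries.append(pd.read_csv(os.path.join(node_line_dir, "summary.csv")))
    # Per-node-line summaries are aggregated into the trial-level summary.csv.
    pd.concat(summaries, ignore_index=True).to_csv(
        os.path.join(trial_dir, "summary.csv"), index=False
    )
    return previous_result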
Usage
Import Evaluator, construct it with the test QA and corpus data paths (not the training data), and call start_trial with the extracted best config YAML to evaluate the optimized pipeline on held-out test data. This is the standard validation step before production deployment.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/evaluator.py (lines 106-219)
Signature
class Evaluator:
    def __init__(self, qa_data_path: str, corpus_data_path: str,
                 project_dir: Optional[str] = None):
        ...

    def start_trial(self, yaml_path: str, skip_validation: bool = False,
                    full_ingest: bool = True):
        ...
Import
from autorag.evaluator import Evaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qa_data_path | str | yes | Path to the test QA dataset in parquet format (for Evaluator constructor) |
| corpus_data_path | str | yes | Path to the corpus dataset in parquet format (for Evaluator constructor) |
| project_dir | Optional[str] | no | Path to the project directory for storing trial results. Defaults to current working directory. |
| yaml_path | str | yes | Path to the best.yaml config file (output of extract_best_config) for start_trial |
| skip_validation | bool | no | If True, skips config YAML validation. Default is False. |
| full_ingest | bool | no | If True, checks the entire corpus against the vector DB for ingestion. If False, only checks documents in retrieval ground truth. Default is True. |
Outputs
| Name | Type | Description |
|---|---|---|
| trial directory | directory on disk | A new numbered trial directory (e.g., project_dir/0/) containing config.yaml and per-node-line subdirectories with evaluation results |
| summary.csv | CSV file | Trial-level summary with columns: node_line_name, node_type, best_module_filename, best_module_name, best_module_params, best_execution_time |
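Assembled on disk, a completed test trial might look like the illustrative layout below; the node line name is hypothetical, and the resources directory sits at the project level as described above:
my_test_project/
├── resources/                # BM25 indices / vector DB artifacts from ingestion
└── 0/                        # numbered trial directory
    ├── config.yaml           # copy of the best.yaml passed to start_trial
    ├── summary.csv           # trial-level summary described above
    └── retrieve_node_line/   # one subdirectory per node line with results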
External Reference
This is a wrapper doc. The Evaluator.start_trial method is the same API used for the optimization phase. In the test evaluation context, it is reused with a fundamentally different config:
| Aspect | Optimization Trial | Test Evaluation |
|---|---|---|
| Config source | Multi-candidate YAML with many modules per node | best.yaml with exactly one module per node |
| Search behavior | Combinatorial: evaluates all module combinations | Single-path: runs the one specified module per node |
| QA dataset | Training QA pairs | Held-out test QA pairs |
| Purpose | Find the best module at each node | Measure generalization performance of the selected modules |
| Output interpretation | Best modules are selected from candidates | Metrics indicate real-world expected performance |
The Evaluator class itself does not distinguish between these use cases. The difference is entirely determined by the config file passed to start_trial and the QA data provided to the constructor. This reuse pattern avoids code duplication and ensures that test evaluation uses exactly the same execution infrastructure as training evaluation.
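A side-by-side sketch makes the reuse concrete; the paths are hypothetical and follow the examples below:
from autorag.evaluator import Evaluator

# Optimization trial: multi-candidate config plus training QA data.
train_evaluator = Evaluator(
    qa_data_path="./data/qa_train.parquet",
    corpus_data_path="./data/corpus.parquet",
    project_dir="./my_project",
)
train_evaluator.start_trial(yaml_path="./config.yaml")  # combinatorial search

# Test evaluation: same API, but best.yaml plus held-out test QA data.
test_evaluator = Evaluator(
    qa_data_path="./data/qa_test.parquet",
    corpus_data_path="./data/corpus.parquet",
    project_dir="./my_test_project",
)
test_evaluator.start_trial(yaml_path="./my_project/best.yaml")  # single-path run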
Usage Examples
Basic Usage
from autorag.evaluator import Evaluator
from autorag.deploy.base import extract_best_config
# Step 1: Extract the best config from the training trial
extract_best_config(
trial_path="./my_project/0",
output_path="./my_project/best.yaml"
)
# Step 2: Create a new Evaluator with TEST data
test_evaluator = Evaluator(
qa_data_path="./data/qa_test.parquet",
corpus_data_path="./data/corpus.parquet",
project_dir="./my_test_project"
)
# Step 3: Run the best pipeline on the test dataset
test_evaluator.start_trial(yaml_path="./my_project/best.yaml")
# Step 4: Inspect test metrics
import pandas as pd
test_summary = pd.read_csv("./my_test_project/0/summary.csv")
print(test_summary[["node_type", "best_module_name", "best_execution_time"]])
Skip Validation for Speed
from autorag.evaluator import Evaluator
test_evaluator = Evaluator(
qa_data_path="./data/qa_test.parquet",
corpus_data_path="./data/corpus.parquet",
project_dir="./my_test_project"
)
# Skip validation when you know the config is well-formed
test_evaluator.start_trial(
yaml_path="./my_project/best.yaml",
skip_validation=True,
full_ingest=False # Faster: only ingest documents referenced in retrieval GT
)