Implementation: Marker Inc Korea AutoRAG Evaluator Start Trial
| Knowledge Sources | |
|---|---|
| Domains | RAG Pipeline Evaluation, Model Validation |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool from the AutoRAG framework for evaluating an optimized RAG pipeline against a test dataset by reusing the core Evaluator trial infrastructure.
Description
The Evaluator.start_trial method orchestrates a complete pipeline evaluation. When used for test dataset evaluation, it is called with the best.yaml configuration (the output of extract_best_config) rather than a multi-candidate search config. Because the best config has exactly one module per node, the Evaluator performs a single forward pass rather than a combinatorial search.
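For orientation, a best.yaml might look like the sketch below. The node line, node, and module names are hypothetical and the schema is abbreviated; the essential property is a single modules entry per node, which is what turns the trial into a single forward pass:
# Hypothetical, abbreviated best.yaml sketch -- not generated output.
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        top_k: 10
        modules:
          - module_type: bm25   # exactly one winning module for this node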
The method proceeds through several phases. First, it creates a resources directory and optionally validates the config YAML via the Validator class. Then it generates a new trial name (an incrementing integer), creates the trial directory, and copies the config YAML into it. Next, it handles corpus ingestion: BM25 indices are built if any node uses BM25 retrieval, and vector database embeddings are ingested if any node uses vectordb retrieval. The ingestion respects the full_ingest parameter: when True, the entire corpus is checked against the vector store; when False, only documents referenced in the retrieval ground truth are considered.
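The trial-naming step is simple enough to sketch. The helper below is hypothetical (not AutoRAG source) and assumes trial directories are the only numerically named entries in the project directory:
import os

def next_trial_name(project_dir: str) -> str:
    # Hypothetical sketch: trial names are incrementing integers ("0", "1", ...),
    # so the next name is one past the largest existing numeric directory.
    existing = [int(d) for d in os.listdir(project_dir) if d.isdigit()]
    return str(max(existing) + 1) if existing else "0"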
The core execution iterates over each node line in the config, calling run_node_line to process all nodes within that line. The first node line receives the QA data as its initial input. Results flow forward from each node line to the next. After all node lines complete, the per-node-line summaries are aggregated into a trial-level summary.csv containing columns for node line name, node type, best module filename, best module name, best module params, and best execution time.
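The execution loop can be pictured with a short Python sketch. This is illustrative rather than the AutoRAG source: run_node_line is the function named above and is assumed in scope with this call shape, and the per-node-line summary.csv location is an assumption.
import os
import pandas as pd

def run_trial(node_lines: dict, trial_dir: str, qa_df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative sketch: results flow forward from one node line to the
    # next, with the QA data seeding the first node line.
    previous_result = qa_df
    summaries = []
    for node_line_name, nodes in node_lines.items():
        node_line_dir = os.path.join(trial_dir, node_line_name)
        os.makedirs(node_line_dir, exist_ok=True)
        previous_result = run_node_line(nodes, node_line_dir, previous_result)
        # Assumed: each node line writes its own summary.csv into its directory.
        summaries.append(pd.read_csv(os.path.join(node_line_dir, "summary.csv")))
    # Per-node-line summaries are aggregated into the trial-level summary.csv.
    pd.concat(summaries, ignore_index=True).to_csv(
        os.path.join(trial_dir, "summary.csv"), index=False
    )
    return previous_result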
Usage
Import Evaluator, construct it with the test QA and corpus data paths (not the training data), and call start_trial with the extracted best config YAML to evaluate the optimized pipeline on held-out test data. This is the standard validation step before production deployment.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/evaluator.py (lines 106-219)
Signature
class Evaluator:
    def __init__(self, qa_data_path: str, corpus_data_path: str,
                 project_dir: Optional[str] = None):
        ...

    def start_trial(self, yaml_path: str, skip_validation: bool = False,
                    full_ingest: bool = True):
        ...
Import
from autorag.evaluator import Evaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qa_data_path | str | yes | Path to the test QA dataset in parquet format (for Evaluator constructor) |
| corpus_data_path | str | yes | Path to the corpus dataset in parquet format (for Evaluator constructor) |
| project_dir | Optional[str] | no | Path to the project directory for storing trial results. Defaults to current working directory. |
| yaml_path | str | yes | Path to the best.yaml config file (output of extract_best_config) for start_trial |
| skip_validation | bool | no | If True, skips config YAML validation. Default is False. |
| full_ingest | bool | no | If True, checks the entire corpus against the vector DB for ingestion. If False, only checks documents in retrieval ground truth. Default is True. |
Outputs
| Name | Type | Description |
|---|---|---|
| trial directory | directory on disk | A new numbered trial directory (e.g., project_dir/0/) containing config.yaml and per-node-line subdirectories with evaluation results |
| summary.csv | CSV file | Trial-level summary with columns: node_line_name, node_type, best_module_filename, best_module_name, best_module_params, best_execution_time |
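Assembled on disk, a completed test trial might look like the illustrative layout below; the node line name is hypothetical, and the resources directory sits at the project level as described above:
my_test_project/
├── resources/                # BM25 indices / vector DB artifacts from ingestion
└── 0/                        # numbered trial directory
    ├── config.yaml           # copy of the best.yaml passed to start_trial
    ├── summary.csv           # trial-level summary described above
    └── retrieve_node_line/   # one subdirectory per node line with results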
External Reference
This is a wrapper doc. The Evaluator.start_trial method is the same API used for the optimization phase. In the test evaluation context, it is reused with a fundamentally different config:
| Aspect | Optimization Trial | Test Evaluation |
|---|---|---|
| Config source | Multi-candidate YAML with many modules per node | best.yaml with exactly one module per node |
| Search behavior | Combinatorial: evaluates all module combinations | Single-path: runs the one specified module per node |
| QA dataset | Training QA pairs | Held-out test QA pairs |
| Purpose | Find the best module at each node | Measure generalization performance of the selected modules |
| Output interpretation | Best modules are selected from candidates | Metrics indicate real-world expected performance |
The Evaluator class itself does not distinguish between these use cases. The difference is entirely determined by the config file passed to start_trial and the QA data provided to the constructor. This reuse pattern avoids code duplication and ensures that test evaluation uses exactly the same execution infrastructure as training evaluation.
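A side-by-side sketch makes the reuse concrete; the paths are hypothetical and follow the examples below:
from autorag.evaluator import Evaluator

# Optimization trial: multi-candidate config plus training QA data.
train_evaluator = Evaluator(
    qa_data_path="./data/qa_train.parquet",
    corpus_data_path="./data/corpus.parquet",
    project_dir="./my_project",
)
train_evaluator.start_trial(yaml_path="./config.yaml")  # combinatorial search

# Test evaluation: same API, but best.yaml plus held-out test QA data.
test_evaluator = Evaluator(
    qa_data_path="./data/qa_test.parquet",
    corpus_data_path="./data/corpus.parquet",
    project_dir="./my_test_project",
)
test_evaluator.start_trial(yaml_path="./my_project/best.yaml")  # single-path run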
Usage Examples
Basic Usage
from autorag.evaluator import Evaluator
from autorag.deploy.base import extract_best_config
# Step 1: Extract the best config from the training trial
extract_best_config(
trial_path="./my_project/0",
output_path="./my_project/best.yaml"
)
# Step 2: Create a new Evaluator with TEST data
test_evaluator = Evaluator(
qa_data_path="./data/qa_test.parquet",
corpus_data_path="./data/corpus.parquet",
project_dir="./my_test_project"
)
# Step 3: Run the best pipeline on the test dataset
test_evaluator.start_trial(yaml_path="./my_project/best.yaml")
# Step 4: Inspect test metrics
import pandas as pd
test_summary = pd.read_csv("./my_test_project/0/summary.csv")
print(test_summary[["node_type", "best_module_name", "best_execution_time"]])
Skip Validation for Speed
from autorag.evaluator import Evaluator
test_evaluator = Evaluator(
qa_data_path="./data/qa_test.parquet",
corpus_data_path="./data/corpus.parquet",
project_dir="./my_test_project"
)
# Skip validation when you know the config is well-formed
test_evaluator.start_trial(
yaml_path="./my_project/best.yaml",
skip_validation=True,
full_ingest=False # Faster: only ingest documents referenced in retrieval GT
)