
Principle:Marker Inc Korea AutoRAG Test Dataset Evaluation

From Leeroopedia
Knowledge Sources
Domains: RAG Pipeline Evaluation, Model Validation
Last Updated: 2026-02-12 00:00 GMT

Overview

Test dataset evaluation measures the generalization performance of an optimized RAG pipeline by running it against held-out QA pairs that were not used during the optimization trial.

Description

AutoRAG's optimization process (the Evaluator trial) searches through combinations of modules to find the best pipeline configuration. However, this optimization is performed on a training QA dataset, and the resulting best configuration may overfit to the specific characteristics of those questions and answers. Test dataset evaluation addresses this by running the same best pipeline against a separate test QA dataset to measure how well it generalizes.

The workflow reuses the core Evaluator.start_trial method, but in a fundamentally different context. During optimization, the trial config contains multiple candidate modules per node, and the Evaluator explores all combinations. During test evaluation, the config is the extracted best config (from extract_best_config), which has exactly one module per node. This means the Evaluator runs a single path through the pipeline rather than a combinatorial search, effectively functioning as a simple forward pass with metric computation.
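The difference between the two configs can be seen side by side. The fragment below follows AutoRAG's YAML layout (node lines containing nodes, each with a `modules` list); the specific module and metric names are illustrative, not taken from a real trial.

```yaml
# Training config: multiple candidates per node -> combinatorial search
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        strategy:
          metrics: [retrieval_recall, retrieval_precision]
        top_k: 3
        modules:
          - module_type: bm25
          - module_type: vectordb

# Extracted best config: exactly one module per node -> single forward pass
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        strategy:
          metrics: [retrieval_recall, retrieval_precision]
        top_k: 3
        modules:
          - module_type: bm25
```

Because the second config offers the Evaluator no alternatives, `start_trial` degenerates into running one pipeline and recording its metrics.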

This reuse is a wrapper pattern: the same infrastructure that performs an exhaustive search is repurposed for single-configuration evaluation by passing it a config in which there is nothing to search. The output is a new trial directory whose summary.csv contains performance metrics (retrieval recall, generation quality, etc.) on the test data, enabling direct comparison with the training trial results.
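Since the training and test summary files share the same metric columns, comparing them reduces to a per-metric subtraction. The sketch below is not part of AutoRAG; the metric names and the 0.05 threshold are illustrative choices for flagging overfitting.

```python
# Illustrative sketch: flag metrics where the training trial scored much
# higher than the test trial, a symptom of overfitting to the training QA set.

def overfitting_gaps(train_metrics, test_metrics, threshold=0.05):
    """Return metrics whose train score exceeds the test score by more
    than `threshold` (hypothetical cutoff, tune per project)."""
    gaps = {}
    for name, train_score in train_metrics.items():
        if name in test_metrics:
            gap = train_score - test_metrics[name]
            if gap > threshold:
                gaps[name] = round(gap, 4)
    return gaps

train = {"retrieval_recall": 0.92, "generation_meteor": 0.61}
test = {"retrieval_recall": 0.78, "generation_meteor": 0.59}
print(overfitting_gaps(train, test))  # only retrieval_recall exceeds the threshold
```

In practice the two dicts would be read from the training and test summary.csv files before being compared.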

Usage

Use test dataset evaluation after optimization is complete and the best config has been extracted. It is the standard validation step before deploying a pipeline to production. Create a separate Evaluator instance with the test QA dataset (the corpus itself can be reused, since only module selection is optimized), and pass the extracted best YAML file to start_trial to evaluate the optimized pipeline on the unseen test set.
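The step above might look like the following. This is a hedged sketch: the import path and the constructor keyword names (`qa_data_path`, `corpus_data_path`, `project_dir`) follow AutoRAG's documented Evaluator interface but may differ across versions, and the paths in the usage comment are placeholders.

```python
# Hedged sketch of the test-evaluation step (parameter names assumed,
# verify against the installed AutoRAG version).

def evaluate_on_test(test_qa_path, corpus_path, best_config_path, project_dir):
    """Run the extracted best pipeline once over the held-out test QA set."""
    from autorag.evaluator import Evaluator  # assumed import path

    evaluator = Evaluator(
        qa_data_path=test_qa_path,      # held-out QA pairs, not the training set
        corpus_data_path=corpus_path,   # the corpus itself can be reused
        project_dir=project_dir,        # a fresh directory for the test trial
    )
    # best_config_path is the YAML produced by extract_best_config, so each
    # node has exactly one module and no search takes place.
    evaluator.start_trial(best_config_path)

# Usage (paths are illustrative):
# evaluate_on_test("test_qa.parquet", "corpus.parquet", "best.yaml", "./test_project")
```

The new project directory keeps the test trial's summary.csv separate from the training trial's outputs.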

Theoretical Basis

Test dataset evaluation follows the standard train/test split methodology from machine learning:

1. Split QA dataset into QA_train and QA_test
2. Optimization phase:
   evaluator_train = Evaluator(QA_train, corpus)
   evaluator_train.start_trial(multi_candidate_config.yaml)
   -> Produces summary.csv with best modules per node

3. Config extraction:
   best_config = extract_best_config(train_trial_path)
   -> Produces best.yaml with one module per node

4. Test evaluation:
   evaluator_test = Evaluator(QA_test, corpus)
   evaluator_test.start_trial(best.yaml)
   -> Produces test summary.csv with generalization metrics
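Step 1 of the workflow above can be sketched in plain Python. AutoRAG stores QA data as parquet files; simple dicts in a list stand in for QA rows here, and the 0.2 test fraction is an illustrative choice.

```python
import random

# Minimal sketch of the train/test split: the test partition must never be
# seen during the optimization trial.

def split_qa(qa_pairs, test_fraction=0.2, seed=42):
    """Shuffle and split QA pairs into (train, test) partitions."""
    rng = random.Random(seed)
    shuffled = qa_pairs[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

qa = [{"qid": i, "query": f"q{i}"} for i in range(10)]
qa_train, qa_test = split_qa(qa)
print(len(qa_train), len(qa_test))  # 8 2
```

A fixed seed keeps the split reproducible, so repeated trials evaluate against the same held-out questions.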

Key properties:

  • No information leakage: The test QA pairs must not have been seen during optimization. The same corpus can be reused because the corpus is not optimized -- only the module selection is.
  • Single-path execution: Because best.yaml has one module per node, the Evaluator performs no search. Each node runs exactly one module, and the "best" is trivially the only candidate.
  • Metric comparability: The test summary.csv uses the same metric columns as the training summary.csv, enabling direct numerical comparison to detect overfitting (where training metrics are much better than test metrics).
  • Corpus ingestion: BM25 and vector database ingestion still occurs during test evaluation, but only for the specific modules in the best config (not for all candidates as in the training trial).
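The single-path property can be made concrete by counting pipeline paths. Following the article's description of a combinatorial search over per-node candidates, the number of paths is the product of candidate counts; the counts below are hypothetical, not taken from a real config.

```python
from math import prod

# Illustrative count of pipeline paths a trial evaluates.

def num_pipeline_paths(candidates_per_node):
    """A trial evaluates the product of per-node candidate counts."""
    return prod(candidates_per_node)

training_config = [2, 3, 2]  # e.g. 2 retrieval, 3 reranker, 2 generator candidates
best_config = [1, 1, 1]      # extract_best_config keeps one module per node

print(num_pipeline_paths(training_config))  # 12 combinations searched
print(num_pipeline_paths(best_config))      # 1: a single forward pass
```

This is why running start_trial on the extracted best config behaves like a forward pass with metric computation rather than a search.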

Related Pages

Implemented By

Uses Heuristic
