Implementation:Marker Inc Korea AutoRAG Validator Validate

Knowledge Sources	AutoRAG
Domains	Configuration Management, RAG Pipeline Optimization
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, that checks a pipeline configuration against a small sample of the data before a full optimization trial is run.

Description

The Validator class provides a lightweight pre-flight check for AutoRAG pipeline configurations. On initialization, it loads and casts the QA and corpus datasets from parquet files. The validate method samples a small number of QA rows, extracts the corresponding corpus documents based on ground-truth retrieval IDs, writes both to temporary parquet files, and runs a complete mini trial using the Evaluator in a temporary directory. If any part of the pipeline fails during this mini trial, the error propagates immediately, allowing users to diagnose and fix configuration problems without waiting for a full trial to fail partway through. Temporary files and directories are cleaned up after validation completes.
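The sample-then-subset flow described above can be sketched with plain Python structures standing in for the DataFrames and parquet files. This is an illustrative sketch, not the code in autorag/validator.py: the field names retrieval_gt (a list of lists of document IDs) and doc_id, the JSON files, and the helper names sample_qa_and_corpus and run_mini_trial are all assumptions made for the example.

```python
import json
import os
import random
import tempfile
from itertools import chain


def sample_qa_and_corpus(qa_rows, corpus_rows, qa_cnt=5, random_state=42):
    """Sample QA rows and keep only the corpus documents they reference.

    Assumes (hypothetically) that each QA row has a "retrieval_gt" field
    holding a list of lists of document IDs, and each corpus row a "doc_id".
    """
    # Never request more rows than the dataset holds.
    sample_size = min(qa_cnt, len(qa_rows))
    # A seeded generator makes the sample reproducible.
    rng = random.Random(random_state)
    sampled_qa = rng.sample(qa_rows, sample_size)

    # Flatten the nested ground-truth IDs into one set.
    target_ids = set()
    for row in sampled_qa:
        target_ids.update(chain.from_iterable(row["retrieval_gt"]))

    sampled_corpus = [doc for doc in corpus_rows if doc["doc_id"] in target_ids]
    return sampled_qa, sampled_corpus


def run_mini_trial(sampled_qa, sampled_corpus, run_trial):
    """Write the samples to a temporary workspace and run a trial callable.

    JSON stands in for parquet here. The workspace is deleted when the
    `with` block exits, whether run_trial returns or raises.
    """
    with tempfile.TemporaryDirectory() as workdir:
        qa_path = os.path.join(workdir, "qa.json")
        corpus_path = os.path.join(workdir, "corpus.json")
        with open(qa_path, "w", encoding="utf-8") as f:
            json.dump(sampled_qa, f)
        with open(corpus_path, "w", encoding="utf-8") as f:
            json.dump(sampled_corpus, f)
        project_dir = os.path.join(workdir, "project")
        os.makedirs(project_dir)
        # Any exception raised by the trial propagates to the caller.
        run_trial(qa_path, corpus_path, project_dir)


# Toy data: 10 questions, each pointing at two of 20 documents.
qa_rows = [
    {"qid": f"q{i}", "query": f"question {i}", "retrieval_gt": [[f"d{i}"], [f"d{i + 1}"]]}
    for i in range(10)
]
corpus_rows = [{"doc_id": f"d{i}", "contents": f"document {i}"} for i in range(20)]

sampled_qa, sampled_corpus = sample_qa_and_corpus(qa_rows, corpus_rows, qa_cnt=3)
```

Keeping only the referenced documents keeps the mini trial small while still letting every sampled question find its ground-truth passages.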

Usage

Import and instantiate Validator when you want to verify that a YAML configuration is compatible with your data before committing to a full optimization run. This is particularly useful during iterative configuration development or in CI/CD pipelines that test configurations against sample data. The Evaluator.start_trial method calls the Validator internally unless skip_validation=True is passed.
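For the CI/CD case, one possible arrangement is a small script that turns the outcome of validate into a process exit code. The helper names preflight and main are hypothetical, and the dataset and YAML paths are placeholders. Only the Validator constructor and validate call follow the signature documented on this page.

```python
import sys
from typing import Callable


def preflight(validate: Callable[[str], None], yaml_path: str) -> int:
    """Run one validation call and translate the outcome into an exit code."""
    try:
        validate(yaml_path)
    except Exception as exc:  # the mini trial can fail in many different ways
        print(f"Validation failed for {yaml_path}: {exc}", file=sys.stderr)
        return 1
    print(f"Validation passed for {yaml_path}")
    return 0


def main() -> int:
    # Imported lazily so preflight() can be exercised without AutoRAG installed.
    from autorag.validator import Validator

    validator = Validator(
        qa_data_path="data/qa.parquet",        # placeholder path
        corpus_data_path="data/corpus.parquet",  # placeholder path
    )
    return preflight(validator.validate, "config/pipeline.yaml")
```

A CI job would end the script with sys.exit(main()). If this check has already passed in an earlier step, the later full run can pass skip_validation=True to Evaluator.start_trial so the same mini trial is not repeated.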

Code Reference

Source Location

Repository: AutoRAG
File: autorag/validator.py (lines 18-98)

Signature

class Validator:
    def __init__(self, qa_data_path: str, corpus_data_path: str):
        """
        Initialize a Validator object.

        :param qa_data_path: The path to the QA dataset. Must be parquet file.
        :param corpus_data_path: The path to the corpus dataset. Must be parquet file.
        """

    def validate(self, yaml_path: str, qa_cnt: int = 5, random_state: int = 42):
        """
        Validate the YAML configuration by running a mini trial on sampled data.

        :param yaml_path: The path to the YAML configuration file.
        :param qa_cnt: The number of QA samples to use for validation. Default is 5.
        :param random_state: Random seed for reproducible sampling. Default is 42.
        """

Import

from autorag.validator import Validator

I/O Contract

Inputs

Name	Type	Required	Description
qa_data_path	str	yes	Path to the QA dataset in parquet format. Must exist and have a .parquet extension.
corpus_data_path	str	yes	Path to the corpus dataset in parquet format. Must exist and have a .parquet extension.
yaml_path	str	yes	Path to the YAML pipeline configuration file to validate.
qa_cnt	int	no	Number of QA rows to sample for the mini trial. Default is 5. If the dataset has fewer rows, all rows are used.
random_state	int	no	Random seed for reproducible sampling. Default is 42.
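
The two path requirements in the table (the file must exist and must end in .parquet) amount to a precondition check along the following lines. This is a sketch only: the helper name check_parquet_path, the choice of ValueError, and the message wording are assumptions, not taken from the AutoRAG source.

```python
import os
import tempfile


def check_parquet_path(path: str, label: str) -> None:
    """Raise ValueError if `path` is missing or does not end in .parquet."""
    if not os.path.exists(path):
        raise ValueError(f"{label} path {path} does not exist.")
    if not path.endswith(".parquet"):
        raise ValueError(f"{label} path {path} is not a parquet file.")


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "qa.parquet")
    open(good, "wb").close()  # empty placeholder; only the path is inspected
    check_parquet_path(good, "QA")  # passes silently

    try:
        check_parquet_path(os.path.join(tmp, "missing.parquet"), "QA")
    except ValueError as exc:
        missing_error = str(exc)
```

Running such a check yourself before constructing a Validator gives a clearer error message in scripts that build dataset paths dynamically.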

Outputs

Name	Type	Description
(return value)	None	validate returns None on success. It raises an exception if validation fails at any point (invalid paths, malformed YAML, module errors, metric computation failures, etc.).

Usage Examples

Basic Usage

from autorag.validator import Validator

# Initialize with dataset paths
validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)

# Validate a pipeline configuration
validator.validate("config/pipeline.yaml")
print("Configuration is valid!")

Custom Sample Size

from autorag.validator import Validator

validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)

# Use a larger sample for more thorough validation
try:
    validator.validate("config/pipeline.yaml", qa_cnt=10, random_state=123)
    print("Validation passed.")
except Exception as e:
    print(f"Validation failed: {e}")

Related Pages

Implements Principle

Principle:Marker_Inc_Korea_AutoRAG_Configuration_Validation
