Implementation:Marker Inc Korea AutoRAG Validator Validate

From Leeroopedia
Knowledge Sources
Domains: Configuration Management, RAG Pipeline Optimization
Last Updated: 2026-02-12 00:00 GMT

Overview

Validator is a concrete tool, provided by the AutoRAG framework, for validating a pipeline configuration against sampled data before running a full optimization trial.

Description

The Validator class provides a lightweight pre-flight check for AutoRAG pipeline configurations. On initialization, it loads and casts the QA and corpus datasets from parquet files. The validate method samples a small number of QA rows, extracts the corresponding corpus documents based on ground-truth retrieval IDs, writes both to temporary parquet files, and runs a complete mini trial using the Evaluator in a temporary directory. If any part of the pipeline fails during this mini trial, the error propagates immediately, allowing users to diagnose and fix configuration problems without waiting for a full trial to fail partway through. Temporary files and directories are cleaned up after validation completes.
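The sampling step described above can be sketched without AutoRAG or pandas. This is a minimal illustration, not the library's code: the function name sample_for_validation is hypothetical, rows are plain dictionaries rather than DataFrame rows, and the column names retrieval_gt (assumed to be a list of lists of document IDs) and doc_id are assumptions about the dataset layout.

```python
import random
from itertools import chain


def sample_for_validation(qa_rows, corpus_rows, qa_cnt=5, random_state=42):
    """Pick a few QA rows and only the corpus documents they reference."""
    # Never ask for more rows than the dataset holds.
    sample_size = min(qa_cnt, len(qa_rows))
    rng = random.Random(random_state)  # seeded so the same rows come back each run
    sampled_qa = rng.sample(qa_rows, sample_size)

    # Assumed layout: retrieval_gt is a list of lists of document IDs.
    wanted_ids = set()
    for row in sampled_qa:
        wanted_ids.update(chain.from_iterable(row["retrieval_gt"]))

    # Keep only the corpus documents that the sampled questions point at.
    sampled_corpus = [doc for doc in corpus_rows if doc["doc_id"] in wanted_ids]
    return sampled_qa, sampled_corpus


# Toy data standing in for the QA and corpus parquet files.
qa_rows = [
    {"qid": "q1", "retrieval_gt": [["d1"], ["d2"]]},
    {"qid": "q2", "retrieval_gt": [["d3"]]},
    {"qid": "q3", "retrieval_gt": [["d1", "d4"]]},
]
corpus_rows = [{"doc_id": f"d{i}", "contents": f"text {i}"} for i in range(1, 7)]

sampled_qa, sampled_corpus = sample_for_validation(qa_rows, corpus_rows, qa_cnt=2)
print(len(sampled_qa), [doc["doc_id"] for doc in sampled_corpus])
```

Restricting the corpus to the referenced documents keeps the mini trial small while still giving every sampled question its ground-truth passages.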

Usage

Import and instantiate Validator when you want to verify that a YAML configuration is compatible with your data before committing to a full optimization run. This is particularly useful during iterative configuration development or in CI/CD pipelines that test configurations against sample data. The Evaluator.start_trial method calls the Validator internally unless skip_validation=True is passed.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/validator.py (lines 18-98)

Signature

class Validator:
    def __init__(self, qa_data_path: str, corpus_data_path: str):
        """
        Initialize a Validator object.

        :param qa_data_path: The path to the QA dataset. Must be parquet file.
        :param corpus_data_path: The path to the corpus dataset. Must be parquet file.
        """

    def validate(self, yaml_path: str, qa_cnt: int = 5, random_state: int = 42):
        """
        Validate the YAML configuration by running a mini trial on sampled data.

        :param yaml_path: The path to the YAML configuration file.
        :param qa_cnt: The number of QA samples to use for validation. Default is 5.
        :param random_state: Random seed for reproducible sampling. Default is 42.
        """

Import

from autorag.validator import Validator

I/O Contract

Inputs

Name | Type | Required | Description
qa_data_path | str | yes | Path to the QA dataset in parquet format. Must exist and have a .parquet extension.
corpus_data_path | str | yes | Path to the corpus dataset in parquet format. Must exist and have a .parquet extension.
yaml_path | str | yes | Path to the YAML pipeline configuration file to validate.
qa_cnt | int | no | Number of QA rows to sample for the mini trial. Default is 5. If the dataset has fewer rows, all rows are used.
random_state | int | no | Random seed for reproducible sampling. Default is 42.
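The two path requirements above (the file exists and ends in .parquet) can also be checked by the caller before constructing a Validator, which gives a clearer message in scripts. The helper below is a hypothetical sketch; check_parquet_path is not part of AutoRAG, and the exception types it raises are a choice made for this example, not necessarily what the library raises.

```python
import tempfile
from pathlib import Path


def check_parquet_path(path):
    """Raise early if path is not an existing .parquet file; return it as a Path."""
    p = Path(path)
    if p.suffix != ".parquet":
        raise ValueError(f"{path} is not a .parquet file")
    if not p.is_file():
        raise FileNotFoundError(f"{path} does not exist")
    return p


# Demonstrate with an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "qa.parquet"
    good.touch()
    print(check_parquet_path(good).name)
```

The check only looks at the file name and its presence on disk; it does not confirm that the contents are valid parquet.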

Outputs

Name | Type | Description
None | None | The method returns None on success. It raises an exception if validation fails at any point (invalid paths, malformed YAML, module errors, metric computation failures, etc.).

Usage Examples

Basic Usage

from autorag.validator import Validator

# Initialize with dataset paths
validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)

# Validate a pipeline configuration
validator.validate("config/pipeline.yaml")
print("Configuration is valid!")

Custom Sample Size

from autorag.validator import Validator

validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)

# Use a larger sample for more thorough validation
try:
    validator.validate("config/pipeline.yaml", qa_cnt=10, random_state=123)
    print("Validation passed.")
except Exception as e:
    print(f"Validation failed: {e}")

Related Pages

Implements Principle
