Implementation:Marker Inc Korea AutoRAG Validator Validate
| Knowledge Sources | |
|---|---|
| Domains | Configuration Management, RAG Pipeline Optimization |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the AutoRAG framework, for validating a pipeline configuration against sampled data before running a full optimization trial.
Description
The Validator class provides a lightweight pre-flight check for AutoRAG pipeline configurations. On initialization, it loads and casts the QA and corpus datasets from parquet files. The validate method samples a small number of QA rows, extracts the corresponding corpus documents based on ground-truth retrieval IDs, writes both to temporary parquet files, and runs a complete mini trial using the Evaluator in a temporary directory. If any part of the pipeline fails during this mini trial, the error propagates immediately, allowing users to diagnose and fix configuration problems without waiting for a full trial to fail partway through. Temporary files and directories are cleaned up after validation completes.
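The sampling step described above can be sketched without AutoRAG or pandas. This is a minimal illustration on plain Python records, not the library's implementation. The field names `retrieval_gt` and `doc_id`, the nested-list shape of the ground truth, and the helper name `sample_for_validation` are assumptions made for the example.

```python
import random
from itertools import chain


def sample_for_validation(qa_rows, corpus_rows, qa_cnt=5, random_state=42):
    """Pick a few QA rows and only the corpus documents they reference."""
    # Never ask for more rows than the dataset holds.
    n = min(qa_cnt, len(qa_rows))
    rng = random.Random(random_state)  # seeded, so the sample is reproducible
    sampled_qa = rng.sample(qa_rows, n)

    # Assumed shape: retrieval_gt is a list of lists of document IDs.
    wanted_ids = set()
    for row in sampled_qa:
        wanted_ids.update(chain.from_iterable(row["retrieval_gt"]))

    # Keep only the documents needed to score the sampled questions.
    sampled_corpus = [doc for doc in corpus_rows if doc["doc_id"] in wanted_ids]
    return sampled_qa, sampled_corpus


# Hypothetical toy data: 10 questions, each pointing at two documents.
qa = [{"qid": f"q{i}", "retrieval_gt": [[f"d{i}"], [f"d{i + 1}"]]} for i in range(10)]
corpus = [{"doc_id": f"d{i}", "contents": f"text {i}"} for i in range(11)]
small_qa, small_corpus = sample_for_validation(qa, corpus, qa_cnt=3)
```

Restricting the corpus to the referenced documents keeps the mini trial small while still letting retrieval metrics find every ground-truth passage.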
Usage
Import and instantiate Validator when you want to verify that a YAML configuration is compatible with your data before committing to a full optimization run. This is particularly useful during iterative configuration development or in CI/CD pipelines that test configurations against sample data. The Evaluator.start_trial method calls the Validator internally unless skip_validation=True is passed.
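For the CI/CD case, one possible pattern is to check several configurations in a single job and report every failure rather than stopping at the first. The sketch below assumes only that validation signals failure by raising; `check_configs` and `fake_validate` are hypothetical names, and `fake_validate` stands in for `validator.validate` so the example runs without AutoRAG or data.

```python
def check_configs(validate, yaml_paths):
    """Run `validate` on each config and collect failures instead of stopping."""
    failures = {}
    for path in yaml_paths:
        try:
            validate(path)
        except Exception as exc:  # validation signals failure by raising
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in for validator.validate (hypothetical behavior for illustration).
def fake_validate(path):
    if "broken" in path:
        raise ValueError("unknown module type")


failures = check_configs(fake_validate, ["config/ok.yaml", "config/broken.yaml"])
for path, reason in failures.items():
    print(f"{path}: {reason}")
```

In a real job, pass `validator.validate` as the callable and exit with a non-zero status when the returned dictionary is not empty.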
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/validator.py (lines 18-98)
Signature
class Validator:
    def __init__(self, qa_data_path: str, corpus_data_path: str):
        """
        Initialize a Validator object.
        :param qa_data_path: The path to the QA dataset. Must be parquet file.
        :param corpus_data_path: The path to the corpus dataset. Must be parquet file.
        """

    def validate(self, yaml_path: str, qa_cnt: int = 5, random_state: int = 42):
        """
        Validate the YAML configuration by running a mini trial on sampled data.
        :param yaml_path: The path to the YAML configuration file.
        :param qa_cnt: The number of QA samples to use for validation. Default is 5.
        :param random_state: Random seed for reproducible sampling. Default is 42.
        """
Import
from autorag.validator import Validator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qa_data_path | str | yes | Path to the QA dataset in parquet format. Must exist and have a .parquet extension. |
| corpus_data_path | str | yes | Path to the corpus dataset in parquet format. Must exist and have a .parquet extension. |
| yaml_path | str | yes | Path to the YAML pipeline configuration file to validate. |
| qa_cnt | int | no | Number of QA rows to sample for the mini trial. Default is 5. If the dataset has fewer rows, all rows are used. |
| random_state | int | no | Random seed for reproducible sampling. Default is 42. |
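The path requirements in the table can be checked before constructing a Validator. The following is a rough sketch of such a pre-check, not AutoRAG's own code: the function name `check_parquet_path` and the choice of `ValueError` are assumptions, since the source only states that an exception is raised.

```python
import os
import tempfile


def check_parquet_path(path):
    """Raise ValueError unless `path` exists and ends with .parquet."""
    if not os.path.exists(path):
        raise ValueError(f"{path} does not exist.")
    if not path.endswith(".parquet"):
        raise ValueError(f"{path} is not a parquet file.")


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "qa.parquet")
    bad = os.path.join(tmp, "qa.csv")
    for p in (good, bad):
        open(p, "wb").close()  # empty placeholder files; contents are not inspected

    candidates = [
        ("good", good),
        ("bad", bad),
        ("missing", os.path.join(tmp, "none.parquet")),
    ]
    results = {}
    for name, p in candidates:
        try:
            check_parquet_path(p)
            results[name] = "ok"
        except ValueError:
            results[name] = "rejected"
```

Note that this check looks only at the file name and its existence; a file with the right extension but invalid contents would still fail later when it is read.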
Outputs
| Name | Type | Description |
|---|---|---|
| None | None | The method returns None on success. It raises an exception if validation fails at any point (invalid paths, malformed YAML, module errors, metric computation failures, etc.). |
Usage Examples
Basic Usage
from autorag.validator import Validator
# Initialize with dataset paths
validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)
# Validate a pipeline configuration
validator.validate("config/pipeline.yaml")
print("Configuration is valid!")
Custom Sample Size
from autorag.validator import Validator
validator = Validator(
    qa_data_path="data/qa.parquet",
    corpus_data_path="data/corpus.parquet",
)
# Use a larger sample for more thorough validation
try:
    validator.validate("config/pipeline.yaml", qa_cnt=10, random_state=123)
    print("Validation passed.")
except Exception as e:
    print(f"Validation failed: {e}")