Workflow:Marker Inc Korea AutoRAG RAG Pipeline Optimization

Knowledge Sources	AutoRAG, AutoRAG Docs, AutoRAG Paper
Domains	RAG, LLM_Ops, Evaluation
Last Updated	2026-02-12 12:00 GMT

Overview

End-to-end process for automatically finding the best-performing RAG pipeline configuration by evaluating every candidate module at each pipeline node against a QA evaluation dataset.

Description

This workflow is the core purpose of AutoRAG: given a QA evaluation dataset and a corpus, it systematically evaluates the RAG modules (query expansion, retrieval, reranking, prompt making, generation) and parameter combinations specified in a YAML configuration file. The Evaluator orchestrates a trial that runs each node line sequentially; within a line, nodes are processed in order and every candidate module for a node is evaluated. A best-result-forward strategy propagates the winning module's output to downstream nodes, so the search is exhaustive within each node but greedy across nodes. Results are scored using configurable metrics (retrieval: e.g. F1, Recall, Precision, NDCG, MAP, MRR; generation: e.g. BLEU, METEOR, ROUGE, Semantic Score, G-Eval, BERTScore). The output is a trial folder containing per-node results, a summary CSV identifying the best module per node, and a best.yaml configuration file.
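The best-result-forward idea can be illustrated with a toy sketch. This is not AutoRAG's implementation: the `Module` type, `run_node_line`, and the hard-coded scores are hypothetical, and each "module" is just a function that returns an output and a score.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a module maps the upstream output to (output, score).
Module = Callable[[str], Tuple[str, float]]


def run_node_line(nodes: List[Dict[str, Module]], initial_input: str):
    """Greedy best-result-forward search over an ordered list of nodes."""
    current = initial_input
    winners = []
    for candidates in nodes:
        # Evaluate every candidate module on the same upstream output.
        results = {name: module(current) for name, module in candidates.items()}
        best_name = max(results, key=lambda name: results[name][1])
        winners.append(best_name)
        # Only the winner's output is passed on to the next node.
        current = results[best_name][0]
    return winners, current


# Toy candidates with made-up scores, for illustration only.
retrieval = {
    "bm25": lambda q: (q + " -> bm25 passages", 0.60),
    "vectordb": lambda q: (q + " -> vector passages", 0.72),
}
reranker = {
    "pass": lambda p: (p, 0.72),
    "rerank": lambda p: (p + " -> reranked", 0.80),
}

winners, final_output = run_node_line([retrieval, reranker], "query")
print(winners)
print(final_output)
```

In this sketch, two nodes with two candidates each cost 2 + 2 module runs rather than 2 × 2, which is the practical benefit of forwarding only the winner.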

Usage

Execute this workflow after you have prepared QA and corpus parquet files (via the Evaluation Data Creation workflow or manually). Use this when you want to determine which combination of retrieval strategy, reranker, prompt template, and LLM generator performs best for your specific dataset. The YAML config file defines the search space of modules and parameters to evaluate.

Execution Steps

Step 1: Prepare Configuration YAML

Define the pipeline search space in a YAML configuration file. The config specifies node lines (sequential groups), nodes within each line (e.g., query_expansion, lexical_retrieval, semantic_retrieval, hybrid_retrieval, passage_reranker, prompt_maker, generator), and the modules to evaluate per node with their parameter grids. Each node also specifies a strategy section defining which metrics to optimize and optional speed thresholds.

Key considerations:

Pre-made sample YAML files are provided for common configurations
Modules can have parameter arrays that expand into a combinatorial grid
Vector database configuration is specified at the top level
Environment variables (${VAR}) are supported for secrets
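As a rough illustration of how parameter arrays expand into a combinatorial grid, the sketch below treats every list-valued parameter of a module entry as a search axis. The function `expand_module_params` and the parameter values are hypothetical; AutoRAG's own expansion logic may treat some list-valued parameters differently.

```python
from itertools import product
from typing import Any, Dict, List


def expand_module_params(module: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand list-valued parameters into one config per combination."""
    fixed = {k: v for k, v in module.items() if not isinstance(v, list)}
    grid = {k: v for k, v in module.items() if isinstance(v, list)}
    if not grid:
        return [fixed]
    keys = list(grid)
    return [
        {**fixed, **dict(zip(keys, values))}
        for values in product(*(grid[k] for k in keys))
    ]


# Illustrative module entry, written as the dict a YAML loader would produce.
module = {
    "module_type": "bm25",
    "top_k": [3, 5],
    "bm25_tokenizer": ["porter_stemmer", "space"],
}

combos = expand_module_params(module)
for combo in combos:
    print(combo)
```

Two values for each of two parameters yield four configurations, so grid size grows multiplicatively with each added array.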

Step 2: Validate Configuration and Data

Optionally run the Validator to verify that the YAML configuration, QA dataset, and corpus dataset are compatible and that all required dependencies are available. The validator runs a minimal version of the full optimization to catch configuration errors, missing modules, or data format issues before committing to a full trial run.

Key considerations:

Validation runs the full pipeline with minimal data to detect errors early
Checks that QA dataset retrieval_gt references exist in the corpus
Verifies that specified modules and LLM models are accessible
Passage augmenter nodes do not support validation currently
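The retrieval_gt check listed above can be approximated independently of the Validator. The sketch below assumes QA rows carry a `qid` and a nested `retrieval_gt` (a list of groups of document IDs) and that the corpus exposes a set of `doc_id` values; the function name `find_missing_gt_ids` is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_missing_gt_ids(
    qa_rows: Iterable[dict], corpus_doc_ids: Set[str]
) -> Dict[str, List[str]]:
    """Return, per qid, the ground-truth doc IDs that are absent from the corpus."""
    missing: Dict[str, List[str]] = {}
    for row in qa_rows:
        # Flatten the nested ground truth (assumed: list of groups of doc IDs).
        flat = [doc_id for group in row["retrieval_gt"] for doc_id in group]
        absent = [doc_id for doc_id in flat if doc_id not in corpus_doc_ids]
        if absent:
            missing[row["qid"]] = absent
    return missing


qa_rows = [
    {"qid": "q1", "retrieval_gt": [["d1"], ["d2", "d3"]]},
    {"qid": "q2", "retrieval_gt": [["d9"]]},
]
corpus_ids = {"d1", "d2", "d3"}

missing = find_missing_gt_ids(qa_rows, corpus_ids)
print(missing)
```

An empty result means every ground-truth reference resolves to a corpus document; any non-empty entry would make retrieval metrics for that query meaningless.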

Step 3: Initialize Evaluator and Ingest Data

Create an Evaluator instance with paths to the QA and corpus parquet files, plus a project directory for output. The Evaluator loads both datasets, validates their schemas, copies them to the project directory, and prepares the data infrastructure. This includes building the BM25 index for lexical retrieval and ingesting corpus embeddings into the configured vector database(s) for semantic retrieval.

What happens:

QA and corpus DataFrames are loaded and schema-validated
BM25 pickle index is built from corpus contents with language-appropriate tokenization
Corpus embeddings are computed and ingested into vector database collections
A trial folder is created with an incremented trial number
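The incrementing trial folder can be pictured with a small sketch. It assumes trial folders are named with integers starting at 0 and that other folders (here a hypothetical `resources` directory) may sit alongside them in the project directory; `next_trial_dir` is an illustrative helper, not an AutoRAG function.

```python
import tempfile
from pathlib import Path


def next_trial_dir(project_dir: Path) -> Path:
    """Create and return the next numbered trial folder inside project_dir."""
    existing = [
        int(p.name)
        for p in project_dir.iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    trial_dir = project_dir / str(max(existing, default=-1) + 1)
    trial_dir.mkdir()
    return trial_dir


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "resources").mkdir()  # non-numeric folders are ignored
    first = next_trial_dir(project).name
    second = next_trial_dir(project).name

print(first, second)
```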

Step 4: Execute Node Line Evaluation

Run the optimization trial by iterating through each node line in the YAML config. For each node line, the system processes nodes sequentially. Within each node, all configured modules and their parameter combinations are evaluated against the QA dataset. Retrieval nodes compute retrieval metrics (F1, Recall, Precision, NDCG, MAP, MRR) by comparing retrieved document IDs against ground truth. Generation nodes compute generation metrics (BLEU, METEOR, ROUGE, Semantic Score, G-Eval, BERTScore) by comparing generated text against ground truth answers.
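To make the retrieval metrics concrete, here is a minimal single-query sketch of precision, recall, F1, and reciprocal rank (MRR is the mean of the latter over all queries). It simplifies the ground truth to a flat set of relevant IDs; the function name `retrieval_scores` is hypothetical and this is not AutoRAG's metric code.

```python
from typing import Dict, List, Set


def retrieval_scores(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Score one query's ranked retrieval result against its relevant doc IDs."""
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Reciprocal rank: 1 / position of the first relevant document, else 0.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


scores = retrieval_scores(["d3", "d1", "d9", "d7"], {"d1", "d2"})
print(scores)
```

Here one of four retrieved documents is relevant (precision 0.25), one of two relevant documents is found (recall 0.5), and the first hit is at rank 2 (reciprocal rank 0.5).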

Key considerations:

Nodes are evaluated with a best-result-forward strategy: the best upstream module's output feeds downstream nodes
Query expansion uses indirect evaluation: expanded queries are evaluated via downstream retrieval performance
Each module run is saved as a parquet file for later analysis
The strategy section determines how the configured metrics are combined to select the winner: mean, rank (reciprocal rank), or normalize_mean
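The choice of selection strategy can change the winner. The sketch below contrasts two of the approaches named above, averaging metric values and summing reciprocal ranks per metric, on made-up scores. The function names and numbers are hypothetical and only illustrate the idea.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # module name -> metric name -> value


def select_by_mean(scores: Scores) -> str:
    """Pick the module with the highest average metric value."""
    return max(
        scores, key=lambda name: sum(scores[name].values()) / len(scores[name])
    )


def select_by_reciprocal_rank(scores: Scores) -> str:
    """Rank modules per metric, sum 1/rank, and pick the highest total."""
    metrics = list(next(iter(scores.values())))
    totals = {name: 0.0 for name in scores}
    for metric in metrics:
        ranked = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ranked, start=1):
            totals[name] += 1.0 / position
    return max(totals, key=totals.get)


# Made-up scores: bm25 has one outlier metric, vectordb wins two metrics narrowly.
scores = {
    "bm25": {"f1": 0.50, "recall": 0.95, "ndcg": 0.50},
    "vectordb": {"f1": 0.55, "recall": 0.60, "ndcg": 0.56},
    "hybrid": {"f1": 0.54, "recall": 0.62, "ndcg": 0.55},
}

print(select_by_mean(scores))
print(select_by_reciprocal_rank(scores))
```

Averaging rewards the single large recall value of `bm25`, while the rank-based approach favors `vectordb`, which places first on two of three metrics. Which behavior is preferable depends on whether metric magnitudes are comparable.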

Step 5: Select Best Modules and Generate Summary

After all node evaluations complete, a summary.csv file is generated in the trial folder listing the best module, its parameters, and performance metrics for each node. The system also generates a best.yaml configuration file representing the optimal end-to-end pipeline. This summary enables comparison across trials and serves as input for pipeline extraction.

Key considerations:

The summary includes execution time per module for speed-aware selection
Multiple trials can be run with different YAML configs to explore different search spaces
Trial metadata (timestamps, config hash) is recorded in trial.json
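A summary file of this kind can be inspected with the standard library alone. In the sketch below the column names (`node_type`, `best_module_name`, `best_module_params`, `best_execution_time`) and the sample rows are assumptions made for illustration; check the header of your own summary.csv before relying on them.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

# Illustrative stand-in for a trial's summary.csv (column names are assumed).
SAMPLE_SUMMARY = """node_line_name,node_type,best_module_name,best_module_params,best_execution_time
retrieve_node_line,retrieval,bm25,"{'top_k': 5}",0.42
post_retrieve_node_line,generator,llama_index_llm,"{'temperature': 0.1}",1.87
"""


def best_modules(summary_file: TextIO) -> Dict[str, Tuple[str, float]]:
    """Map each node type to its winning module name and execution time."""
    reader = csv.DictReader(summary_file)
    return {
        row["node_type"]: (row["best_module_name"], float(row["best_execution_time"]))
        for row in reader
    }


best = best_modules(io.StringIO(SAMPLE_SUMMARY))
for node_type, (module_name, seconds) in best.items():
    print(f"{node_type}: {module_name} ({seconds:.2f}s)")
```

For a real trial, pass an open file handle for the trial folder's summary.csv instead of the in-memory sample.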

Step 6: Review Results via Dashboard

Launch the interactive Panel-based dashboard to visualize trial results. The dashboard displays per-node metric comparisons, module rankings, and allows drilling down into individual query results to understand where specific modules excel or fail.

Key considerations:

The dashboard runs on a configurable port (default 7690)
It reads results directly from the trial folder structure
Comparison across trials is supported
