Workflow:Marker Inc Korea AutoRAG RAG Pipeline Optimization
| Knowledge Sources | |
|---|---|
| Domains | RAG, LLM_Ops, Evaluation |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
End-to-end process for automatically finding the best-performing RAG pipeline configuration by evaluating every configured module and parameter combination at each pipeline node against a QA evaluation dataset.
Description
This workflow is the core purpose of AutoRAG: given a QA evaluation dataset and a corpus, it systematically evaluates the RAG modules (query expansion, retrieval, reranking, prompt making, generation) specified in a YAML configuration file. The Evaluator orchestrates a trial that runs each node line sequentially; within a node, every configured module and parameter combination is evaluated. A best-result-forward strategy then propagates the winning module's output to downstream nodes, so the search is exhaustive within each node and greedy across nodes rather than a full cross-product over the whole pipeline. Results are scored using configurable metrics (retrieval metrics such as F1, Recall, NDCG, and MRR; generation metrics such as BLEU, METEOR, ROUGE, Semantic Score, and G-Eval). The output is a trial folder containing per-node results, a summary CSV identifying the best module per node, and a best.yaml configuration file.
Usage
Run this workflow after you have prepared the QA and corpus parquet files (via the Evaluation Data Creation workflow or manually). Use it to determine which combination of retrieval strategy, reranker, prompt template, and LLM generator performs best on your specific dataset. The YAML config file defines the search space of modules and parameters to evaluate.
Execution Steps
Step 1: Prepare Configuration YAML
Define the pipeline search space in a YAML configuration file. The config specifies node lines (sequential groups), nodes within each line (e.g., query_expansion, lexical_retrieval, semantic_retrieval, hybrid_retrieval, passage_reranker, prompt_maker, generator), and the modules to evaluate per node with their parameter grids. Each node also specifies a strategy section defining which metrics to optimize and optional speed thresholds.
Key considerations:
- Pre-made sample YAML files are provided for common configurations
- Modules can have parameter arrays that expand into a combinatorial grid
- Vector database configuration is specified at the top level
- Environment variables (${VAR}) are supported for secrets
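The two mechanisms above, parameter-grid expansion and `${VAR}` substitution, can be illustrated with a minimal Python sketch. This is not AutoRAG's own loader: the helper names `expand_module` and `substitute_env`, and the example module keys (`module_type`, `top_k`, `bm25_tokenizer`), are assumptions chosen for illustration.

```python
import itertools
import os
from string import Template


def expand_module(module: dict) -> list[dict]:
    """Expand list-valued parameters into every combination (Cartesian product).

    Hypothetical helper: scalar values are kept as-is, list values form the grid.
    """
    fixed = {k: v for k, v in module.items() if not isinstance(v, list)}
    grid = {k: v for k, v in module.items() if isinstance(v, list)}
    if not grid:
        return [fixed]
    keys = list(grid)
    return [
        {**fixed, **dict(zip(keys, combo))}
        for combo in itertools.product(*(grid[k] for k in keys))
    ]


def substitute_env(value: str, env=None) -> str:
    """Replace ${VAR} placeholders; unknown variables are left untouched."""
    return Template(value).safe_substitute(os.environ if env is None else env)


# Illustrative module entry, written as the dict a YAML parser would produce.
module = {
    "module_type": "bm25",
    "top_k": [3, 5],
    "bm25_tokenizer": ["porter_stemmer", "space"],
}
candidates = expand_module(module)  # 2 x 2 = 4 concrete configurations
```

Each element of `candidates` is one concrete module configuration, so the number of runs per node grows multiplicatively with every list-valued parameter.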
Step 2: Validate Configuration and Data
Optionally run the Validator to verify that the YAML configuration, QA dataset, and corpus dataset are compatible and that all required dependencies are available. The validator runs a minimal version of the full optimization to catch configuration errors, missing modules, or data format issues before committing to a full trial run.
Key considerations:
- Validation runs the full pipeline with minimal data to detect errors early
- Checks that QA dataset retrieval_gt references exist in the corpus
- Verifies that specified modules and LLM models are accessible
- Passage augmenter nodes currently do not support validation
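The ground-truth reference check listed above can be sketched as follows. This is an illustrative stand-in, not the Validator's actual code: it assumes each QA row carries a `qid` and a nested `retrieval_gt` list of document IDs, and that the corpus exposes its document IDs as a plain list. The function name `missing_gt_ids` is hypothetical.

```python
def missing_gt_ids(qa_rows, corpus_doc_ids):
    """Return {qid: [doc ids]} for ground-truth references absent from the corpus.

    Assumes retrieval_gt is a list of lists of document IDs (an assumption
    made for this sketch).
    """
    known = set(corpus_doc_ids)
    missing = {}
    for row in qa_rows:
        absent = [
            doc_id
            for group in row["retrieval_gt"]
            for doc_id in group
            if doc_id not in known
        ]
        if absent:
            missing[row["qid"]] = absent
    return missing


qa_rows = [
    {"qid": "q1", "retrieval_gt": [["d1"], ["d2", "d3"]]},
    {"qid": "q2", "retrieval_gt": [["d9"]]},  # d9 is not in the corpus
]
problems = missing_gt_ids(qa_rows, ["d1", "d2", "d3"])
```

An empty result means every ground-truth ID resolves to a corpus document. Any non-empty entry should be fixed before a full trial, since retrieval metrics for that query could never reach a perfect score.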
Step 3: Initialize Evaluator and Ingest Data
Create an Evaluator instance with paths to the QA and corpus parquet files, plus a project directory for output. The Evaluator loads both datasets, validates their schemas, copies them to the project directory, and prepares the data infrastructure. This includes building the BM25 index for lexical retrieval and ingesting corpus embeddings into the configured vector database(s) for semantic retrieval.
What happens:
- QA and corpus DataFrames are loaded and schema-validated
- BM25 pickle index is built from corpus contents with language-appropriate tokenization
- Corpus embeddings are computed and ingested into vector database collections
- A trial folder is created with an incremented trial number
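The incrementing trial folder can be sketched with the standard library alone. The assumption here, not confirmed by the text above, is that trial folders are named by bare integers starting at 0 inside the project directory. `next_trial_dir` is a hypothetical helper, not an AutoRAG function.

```python
import tempfile
from pathlib import Path


def next_trial_dir(project_dir: Path) -> Path:
    """Create and return the next numbered trial folder inside project_dir."""
    # Only purely numeric directory names count as existing trials.
    numbered = [
        int(p.name)
        for p in project_dir.iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    trial_dir = project_dir / str(max(numbered, default=-1) + 1)
    trial_dir.mkdir()
    return trial_dir


# Demonstrate on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    created = [next_trial_dir(project).name for _ in range(3)]
```

Because each trial gets a fresh folder, earlier results stay untouched and can be compared with later runs.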
Step 4: Execute Node Line Evaluation
Run the optimization trial by iterating through each node line in the YAML config. For each node line, the system processes nodes sequentially. Within each node, all configured modules and their parameter combinations are evaluated against the QA dataset. Retrieval nodes compute retrieval metrics (F1, Recall, Precision, NDCG, MAP, MRR) by comparing retrieved document IDs against ground truth. Generation nodes compute generation metrics (BLEU, METEOR, ROUGE, Semantic Score, G-Eval, BERTScore) by comparing generated text against ground truth answers.
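To make the retrieval-side comparison concrete, the following minimal sketch computes precision, recall, F1, and reciprocal rank for a single query. It simplifies the ground truth to a flat set of relevant document IDs, which is an assumption of this example, and `retrieval_scores` is a hypothetical name rather than an AutoRAG function.

```python
def retrieval_scores(retrieved, relevant):
    """Score one query's ranked list of retrieved IDs against ground-truth IDs."""
    relevant = set(relevant)
    hits = {doc_id for doc_id in retrieved if doc_id in relevant}
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Reciprocal rank: 1 / position of the first relevant document, else 0.
    reciprocal_rank = next(
        (1.0 / rank for rank, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
        0.0,
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


scores = retrieval_scores(["d4", "d1", "d7", "d2"], {"d1", "d2", "d3"})
```

Averaging such per-query scores over the QA dataset gives the per-module metric values that the selection step compares. MRR is the mean of the per-query reciprocal ranks.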
Key considerations:
- Nodes are evaluated with a best-result-forward strategy: the best upstream module's output feeds downstream nodes
- Query expansion uses indirect evaluation: expanded queries are evaluated via downstream retrieval performance
- Each module run is saved as a parquet file for later analysis
- The strategy section determines how the listed metrics are combined to select the winner: mean, reciprocal rank (`rank`), or normalized mean (`normalize_mean`)
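The selection logic and the best-result-forward flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not AutoRAG's implementation: the strategy labels `mean`, `rank` (reciprocal rank), and `normalize_mean` are used as plain strings, each module is modelled as a function returning `(output, metrics)`, and both `select_best` and `run_node_line` are hypothetical helpers.

```python
def select_best(results, strategy="mean"):
    """Pick the winning module from {module: {metric: score}}."""
    modules = list(results)
    metrics = list(next(iter(results.values())))
    totals = dict.fromkeys(modules, 0.0)
    for metric in metrics:
        column = {m: results[m][metric] for m in modules}
        if strategy == "mean":
            contribution = column
        elif strategy == "rank":
            # Best score gets 1, second gets 1/2, third 1/3, and so on.
            ordered = sorted(modules, key=column.get, reverse=True)
            contribution = {m: 1.0 / (i + 1) for i, m in enumerate(ordered)}
        elif strategy == "normalize_mean":
            # Min-max scale each metric across modules before averaging.
            lo, hi = min(column.values()), max(column.values())
            span = hi - lo
            contribution = {m: (v - lo) / span if span else 0.0 for m, v in column.items()}
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for m in modules:
            totals[m] += contribution[m] / len(metrics)
    return max(modules, key=totals.get)


def run_node_line(nodes, initial_input, strategy="mean"):
    """Evaluate nodes in order, feeding each winner's output to the next node."""
    current, winners = initial_input, {}
    for node_name, modules in nodes.items():
        outputs = {name: fn(current) for name, fn in modules.items()}
        best = select_best({n: metrics for n, (_, metrics) in outputs.items()}, strategy)
        winners[node_name] = best
        current = outputs[best][0]  # only the winner's output moves downstream
    return winners, current


results = {
    "bm25": {"f1": 0.50, "recall": 0.90},
    "vectordb": {"f1": 0.60, "recall": 0.70},
    "hybrid": {"f1": 0.62, "recall": 0.75},
}
by_mean = select_best(results, "mean")
by_rank = select_best(results, "rank")
by_norm = select_best(results, "normalize_mean")
```

With these illustrative numbers the plain mean favors `bm25` because of its high recall, while the rank-based and normalized strategies favor `hybrid`. The choice of strategy can therefore change which module's output is forwarded.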
Step 5: Select Best Modules and Generate Summary
After all node evaluations complete, a summary.csv file is generated in the trial folder listing the best module, its parameters, and performance metrics for each node. The system also generates a best.yaml configuration file representing the optimal end-to-end pipeline. This summary enables comparison across trials and serves as input for pipeline extraction.
Key considerations:
- The summary includes execution time per module for speed-aware selection
- Multiple trials can be run with different YAML configs to explore different search spaces
- Trial metadata (timestamps, config hash) is recorded in trial.json
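Speed-aware selection from recorded execution times can be sketched as below. The CSV layout and column names (`node_type`, `module_name`, `score`, `execution_time`) are hypothetical and chosen only for this example, so the real summary files may differ. `best_per_node` is likewise an illustrative helper.

```python
import csv
import io

# Hypothetical per-module results; times are in seconds per query.
SUMMARY_LIKE_CSV = """node_type,module_name,score,execution_time
passage_reranker,reranker_a,0.80,2.5
passage_reranker,reranker_b,0.75,0.4
generator,llm_a,0.62,0.9
"""


def best_per_node(csv_text, speed_threshold=None):
    """Return {node_type: best module}, skipping modules slower than the threshold."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        seconds = float(row["execution_time"])
        if speed_threshold is not None and seconds > speed_threshold:
            continue  # too slow to be eligible
        node = row["node_type"]
        if node not in best or score > best[node][1]:
            best[node] = (row["module_name"], score)
    return {node: name for node, (name, _) in best.items()}


unconstrained = best_per_node(SUMMARY_LIKE_CSV)
under_one_second = best_per_node(SUMMARY_LIKE_CSV, speed_threshold=1.0)
```

In this toy data the slower reranker wins on score alone, but a one-second threshold removes it from consideration and the faster module is selected instead.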
Step 6: Review Results via Dashboard
Launch the interactive Panel-based dashboard to visualize trial results. The dashboard displays per-node metric comparisons and module rankings, and lets you drill down into individual query results to understand where specific modules excel or fail.
Key considerations:
- The dashboard runs on a configurable port (default 7690)
- It reads results directly from the trial folder structure
- Comparison across trials is supported