Workflow:Vibrantlabsai Ragas Prompt Optimization

Knowledge Sources	Ragas, Ragas Docs Optimizer Guide
Domains	LLM_Ops, Prompt_Engineering, Optimization
Last Updated	2026-02-12 10:00 GMT

Overview

End-to-end process for systematically optimizing evaluation metric prompts using genetic algorithms or DSPy MIPROv2 to improve metric accuracy and alignment with human judgments.

Description

This workflow covers data-driven prompt optimization for Ragas evaluation metrics. When metric prompts underperform on specific domains or produce inconsistent judgments, Ragas provides two optimization approaches: a GeneticOptimizer that uses evolutionary algorithms (crossover, mutation, reverse engineering) for fast, lightweight optimization, and a DSPyOptimizer that leverages DSPy's MIPROv2 algorithm for comprehensive instruction and demonstration optimization. Both approaches require annotated datasets with ground truth scores to guide the optimization process.

Key outputs:

Optimized metric prompts with improved accuracy
Saved prompt files that can be loaded and shared
Measurable improvement metrics (e.g., +8-12% on Faithfulness)

Usage

Execute this workflow when built-in metric prompts produce inaccurate or inconsistent evaluations for your specific domain. This is appropriate when you have human-annotated evaluation data showing where the metric disagrees with expert judgment and want to systematically improve the metric's prompts rather than manually tuning them.

Execution Steps

Step 1: Prepare_Annotated_Dataset

Create an annotated dataset containing ground truth evaluation scores. Each annotation includes the metric input (e.g., question, response, context), the metric's current output, and optionally an edited (corrected) output from a human expert. The annotations are structured as SampleAnnotation objects, each containing one PromptAnnotation per prompt in the metric.

Key considerations:

GeneticOptimizer needs at least 10 annotations
DSPyOptimizer performs best with 50+ quality annotations
Annotations should cover diverse examples including failure cases
Use the is_accepted flag to mark human-validated annotations
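
The annotation shape described above can be sketched with plain dataclasses. This is a minimal illustration, not the Ragas classes themselves: the `...Sketch` class names, their field names, and the helpers `usable_annotations` and `ground_truth` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PromptAnnotationSketch:
    """Hypothetical stand-in for the annotation of one prompt in a metric."""
    prompt_input: Dict[str, Any]
    prompt_output: Dict[str, Any]                   # what the metric produced
    edited_output: Optional[Dict[str, Any]] = None  # expert correction, if any


@dataclass
class SampleAnnotationSketch:
    """Hypothetical stand-in for one annotated evaluation sample."""
    metric_input: Dict[str, Any]
    metric_output: float
    prompts: Dict[str, PromptAnnotationSketch]  # keyed by prompt name
    is_accepted: bool = False                   # True once a human has validated it


def usable_annotations(samples: List[SampleAnnotationSketch]) -> List[SampleAnnotationSketch]:
    """Keep only the samples a human has validated."""
    return [s for s in samples if s.is_accepted]


def ground_truth(annotation: PromptAnnotationSketch) -> Dict[str, Any]:
    """Prefer the expert's edited output; fall back to the metric's own output."""
    if annotation.edited_output is not None:
        return annotation.edited_output
    return annotation.prompt_output


samples = [
    SampleAnnotationSketch(
        metric_input={"question": "Who wrote Hamlet?", "response": "Marlowe"},
        metric_output=1.0,
        prompts={"judge": PromptAnnotationSketch(
            prompt_input={"response": "Marlowe"},
            prompt_output={"verdict": 1},
            edited_output={"verdict": 0},  # the expert disagreed with the metric
        )},
        is_accepted=True,
    ),
    SampleAnnotationSketch(
        metric_input={"question": "Capital of France?", "response": "Paris"},
        metric_output=1.0,
        prompts={"judge": PromptAnnotationSketch(
            prompt_input={"response": "Paris"},
            prompt_output={"verdict": 1},
        )},
        is_accepted=False,  # not yet reviewed, so it is filtered out
    ),
]

accepted = usable_annotations(samples)
corrected = ground_truth(accepted[0].prompts["judge"])
```

The first sample is a failure case (the expert overrode the metric), which is the kind of example the optimizer can learn the most from.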

Step 2: Select_Optimizer

Choose between GeneticOptimizer and DSPyOptimizer based on data availability, budget, and quality requirements. GeneticOptimizer is faster and cheaper, requiring fewer annotations and LLM calls, but only optimizes instructions. DSPyOptimizer is more thorough, optimizing both instructions and few-shot demonstrations, but requires more data and compute.

Selection criteria:

GeneticOptimizer: small datasets (at least 10 but fewer than 20 examples), budget-constrained, instruction-only optimization
DSPyOptimizer: 50+ examples, quality-critical, instruction + demonstration optimization
Typical cost: Genetic uses tens of LLM calls; DSPy uses hundreds
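
The selection criteria can be written down as a small decision helper. This is only a sketch: `suggest_optimizer` is a hypothetical function, and its handling of datasets with 20 to 49 examples (a range the criteria above do not cover) is an assumption that defaults to the cheaper option.

```python
def suggest_optimizer(num_annotations: int, budget_constrained: bool = False) -> str:
    """Return the name of the optimizer the criteria above point to."""
    if num_annotations < 0:
        raise ValueError("num_annotations must be non-negative")
    if num_annotations < 10:
        # GeneticOptimizer needs at least 10 annotations, so neither option fits yet.
        return "collect more annotations"
    if num_annotations >= 50 and not budget_constrained:
        # Enough data for instruction + demonstration optimization.
        return "DSPyOptimizer"
    # Assumption: with 20 to 49 examples, or a tight budget, fall back to the
    # cheaper instruction-only optimizer.
    return "GeneticOptimizer"


choice_small = suggest_optimizer(15)
choice_large = suggest_optimizer(80)
choice_large_cheap = suggest_optimizer(80, budget_constrained=True)
```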

Step 3: Configure_Optimization

Set the optimizer parameters. For GeneticOptimizer, configure population size, number of generations, and mutation rates. For DSPyOptimizer, configure num_candidates, max_bootstrapped_demos, max_labeled_demos, auto mode (light/medium/heavy), and temperature. Define a loss function that measures the gap between predicted and ground truth scores.

Key parameters:

DSPy num_candidates: Number of prompt variants to explore (default 10)
DSPy max_bootstrapped_demos: Auto-generated few-shot examples (default 5)
DSPy auto mode: Controls optimization thoroughness
Loss function: Measures alignment between predicted and ground truth scores
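
To make "measures alignment between predicted and ground truth scores" concrete, here are two plain-Python loss sketches: mean squared error for continuous scores and a mismatch rate for binary verdicts. The function names, the `DSPY_SETTINGS` dictionary, and `check_settings` are illustrative assumptions rather than library objects. The two numeric defaults are the ones listed above, and the `auto` value is chosen arbitrarily for the example.

```python
from typing import Dict, Sequence


def mse_loss(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared gap between predicted and ground truth scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def mismatch_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of binary verdicts that disagree with the ground truth."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p != a for p, a in zip(predicted, actual)) / len(predicted)


AUTO_MODES = ("light", "medium", "heavy")

# Hypothetical settings dictionary mirroring the parameters named above.
DSPY_SETTINGS: Dict[str, object] = {
    "num_candidates": 10,         # default stated above
    "max_bootstrapped_demos": 5,  # default stated above
    "auto": "light",              # assumption: picked for illustration only
}


def check_settings(settings: Dict[str, object]) -> Dict[str, object]:
    """Reject obviously invalid values before starting an expensive run."""
    if settings["auto"] not in AUTO_MODES:
        raise ValueError(f"auto must be one of {AUTO_MODES}")
    if int(settings["num_candidates"]) < 1:
        raise ValueError("num_candidates must be at least 1")
    if int(settings["max_bootstrapped_demos"]) < 0:
        raise ValueError("max_bootstrapped_demos must be non-negative")
    return settings


score_gap = mse_loss([1.0, 0.0], [1.0, 1.0])
verdict_gap = mismatch_rate([1, 0, 1, 1], [1, 1, 1, 0])
validated = check_settings(DSPY_SETTINGS)
```

A lower value means the metric agrees more closely with the annotations, so the optimizer's goal is to minimize whichever loss is chosen.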

Step 4: Run_Optimization

Execute the optimization by calling metric.optimize_prompts() with the annotated dataset and optimizer configuration. The optimizer explores the prompt space, evaluates candidates against the annotated data, and selects the best-performing prompts.

What happens:

GeneticOptimizer: Reverse-engineers instructions from examples, applies crossover and mutation, evaluates fitness against annotations across generations
DSPyOptimizer: Generates candidate instructions, bootstraps demonstrations from examples, searches over instruction-demonstration combinations using MIPROv2
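
The generational loop behind the genetic approach can be illustrated with a toy example. This sketch is not the GeneticOptimizer implementation: it skips the reverse-engineering step and makes no LLM calls. Candidates are tuples of instruction phrases, and `toy_fitness` is a hypothetical stand-in for "score this prompt against the annotations".

```python
import random

PHRASES = ["cite the context", "answer yes or no", "be concise",
           "explain step by step", "ignore writing style"]
HELPFUL = {"cite the context", "explain step by step"}  # toy assumption


def toy_fitness(candidate):
    """Stand-in for evaluating a prompt against annotated data (higher is better)."""
    return len(HELPFUL & set(candidate))


def crossover(a, b, rng):
    """Splice the head of one parent onto the tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(candidate, rng, rate=0.3):
    """Occasionally swap one phrase for a random one from the pool."""
    if rng.random() < rate:
        i = rng.randrange(len(candidate))
        candidate = candidate[:i] + (rng.choice(PHRASES),) + candidate[i + 1:]
    return candidate


def evolve(population, fitness, generations, rng):
    """Keep the fitter half each generation and refill with mutated offspring."""
    population = list(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)


rng = random.Random(0)  # seeded so the run is repeatable
initial = [tuple(rng.choice(PHRASES) for _ in range(3)) for _ in range(6)]
best = evolve(initial, toy_fitness, generations=5, rng=rng)
```

Because the fitter half always survives, the best fitness never decreases from one generation to the next. In a real run each fitness evaluation costs LLM calls, which is why population size and generation count drive the budget.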

Step 5: Deploy_Optimized_Prompts

Save the optimized prompts and integrate them back into the evaluation pipeline. Optimized prompts can be saved to JSON files for version control and sharing. The metric instance with optimized prompts can be used directly in evaluate() or @experiment() calls.

Key considerations:

Save prompts using metric.save_prompts() for reproducibility
Load optimized prompts using metric.load_prompts() in production
Re-validate on a held-out test set to confirm improvements generalize
Track prompt versions alongside code and model versions
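
The save, load, and re-validate cycle can be sketched with the standard library alone. The helpers below are hypothetical stand-ins for illustration and do not reproduce the file layout used by metric.save_prompts() or metric.load_prompts(). The prompt text and loss values are made-up example data.

```python
import json
import tempfile
from pathlib import Path


def save_prompts_sketch(prompts: dict, directory: Path, version: str) -> Path:
    """Write prompts to a versioned JSON file that can be committed with the code."""
    path = directory / f"prompts_{version}.json"
    payload = {"version": version, "prompts": prompts}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_prompts_sketch(path: Path) -> dict:
    """Read back the prompts saved by save_prompts_sketch."""
    return json.loads(path.read_text(encoding="utf-8"))["prompts"]


def generalizes(baseline_loss: float, optimized_loss: float, min_gain: float = 0.0) -> bool:
    """True if the optimized prompts beat the baseline on held-out data by more than min_gain."""
    return (baseline_loss - optimized_loss) > min_gain


optimized = {"judge": {"instruction": "Cite the context before giving a verdict."}}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_prompts_sketch(optimized, Path(tmp), version="v2")
    saved_name = saved_path.name
    restored = load_prompts_sketch(saved_path)

# Illustrative held-out losses; in practice these come from re-running the metric.
keep_new_prompts = generalizes(baseline_loss=0.30, optimized_loss=0.22, min_gain=0.02)
```

Putting the version in the file name keeps each prompt revision traceable next to the code and model versions it was validated with.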
