Workflow:Vibrantlabsai Ragas Prompt Optimization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Prompt_Engineering, Optimization |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for systematically optimizing evaluation metric prompts using genetic algorithms or DSPy MIPROv2 to improve metric accuracy and alignment with human judgments.
Description
This workflow covers data-driven prompt optimization for Ragas evaluation metrics. When metric prompts underperform on specific domains or produce inconsistent judgments, Ragas provides two optimization approaches: a GeneticOptimizer that uses evolutionary algorithms (crossover, mutation, reverse engineering) for fast, lightweight optimization, and a DSPyOptimizer that leverages DSPy's MIPROv2 algorithm for comprehensive instruction and demonstration optimization. Both approaches require annotated datasets with ground truth scores to guide the optimization process.
Key outputs:
- Optimized metric prompts with improved accuracy
- Saved prompt files that can be loaded and shared
- Measurable improvement metrics (e.g., +8-12% on Faithfulness)
Usage
Execute this workflow when built-in metric prompts produce inaccurate or inconsistent evaluations for your specific domain. This is appropriate when you have human-annotated evaluation data showing where the metric disagrees with expert judgment and want to systematically improve the metric's prompts rather than manually tuning them.
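One way to check whether this condition holds is to quantify the disagreement between metric scores and human scores before optimizing. The sketch below is plain Python, not a Ragas API. The function name, the record keys `metric_score` and `human_score`, and the 0.5 tolerance are assumptions chosen for illustration.

```python
from statistics import mean


def disagreement_report(records, tolerance=0.5):
    """Summarize how far metric scores sit from human scores.

    `records` is a list of dicts with the hypothetical keys
    `metric_score` and `human_score` (floats in [0, 1]).
    """
    if not records:
        raise ValueError("need at least one annotated record")
    gaps = [abs(r["metric_score"] - r["human_score"]) for r in records]
    return {
        # Average absolute gap between metric and human judgment.
        "mean_abs_gap": mean(gaps),
        # Share of records where the gap reaches the tolerance.
        "disagreement_rate": sum(g >= tolerance for g in gaps) / len(gaps),
    }


records = [
    {"metric_score": 1.0, "human_score": 1.0},
    {"metric_score": 1.0, "human_score": 0.0},
    {"metric_score": 0.0, "human_score": 0.0},
    {"metric_score": 0.0, "human_score": 1.0},
]
report = disagreement_report(records)
print(report)
```

A high disagreement rate on domain data is the kind of evidence that justifies running the steps that follow. What counts as "high" depends on the project and is not specified here.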
Execution Steps
Step 1: Prepare_Annotated_Dataset
Create an annotated dataset containing ground truth evaluation scores. Each annotation includes the metric input (e.g., question, response, context), the metric's current output, and optionally a corrected output edited by a human expert. The annotations are structured as SampleAnnotation objects, each containing a PromptAnnotation for every prompt in the metric.
Key considerations:
- GeneticOptimizer needs at least 10 annotations
- DSPyOptimizer performs best with 50+ quality annotations
- Annotations should cover diverse examples including failure cases
- Use the is_accepted flag to mark human-validated annotations
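The annotation structure described in this step can be pictured with a minimal stand-in. The dataclasses below are hypothetical mirrors written for illustration, not the Ragas classes themselves. The field names (`prompt_input`, `prompt_output`, `edited_output`, `metric_input`, `metric_output`, `prompts`) and the prompt key `statement_check` are assumptions, and the real SampleAnnotation and PromptAnnotation may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PromptAnnotationSketch:
    """Hypothetical stand-in for one prompt's annotation."""
    prompt_input: Dict[str, Any]
    prompt_output: Dict[str, Any]
    edited_output: Optional[Dict[str, Any]] = None  # human correction, if any


@dataclass
class SampleAnnotationSketch:
    """Hypothetical stand-in for one annotated metric evaluation."""
    metric_input: Dict[str, Any]
    metric_output: float
    prompts: Dict[str, PromptAnnotationSketch] = field(default_factory=dict)
    is_accepted: bool = False  # True once a human has validated the sample


def usable_annotations(
    samples: List[SampleAnnotationSketch],
) -> List[SampleAnnotationSketch]:
    """Keep only the human-validated samples."""
    return [s for s in samples if s.is_accepted]


accepted = SampleAnnotationSketch(
    metric_input={
        "question": "What is the capital of France?",
        "response": "Paris is the capital of France.",
        "context": "Paris is the capital and largest city of France.",
    },
    metric_output=0.0,
    prompts={
        "statement_check": PromptAnnotationSketch(
            prompt_input={"statement": "Paris is the capital of France."},
            prompt_output={"verdict": 0},
            edited_output={"verdict": 1},  # expert corrected the verdict
        )
    },
    is_accepted=True,
)
rejected = SampleAnnotationSketch(metric_input={}, metric_output=1.0)
validated = usable_annotations([accepted, rejected])
```

Filtering on the acceptance flag before optimization keeps unreviewed samples from steering the search.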
Step 2: Select_Optimizer
Choose between GeneticOptimizer and DSPyOptimizer based on data availability, budget, and quality requirements. GeneticOptimizer is faster and cheaper, requiring fewer annotations and LLM calls, but only optimizes instructions. DSPyOptimizer is more thorough, optimizing both instructions and few-shot demonstrations, but requires more data and compute.
Selection criteria:
- GeneticOptimizer: small datasets (fewer than 20 examples, subject to the 10-annotation minimum from Step 1), budget-constrained, instruction-only optimization
- DSPyOptimizer: 50+ examples, quality-critical, instruction + demonstration optimization
- Typical cost: GeneticOptimizer uses tens of LLM calls; DSPyOptimizer uses hundreds
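The selection criteria can be expressed as a small decision helper. This is a sketch only: the function `select_optimizer` is hypothetical, and the criteria say nothing about datasets of 20 to 49 examples, so defaulting that range to the cheaper optimizer is an assumption.

```python
def select_optimizer(n_annotations: int, budget_constrained: bool = False) -> str:
    """Return the name of the optimizer suggested by the criteria above."""
    if n_annotations < 10:
        # Below the minimum stated in Step 1 for GeneticOptimizer.
        raise ValueError("fewer than 10 annotations: collect more data first")
    if n_annotations >= 50 and not budget_constrained:
        # Enough data and budget for instruction + demonstration optimization.
        return "DSPyOptimizer"
    # Assumption: the unspecified 20-49 range and any budget-constrained
    # case fall back to the lighter, instruction-only optimizer.
    return "GeneticOptimizer"


print(select_optimizer(15))
print(select_optimizer(80))
print(select_optimizer(80, budget_constrained=True))
```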
Step 3: Configure_Optimization
Set the optimizer parameters. For GeneticOptimizer, configure population size, number of generations, and mutation rates. For DSPyOptimizer, configure num_candidates, max_bootstrapped_demos, max_labeled_demos, auto mode (light/medium/heavy), and temperature. Define a loss function that measures the gap between predicted and ground truth scores.
Key parameters:
- DSPy num_candidates: Number of prompt variants to explore (default 10)
- DSPy max_bootstrapped_demos: Auto-generated few-shot examples (default 5)
- DSPy auto mode: Controls optimization thoroughness
- Loss function: Measures alignment between predicted and ground truth scores
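To make the parameters and the loss concrete, the sketch below groups the DSPy settings in a validated container and uses mean squared error as one possible loss. Both `DSPyConfigSketch` and `mse_loss` are hypothetical names, not Ragas or DSPy objects. Only the two defaults stated above are filled in; `auto` is left as a required field, and the other parameters are omitted because the text gives no default values for them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DSPyConfigSketch:
    """Hypothetical container for the DSPy parameters listed above."""
    auto: str                        # "light", "medium", or "heavy"
    num_candidates: int = 10         # default stated in the text
    max_bootstrapped_demos: int = 5  # default stated in the text

    def __post_init__(self):
        if self.auto not in {"light", "medium", "heavy"}:
            raise ValueError(f"unknown auto mode: {self.auto!r}")
        if self.num_candidates < 1 or self.max_bootstrapped_demos < 0:
            raise ValueError("candidate and demo counts must be non-negative")


def mse_loss(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared gap between predicted and ground truth scores."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


config = DSPyConfigSketch(auto="light")
loss = mse_loss([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(config, loss)
```

Squared error penalizes large gaps more heavily than small ones. For binary verdicts, a simple mismatch rate may be a better fit.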
Step 4: Run_Optimization
Execute the optimization by calling metric.optimize_prompts() with the annotated dataset and optimizer configuration. The optimizer explores the prompt space, evaluates candidates against the annotated data, and selects the best-performing prompts.
What happens:
- GeneticOptimizer: Reverse-engineers instructions from examples, applies crossover and mutation, evaluates fitness against annotations across generations
- DSPyOptimizer: Generates candidate instructions, bootstraps demonstrations from examples, searches over instruction-demonstration combinations using MIPROv2
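The evolutionary loop can be illustrated with a toy that runs without any LLM. This is not the Ragas implementation. It assumes a candidate instruction is a subset of a fixed phrase list, replaces the LLM call with a hypothetical `stub_predict`, and omits the reverse-engineering step. All names and data are invented for illustration.

```python
import random

PHRASES = (
    "Check every claim against the context.",
    "Ignore stylistic differences.",
    "Treat missing evidence as unsupported.",
    "Answer with a score of 0 or 1.",
)

# Toy annotations: each sample is judged correctly only when the
# instruction includes the phrase at index `needs`.
ANNOTATIONS = [
    {"needs": 0, "target": 1.0},
    {"needs": 0, "target": 0.0},
    {"needs": 2, "target": 0.0},
    {"needs": 3, "target": 1.0},
]


def stub_predict(mask, sample):
    """Hypothetical stand-in for an LLM call driven by the instruction."""
    return sample["target"] if mask[sample["needs"]] else 1.0 - sample["target"]


def fitness(mask, annotations):
    """One minus the mean absolute error against ground truth."""
    errors = [abs(stub_predict(mask, s) - s["target"]) for s in annotations]
    return 1.0 - sum(errors) / len(errors)


def crossover(a, b, rng):
    """Single-point crossover of two phrase masks."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(mask, rng, rate=0.2):
    """Flip each phrase on or off with probability `rate`."""
    return tuple((not bit) if rng.random() < rate else bit for bit in mask)


def evolve(annotations, population_size=6, generations=5, seed=0):
    rng = random.Random(seed)
    population = [
        tuple(rng.random() < 0.5 for _ in PHRASES) for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, annotations), reverse=True)
        history.append(fitness(population[0], annotations))
        parents = population[: population_size // 2]  # the top half survives
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    best = max(population, key=lambda m: fitness(m, annotations))
    history.append(fitness(best, annotations))
    return best, history


best, history = evolve(ANNOTATIONS)
instruction = " ".join(p for p, on in zip(PHRASES, best) if on)
print(history)
print(instruction)
```

Because the top half of each generation survives unchanged, the best fitness never decreases from one generation to the next. In a real run, every fitness evaluation costs LLM calls, which is why population size and generation count drive the budget.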
Step 5: Deploy_Optimized_Prompts
Save the optimized prompts and integrate them back into the evaluation pipeline. Optimized prompts can be saved to JSON files for version control and sharing. The metric instance with optimized prompts can be used directly in evaluate() or @experiment() calls.
Key considerations:
- Save prompts using metric.save_prompts() for reproducibility
- Load optimized prompts using metric.load_prompts() in production
- Re-validate on a held-out test set to confirm improvements generalize
- Track prompt versions alongside code and model versions
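The save, load, and re-validate cycle can be sketched with the standard library alone. The helpers below are hypothetical and are not the metric.save_prompts() or metric.load_prompts() methods. The file layout, the `version` field, the prompt key, and the error values are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_prompt_bundle(path, prompts, version):
    """Write prompts plus a version tag to a JSON file."""
    payload = {"version": version, "prompts": prompts}
    Path(path).write_text(
        json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8"
    )


def load_prompt_bundle(path):
    """Read a bundle written by save_prompt_bundle."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def generalizes(baseline_errors, optimized_errors):
    """True when the mean held-out error dropped after optimization."""
    baseline = sum(baseline_errors) / len(baseline_errors)
    optimized = sum(optimized_errors) / len(optimized_errors)
    return optimized < baseline


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "faithfulness_prompts.json"
    save_prompt_bundle(
        target,
        {"statement_check": "Check every claim against the context."},
        version="v2",
    )
    restored = load_prompt_bundle(target)

# Illustrative per-sample errors on a held-out set, not real results.
keep_new_prompts = generalizes([0.4, 0.6], [0.2, 0.3])
print(restored["version"], keep_new_prompts)
```

Sorting keys and using a fixed indent keeps the JSON stable across saves, so diffs in version control show only actual prompt changes.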