Workflow:Explodinggradients Ragas Metric Prompt Optimization

From Leeroopedia


Knowledge Sources
Domains: LLMs, Evaluation, Prompt_Optimization, Meta_Learning
Last Updated: 2026-02-10 06:00 GMT

Overview

An end-to-end process for optimizing the internal prompts of Ragas evaluation metrics, using genetic algorithms or DSPy-based optimization, so that metric scores align more closely with human judgments.

Description

This workflow covers the systematic optimization of evaluation metric prompts to improve their alignment with human expert judgments. Ragas evaluation metrics (such as Faithfulness, AspectCritic, and FactualCorrectness) rely on LLM prompts to produce scores. These prompts can be optimized with two strategies: the GeneticOptimizer (an evolutionary algorithm that mutates and selects prompt instructions) and the DSPyOptimizer (MIPROv2-based optimization via the DSPy framework). Both approaches use a human-annotated dataset as ground truth, compute a loss between metric predictions and human labels, and iteratively revise the prompts to reduce this loss.

Usage

Execute this workflow when your Ragas evaluation metrics produce scores that do not align well enough with human expert judgments, or when you are adapting metrics to a new domain where the default prompts underperform. You need a dataset with human-annotated evaluation labels (a minimum of roughly 20 to 50 samples) and a target metric whose prompts you want to improve. This is an advanced workflow, typically used once basic evaluation is already in place.

Execution Steps

Step 1: Collect Human-Annotated Data

Gather a dataset where human experts have evaluated LLM outputs on the criteria your metric measures. Each sample should include the input data (question, context, response) and the human judgment (pass/fail, score, or ranking). This annotated dataset serves as the ground truth for optimization.

Key considerations:

  • Aim for a minimum of roughly 20 to 50 annotated samples for reliable optimization
  • Human labels should cover the full range of quality levels
  • Use the EvaluationDataset schema for structured annotation
  • Inter-annotator agreement should be verified for quality
  • Annotated data can be loaded from JSON files or created programmatically
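
The loading and coverage checks above can be sketched in plain Python. This is a minimal illustration, not the Ragas EvaluationDataset API: the field names (`user_input`, `retrieved_contexts`, `response`, `human_label`) and both helper functions are assumptions chosen for the example, so adapt them to your own annotation schema.

```python
import json
from collections import Counter

# Assumed field names for this sketch; they are not mandated by the workflow.
REQUIRED_FIELDS = ("user_input", "retrieved_contexts", "response", "human_label")


def load_annotations(raw_json: str) -> list:
    """Parse a JSON array of annotated samples and reject incomplete records."""
    samples = json.loads(raw_json)
    for i, sample in enumerate(samples):
        missing = [f for f in REQUIRED_FIELDS if f not in sample]
        if missing:
            raise ValueError(f"sample {i} is missing fields: {missing}")
    return samples


def label_distribution(samples: list) -> Counter:
    """Count each human label so gaps in quality-level coverage are visible."""
    return Counter(s["human_label"] for s in samples)


# Two hand-written samples stand in for a real annotation file.
raw = json.dumps([
    {"user_input": "Who wrote Hamlet?", "retrieved_contexts": ["Hamlet is a play by Shakespeare."],
     "response": "Shakespeare wrote Hamlet.", "human_label": "pass"},
    {"user_input": "Who wrote Hamlet?", "retrieved_contexts": ["Hamlet is a play by Shakespeare."],
     "response": "Marlowe wrote Hamlet.", "human_label": "fail"},
])
samples = load_annotations(raw)
print(label_distribution(samples))
```

A heavily skewed label distribution (for example, almost all "pass") is a sign that the dataset does not yet cover the full range of quality levels.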

Step 2: Establish Baseline Metric Performance

Run the target metric with its default prompts on the annotated dataset. Compare metric predictions against human labels using an agreement statistic: Cohen's Kappa for discrete labels, Pearson correlation for numeric scores. This establishes the baseline alignment that optimization is meant to improve upon.

Key considerations:

  • DiscreteMetric.get_correlation() computes Cohen's Kappa
  • NumericMetric.get_correlation() computes Pearson correlation
  • Record baseline scores for comparison after optimization
  • Identify specific samples where the metric disagrees with humans
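
To make the two agreement statistics concrete, here is a self-contained sketch that computes them directly from their textbook definitions. It does not call `get_correlation()` and is not the Ragas implementation; the function names and the toy label lists are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(a, b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label; kappa is undefined there,
        # and this sketch reports full agreement instead.
        return 1.0
    return (observed - expected) / (1 - expected)


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / sqrt(var_x * var_y)


human = ["pass", "pass", "fail", "fail"]
metric = ["pass", "fail", "fail", "fail"]

print(cohens_kappa(human, metric))  # 0.5

# Indices where the metric disagrees with the human label, for manual review.
disagreements = [i for i, (h, m) in enumerate(zip(human, metric)) if h != m]
print(disagreements)  # [1]
```

Keeping the list of disagreeing samples alongside the baseline score makes it easier to see later whether optimization fixed those specific cases.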

Step 3: Configure the Optimizer

Choose and configure an optimization strategy. The GeneticOptimizer uses evolutionary algorithms to evolve prompt instructions through mutation and crossover. The DSPyOptimizer uses MIPROv2 to optimize prompts via few-shot example selection and instruction tuning. Both require an LLM for prompt generation and a loss function for fitness evaluation.

Pseudocode:

  1. For Genetic: configure population size, generations, mutation rate
  2. For DSPy: configure training samples, evaluation function
  3. Both need: LLM instance, loss function, metric to optimize

Key considerations:

  • GeneticOptimizer works well for instruction tuning without few-shot examples
  • DSPyOptimizer provides more sophisticated optimization but requires the dspy dependency
  • Loss functions available: BinaryMetricLoss (for discrete), DistanceLoss (for numeric)
  • The optimizer modifies the metric's internal prompts, not the user's application prompts
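
As a rough illustration of what a loss over discrete labels and a loss over numeric scores might compute, the sketch below defines two stand-in functions and a hypothetical settings container. None of these names come from Ragas: `mismatch_rate`, `mean_absolute_distance`, and `OptimizerSettings` are assumptions, and the library's own loss classes may define or orient their scores differently.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def mismatch_rate(predicted: Sequence, actual: Sequence) -> float:
    """Discrete stand-in loss: fraction of samples where labels differ."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_distance(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Numeric stand-in loss: average absolute gap between scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


@dataclass
class OptimizerSettings:
    """Hypothetical container for the choices listed in this step."""
    strategy: str                       # "genetic" or "dspy"
    loss: Callable[[Sequence, Sequence], float]
    population_size: int = 8            # genetic only
    generations: int = 5                # genetic only
    mutation_rate: float = 0.3          # genetic only


settings = OptimizerSettings(strategy="genetic", loss=mismatch_rate)
print(settings.loss(["pass", "fail"], ["pass", "pass"]))  # 0.5
```

In this sketch lower values are better for both functions, which keeps the two strategies comparable when they share one dataset.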

Step 4: Run the Optimization

Execute the optimization process. The optimizer iterates through multiple rounds: it generates candidate prompts, evaluates them against the annotated dataset, computes fitness via the loss function, and selects the best-performing variants. Over successive rounds the search is intended to move toward prompts that align better with human judgments, although improvement is not guaranteed for every dataset.

Key considerations:

  • GeneticOptimizer uses LLM-based crossover and mutation operators
  • DSPyOptimizer uses MIPROv2 with bootstrapped demonstrations
  • Optimization progress can be monitored via loss values per generation
  • The process is computationally intensive (many LLM calls)
  • Use caching to avoid redundant LLM calls during optimization
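
The generate, evaluate, and select loop, together with caching, can be illustrated with a toy evolutionary search. Everything here is a stand-in: `fake_judge` replaces the LLM call with a deterministic rule, an "instruction" is just a set of clauses, and the mutation and crossover operators are simple set operations rather than the LLM-based operators used by the GeneticOptimizer.

```python
import random
from functools import lru_cache

CLAUSES = ("check numbers", "check names", "check dates")

# Each toy sample: (error type present in the response or None, human label).
# Human label 1 means pass, 0 means fail.
SAMPLES = (("numbers", 0), ("names", 0), ("dates", 0), (None, 1), (None, 1))


@lru_cache(maxsize=None)
def fake_judge(instruction: frozenset, sample_index: int) -> int:
    """Stand-in for an LLM call; cached so repeated candidates cost nothing."""
    error, _ = SAMPLES[sample_index]
    return 0 if error is not None and f"check {error}" in instruction else 1


def loss(instruction: frozenset) -> float:
    """Fraction of samples where the judge disagrees with the human label."""
    wrong = sum(fake_judge(instruction, i) != label
                for i, (_, label) in enumerate(SAMPLES))
    return wrong / len(SAMPLES)


def mutate(instruction: frozenset, rng: random.Random) -> frozenset:
    """Toggle one randomly chosen clause."""
    return instruction ^ {rng.choice(CLAUSES)}


def crossover(a: frozenset, b: frozenset, rng: random.Random) -> frozenset:
    """Keep each clause found in either parent with probability 0.5."""
    return frozenset(c for c in sorted(a | b) if rng.random() < 0.5)


def optimize(generations: int = 10, population_size: int = 6, seed: int = 0):
    rng = random.Random(seed)
    population = [frozenset() for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=loss)
        history.append(loss(population[0]))             # best loss this round
        survivors = population[: population_size // 2]  # elitism keeps the best
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    best = min(population, key=loss)
    history.append(loss(best))
    return best, history


best, history = optimize()
print(sorted(best), history)
print(fake_judge.cache_info())
```

Because the best candidates always survive into the next round, the recorded loss never increases from one generation to the next. The cache statistics show how many judge calls were answered without recomputation, which is the same saving a response cache provides when the judge is a real LLM.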

Step 5: Evaluate and Deploy Optimized Prompts

Compare the optimized metric against the baseline on the annotated dataset. Verify improved alignment with human labels. Save the optimized prompts for reuse and deploy them in your evaluation pipeline. The optimized prompts can be serialized to JSON and loaded in future metric instances.

Key considerations:

  • Use save_prompts() to persist optimized prompts to disk
  • Use load_prompts() to load optimized prompts into metric instances
  • Validate on a held-out test set to ensure generalization
  • The PromptMixin provides standardized save/load/set_prompts interfaces
  • Optimized prompts can be language-adapted for non-English evaluation
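
The hold-out validation and JSON persistence ideas above can be sketched without any Ragas calls. The helpers below are assumptions for illustration: `split_holdout`, `save_prompt_dict`, and `load_prompt_dict` are hypothetical names, and the prompt dictionary layout is invented, so this does not reflect the file format written by `save_prompts()`.

```python
import json
import random
import tempfile
from pathlib import Path


def split_holdout(samples, test_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and reserve a slice for final validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def save_prompt_dict(prompts: dict, path: Path) -> None:
    """Write a name-to-instruction mapping as readable JSON."""
    path.write_text(json.dumps(prompts, indent=2), encoding="utf-8")


def load_prompt_dict(path: Path) -> dict:
    """Read back a mapping written by save_prompt_dict."""
    return json.loads(path.read_text(encoding="utf-8"))


# Optimize on `train`, then report alignment on `test` only.
train, test = split_holdout(range(10))
print(len(train), len(test))  # 7 3

# Hypothetical prompt content, used only to show the round trip.
optimized = {"judge_prompt": "Mark the response as fail if any claim lacks support."}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "optimized_prompts.json"
    save_prompt_dict(optimized, target)
    restored = load_prompt_dict(target)

print(restored == optimized)  # True
```

Splitting before optimization starts, rather than after, keeps the held-out samples from influencing which prompt variant is selected.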
