

Principle:Explodinggradients Ragas Prompt Iteration Comparison

From Leeroopedia
Revision as of 17:25, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Explodinggradients_Ragas_Prompt_Iteration_Comparison.md)


Knowledge Sources: examples/iterate_prompt/evals.py, examples/iterate_prompt/run_prompt.py
Domains: Prompt Engineering, A/B Testing, Experiment Comparison
Last Updated: 2026-02-10

Overview

Description

The Prompt Iteration Comparison principle establishes a systematic approach to A/B comparison of prompt versions using side-by-side experiment results. Rather than manually inspecting outputs or relying on subjective judgment, this principle requires that prompt improvement be driven by controlled experiments where identical datasets are processed through different prompt versions and the resulting scores are compared quantitatively. Alignment by a shared sample ID enables per-example analysis, revealing not just aggregate improvements but also which specific cases improved or regressed.

Usage

When iterating on a prompt, a practitioner:

  1. Defines a baseline prompt and runs it through the Ragas experiment framework against a fixed evaluation dataset
  2. Analyzes the baseline results to identify failure patterns and areas for improvement
  3. Modifies the prompt (adjusting instructions, examples, output format, etc.)
  4. Runs the modified prompt through the same experiment framework against the same dataset
  5. Uses the comparison function to merge the two experiment result CSVs by sample ID, producing a combined view with per-experiment response and score columns
  6. Examines aggregate accuracy changes and per-example score differences to determine whether the prompt change is an improvement

This workflow can be repeated iteratively, comparing each new version against the previous best, building a chain of evidence for prompt quality over time.

Theoretical Basis

Controlled Comparison

Prompt engineering is often treated as an art, but rigorous improvement requires the same controlled comparison methodology used in A/B testing:

  • Fixed dataset: Both prompt versions are evaluated against exactly the same inputs. If the dataset changes between runs, score differences cannot be attributed to the prompt change alone.
  • Fixed evaluation criteria: The same metrics (e.g., labels_exact_match, priority_accuracy) are applied to both versions. Changing the metric between runs invalidates the comparison.
  • Sample-level alignment: By requiring an id column in the dataset and aligning results by this ID, the comparison reveals per-example behavior changes. On a 100-example dataset, an aggregate accuracy gain of 5 percentage points could mask the fact that the new prompt fixed 10 cases but broke 5 others; per-example alignment surfaces this.
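
The masking effect described in the last bullet can be illustrated with a small tally. This is a minimal sketch on synthetic data, not the Ragas implementation: the function name tally_transitions is hypothetical, and it assumes each score is one of the strings "correct" or "incorrect", keyed by sample ID.

```python
from collections import Counter


def tally_transitions(baseline, candidate):
    """Count per-example outcome changes between two runs aligned by sample id.

    Both arguments map sample id -> "correct" / "incorrect" (assumed score values).
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("runs must cover exactly the same sample ids")
    counts = Counter()
    for sample_id, before in baseline.items():
        after = candidate[sample_id]
        if before == after:
            counts["unchanged"] += 1
        elif after == "correct":
            counts["fixed"] += 1
        else:
            counts["broke"] += 1
    return counts


# Synthetic data for illustration: 100 samples, 60 correct at baseline.
baseline = {i: "correct" if i < 60 else "incorrect" for i in range(100)}
candidate = dict(baseline)
for i in range(60, 70):  # the new prompt fixes 10 previously wrong cases
    candidate[i] = "correct"
for i in range(0, 5):    # but it also breaks 5 previously right cases
    candidate[i] = "incorrect"

counts = tally_transitions(baseline, candidate)
net_gain = (counts["fixed"] - counts["broke"]) / len(baseline)
print(counts["fixed"], counts["broke"], f"{net_gain:.0%}")  # 10 5 5%
```

The aggregate view reports only the 5% net gain; the tally shows that it is made up of 10 fixes and 5 regressions, which is the information needed to decide whether the regressions are acceptable.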

Why Per-Example Analysis Matters

Aggregate metrics (overall accuracy, mean score) hide important information:

  • Regressions: A prompt change that improves average performance may introduce regressions on specific cases. Per-example comparison catches these.
  • Failure clustering: By examining which specific examples changed from "incorrect" to "correct" (or vice versa), the practitioner can identify patterns. For instance, the new prompt may handle one category better but struggle with another.
  • Statistical significance: With per-example paired data, statistical tests (e.g., McNemar's test for categorical outcomes) can indicate whether an observed improvement is statistically significant or within noise.
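
As a worked example of the significance check mentioned in the last bullet, the exact (binomial) form of McNemar's test needs only the two discordant counts: how many samples the new prompt fixed and how many it broke. The sketch below uses only the Python standard library; the function name mcnemar_exact is hypothetical and is not part of Ragas.

```python
from math import comb


def mcnemar_exact(fixed, broke):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    fixed: samples incorrect at baseline and correct with the new prompt.
    broke: samples correct at baseline and incorrect with the new prompt.
    Under the null hypothesis each discordant pair is equally likely to
    fall in either direction, so min(fixed, broke) ~ Binomial(n, 0.5).
    """
    n = fixed + broke
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    k = min(fixed, broke)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(fixed=10, broke=5)
print(f"p = {p_value:.3f}")  # p = 0.302
```

With 10 fixes and 5 regressions the p-value is about 0.30, so at a conventional 0.05 threshold this change could not be distinguished from noise, even though aggregate accuracy rose.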

The Iteration Cycle

Prompt improvement follows a hypothesis-driven cycle:

  1. Observe: Examine baseline experiment failures to identify error patterns
  2. Hypothesize: Formulate a prompt change expected to address the observed failures
  3. Test: Run the modified prompt against the same dataset
  4. Compare: Use the comparison tool to quantify the change
  5. Decide: Accept the change if it improves results without unacceptable regressions

This cycle is directly supported by the Ragas experiment and comparison infrastructure. Each experiment run produces a CSV with per-row scores, and the comparison function merges multiple CSVs by ID to enable side-by-side analysis.

Practical Guide

Running Experiments

Each prompt version is evaluated by running evals.py, which builds on the Ragas @experiment() decorator:

# Run baseline experiment
python evals.py run --prompt_file promptv1.txt --name baseline

# Analyze failures, modify the prompt, then run the improved version
python evals.py run --prompt_file promptv2.txt --name improved

Each run produces a CSV file in the experiments/ directory containing per-row results with columns such as id, response, labels_score, and priority_score.
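
To make the shape of a run concrete, here is a minimal stand-in that produces a CSV with the columns named above. It deliberately does not use the Ragas @experiment() decorator. The function names, the stub model, the order-insensitive label comparison, and the "correct"/"incorrect" score values are all assumptions for illustration rather than the behavior of evals.py.

```python
import csv
import io
import json


def labels_exact_match(predicted, expected):
    # Assumed metric behavior: order-insensitive comparison of label sets.
    return "correct" if set(predicted) == set(expected) else "incorrect"


def priority_accuracy(predicted, expected):
    return "correct" if predicted == expected else "incorrect"


def run_experiment(dataset, model, out):
    """Score every dataset row with the same metrics and write one CSV row each."""
    writer = csv.DictWriter(
        out, fieldnames=["id", "response", "labels_score", "priority_score"]
    )
    writer.writeheader()
    for row in dataset:
        response = model(row["text"])
        writer.writerow({
            "id": row["id"],
            "response": json.dumps(response),
            "labels_score": labels_exact_match(
                response["labels"], row["expected_labels"]
            ),
            "priority_score": priority_accuracy(
                response["priority"], row["expected_priority"]
            ),
        })


def stub_model(text):
    # Hypothetical stand-in for "prompt + LLM call"; always gives the same answer.
    return {"labels": ["billing"], "priority": "P1"}


dataset = [
    {"id": "1", "text": "Charged twice", "expected_labels": ["billing"],
     "expected_priority": "P1"},
    {"id": "2", "text": "App crashes", "expected_labels": ["bug"],
     "expected_priority": "P1"},
]
buffer = io.StringIO()
run_experiment(dataset, stub_model, buffer)
print(buffer.getvalue())
```

Because the dataset and the metric functions are passed in unchanged for every run, swapping only the model (that is, the prompt) keeps the comparison controlled.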

Comparing Results

The comparison tool merges experiment CSVs by the shared id column:

python evals.py compare \
    --inputs experiments/baseline.csv experiments/improved.csv

This produces a combined CSV with columns organized as:

Column                    Description
id                        Shared sample identifier
text                      Original input text
expected_labels           Ground truth labels
expected_priority         Ground truth priority
baseline_response         Response from prompt v1
baseline_labels_score     Labels score from prompt v1
baseline_priority_score   Priority score from prompt v1
improved_response         Response from prompt v2
improved_labels_score     Labels score from prompt v2
improved_priority_score   Priority score from prompt v2
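
The column layout above can be reproduced with a short merge routine. This is a sketch of the idea, not the code behind the compare command. It assumes every experiment CSV carries the input columns (id, text, expected_labels, expected_priority) alongside its results, and the name merge_experiments is hypothetical.

```python
import csv
import io

INPUT_COLUMNS = ["id", "text", "expected_labels", "expected_priority"]
RESULT_COLUMNS = ["response", "labels_score", "priority_score"]


def merge_experiments(named_csvs):
    """Merge experiment CSVs on `id`, prefixing result columns with the run name.

    named_csvs: list of (experiment_name, file-like object) pairs.
    """
    merged = {}
    for name, handle in named_csvs:
        for row in csv.DictReader(handle):
            # Input columns are taken from the first run that mentions the id.
            combined = merged.setdefault(
                row["id"], {col: row[col] for col in INPUT_COLUMNS}
            )
            for col in RESULT_COLUMNS:
                combined[f"{name}_{col}"] = row[col]
    return list(merged.values())


header = "id,text,expected_labels,expected_priority,response,labels_score,priority_score\n"
baseline_csv = io.StringIO(
    header
    + "1,Charged twice,billing,P1,resp-a,incorrect,correct\n"
    + "2,App crashes,bug,P0,resp-b,correct,incorrect\n"
)
improved_csv = io.StringIO(
    header
    + "1,Charged twice,billing,P1,resp-c,correct,correct\n"
    + "2,App crashes,bug,P0,resp-d,correct,correct\n"
)

rows = merge_experiments([("baseline", baseline_csv), ("improved", improved_csv)])
print(list(rows[0].keys()))
```

An ID present in only one input would produce a row that lacks the other run's columns, so a stricter version might reject inputs whose ID sets differ.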

Interpreting Results

The comparison function also prints per-experiment accuracy summaries:

baseline Labels Accuracy: 60.00%
baseline Priority Accuracy: 70.00%
improved Labels Accuracy: 75.00%
improved Priority Accuracy: 80.00%

For deeper analysis, load the combined CSV into a DataFrame and compute per-example deltas to identify regressions and improvements.
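
A per-example analysis along these lines might look like the sketch below. It uses the standard-library csv module on an inline sample so that it runs without extra dependencies; the same logic carries over to a DataFrame. The sample rows, the helper names, and the "correct"/"incorrect" score values are assumptions for illustration.

```python
import csv
import io

# Inline stand-in for the combined CSV (score columns only, synthetic rows).
combined_csv = io.StringIO(
    "id,baseline_labels_score,baseline_priority_score,"
    "improved_labels_score,improved_priority_score\n"
    "1,incorrect,correct,correct,correct\n"
    "2,correct,correct,incorrect,correct\n"
    "3,incorrect,incorrect,correct,correct\n"
    "4,correct,incorrect,correct,incorrect\n"
)
rows = list(csv.DictReader(combined_csv))


def accuracy(rows, column):
    """Fraction of rows whose score in `column` is "correct"."""
    return sum(r[column] == "correct" for r in rows) / len(rows)


def changed_ids(rows, metric, old="baseline", new="improved"):
    """Return (improved_ids, regressed_ids) for one metric between two runs."""
    improved, regressed = [], []
    for r in rows:
        before, after = r[f"{old}_{metric}"], r[f"{new}_{metric}"]
        if before != after:
            (improved if after == "correct" else regressed).append(r["id"])
    return improved, regressed


for metric in ("labels_score", "priority_score"):
    gained, lost = changed_ids(rows, metric)
    delta = accuracy(rows, f"improved_{metric}") - accuracy(rows, f"baseline_{metric}")
    print(f"{metric}: delta={delta:+.2%} improved={gained} regressed={lost}")
```

In this sample both metrics gain 25 percentage points, but only labels_score shows a regression (sample 2), which is the row to inspect before accepting the new prompt.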
