Principle:Explodinggradients Ragas Prompt Iteration Comparison

Knowledge Sources	Domains	Last Updated
`examples/iterate_prompt/evals.py`, `examples/iterate_prompt/run_prompt.py`	Prompt Engineering, A/B Testing, Experiment Comparison	2026-02-10

Overview

Description

The Prompt Iteration Comparison principle establishes a systematic approach to A/B comparison of prompt versions using side-by-side experiment results. Rather than manually inspecting outputs or relying on subjective judgment, this principle requires that prompt improvement be driven by controlled experiments where identical datasets are processed through different prompt versions and the resulting scores are compared quantitatively. Alignment by a shared sample ID enables per-example analysis, revealing not just aggregate improvements but also which specific cases improved or regressed.

Usage

When iterating on a prompt, a practitioner:

Defines a baseline prompt and runs it through the Ragas experiment framework against a fixed evaluation dataset
Analyzes the baseline results to identify failure patterns and areas for improvement
Modifies the prompt (adjusting instructions, examples, output format, etc.)
Runs the modified prompt through the same experiment framework against the same dataset
Uses the comparison function to merge the two experiment result CSVs by sample ID, producing a combined view with per-experiment response and score columns
Examines aggregate accuracy changes and per-example score differences to determine whether the prompt change is an improvement

This workflow can be repeated iteratively, comparing each new version against the previous best, building a chain of evidence for prompt quality over time.

Theoretical Basis

Controlled Comparison

Prompt engineering is often treated as an art, but rigorous improvement requires the same controlled comparison methodology used in A/B testing:

Fixed dataset: Both prompt versions are evaluated against exactly the same inputs. If the dataset changes between runs, score differences cannot be attributed to the prompt change alone.
Fixed evaluation criteria: The same metrics (e.g., labels_exact_match, priority_accuracy) are applied to both versions. Changing the metric between runs invalidates the comparison.
Sample-level alignment: By requiring an id column in the dataset and aligning results by this ID, the comparison reveals per-example behavior changes. On a 100-example dataset, an aggregate accuracy improvement of 5 percentage points could mask the fact that the new prompt fixed 10 cases but broke 5 others; per-example alignment surfaces this.
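The fixed-versus-broken breakdown described above can be sketched in a few lines. This is a minimal illustration, not part of the Ragas examples: the helper name transition_counts and the sample scores are hypothetical, and it assumes each score is 1 (correct) or 0 (incorrect) keyed by sample id.

```python
from collections import Counter

def transition_counts(baseline, improved):
    """Count per-example outcome transitions between two runs.

    Both arguments map a sample id to 1 (correct) or 0 (incorrect).
    Only ids present in both runs are compared.
    """
    shared_ids = baseline.keys() & improved.keys()
    labels = {
        (0, 1): "fixed",
        (1, 0): "broken",
        (1, 1): "still_correct",
        (0, 0): "still_wrong",
    }
    return Counter(labels[(baseline[i], improved[i])] for i in shared_ids)

# Hypothetical scores for six samples (assumed data, for illustration only).
baseline = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 1, "f": 0}
improved = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 1, "f": 0}

counts = transition_counts(baseline, improved)
print(dict(counts))
```

Here the net gain is one example, but the breakdown shows that it comes from two fixes and one regression, which the aggregate alone would not reveal.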

Why Per-Example Analysis Matters

Aggregate metrics (overall accuracy, mean score) hide important information:

Regressions: A prompt change that improves average performance may introduce regressions on specific cases. Per-example comparison catches these.
Failure clustering: By examining which specific examples changed from "incorrect" to "correct" (or vice versa), the practitioner can identify patterns -- perhaps the new prompt handles one category better but struggles with another.
Statistical significance: With per-example paired data, statistical tests (e.g., McNemar's test for categorical outcomes) can determine whether the improvement is statistically meaningful or within noise.
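As a rough illustration of the paired test mentioned above, the following sketch computes a two-sided exact McNemar p-value from the counts of discordant pairs, using only the standard library. The function name mcnemar_exact is hypothetical, and the example reuses the "fixed 10, broke 5" scenario from earlier in this section.

```python
from math import comb

def mcnemar_exact(fixed, broken):
    """Two-sided exact McNemar p-value from discordant-pair counts.

    Under the null hypothesis, each discordant pair is equally likely to be
    a fix or a regression, so the smaller count follows Binomial(n, 0.5).
    """
    n = fixed + broken
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(fixed, broken)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 10 examples fixed, 5 broken by the new prompt.
p = mcnemar_exact(fixed=10, broken=5)
print(f"p-value: {p:.3f}")
```

With these counts the p-value is roughly 0.30, so a 10-versus-5 split on its own would not be strong evidence that the new prompt is better.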

The Iteration Cycle

Prompt improvement follows a hypothesis-driven cycle:

Observe: Examine baseline experiment failures to identify error patterns
Hypothesize: Formulate a prompt change expected to address the observed failures
Test: Run the modified prompt against the same dataset
Compare: Use the comparison tool to quantify the change
Decide: Accept the change if it improves results without unacceptable regressions

This cycle is directly supported by the Ragas experiment and comparison infrastructure. Each experiment run produces a CSV with per-row scores, and the comparison function merges multiple CSVs by ID to enable side-by-side analysis.

Practical Guide

Running Experiments

Each prompt version is evaluated by an experiment defined with the Ragas @experiment() decorator and launched from the command line:

# Run baseline experiment
python evals.py run --prompt_file promptv1.txt --name baseline

# Analyze failures, modify the prompt, then run the improved version
python evals.py run --prompt_file promptv2.txt --name improved

Each run produces a CSV file in the experiments/ directory containing per-row results with columns such as id, response, labels_score, and priority_score.

Comparing Results

The comparison tool merges experiment CSVs by the shared id column:

python evals.py compare \
    --inputs experiments/baseline.csv experiments/improved.csv

This produces a combined CSV with columns organized as:

Column	Description
`id`	Shared sample identifier
`text`	Original input text
`expected_labels`	Ground truth labels
`expected_priority`	Ground truth priority
`baseline_response`	Response from prompt v1
`baseline_labels_score`	Labels score from prompt v1
`baseline_priority_score`	Priority score from prompt v1
`improved_response`	Response from prompt v2
`improved_labels_score`	Labels score from prompt v2
`improved_priority_score`	Priority score from prompt v2
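The column layout above could be produced by a merge along the following lines. This is a sketch under stated assumptions, not the implementation in evals.py: the function merge_runs, the constants SHARED and PER_RUN, and the in-memory CSV data are all hypothetical, and only ids present in every run are kept.

```python
import csv
import io

SHARED = ["id", "text", "expected_labels", "expected_priority"]
PER_RUN = ["response", "labels_score", "priority_score"]

def merge_runs(runs):
    """Merge experiment rows by id.

    `runs` maps an experiment name to a list of row dicts (as produced by
    csv.DictReader). Per-run columns are prefixed with the experiment name.
    """
    indexed = {name: {row["id"]: row for row in rows} for name, rows in runs.items()}
    first = next(iter(indexed.values()))
    merged = []
    for sample_id, base_row in first.items():
        if not all(sample_id in rows for rows in indexed.values()):
            continue  # skip ids missing from any run
        out = {col: base_row[col] for col in SHARED}
        for name, rows in indexed.items():
            for col in PER_RUN:
                out[f"{name}_{col}"] = rows[sample_id][col]
        merged.append(out)
    return merged

def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical experiment outputs (assumed data, for illustration only).
BASELINE_CSV = """id,text,expected_labels,expected_priority,response,labels_score,priority_score
1,Login fails,bug,high,bug/low,1,0
2,Add dark mode,feature,low,bug/low,0,1
"""
IMPROVED_CSV = """id,text,expected_labels,expected_priority,response,labels_score,priority_score
1,Login fails,bug,high,bug/high,1,1
2,Add dark mode,feature,low,feature/low,1,1
"""

merged = merge_runs({
    "baseline": read_rows(BASELINE_CSV),
    "improved": read_rows(IMPROVED_CSV),
})
print(list(merged[0].keys()))
```

Prefixing per-run columns with the experiment name keeps the shared ground-truth columns unduplicated while letting any number of runs sit side by side.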

Interpreting Results

The comparison function also prints per-experiment accuracy summaries:

baseline Labels Accuracy: 60.00%
baseline Priority Accuracy: 70.00%
improved Labels Accuracy: 75.00%
improved Priority Accuracy: 80.00%
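A summary in this format can be reproduced from per-row scores. The sketch below assumes that accuracy is the mean of 0/1 scores for each metric; the helper accuracy_summary and the score lists are hypothetical and are not taken from evals.py.

```python
def accuracy_summary(name, scores_by_metric):
    """Format one accuracy line per metric, assuming 0/1 per-row scores."""
    lines = []
    for metric, scores in scores_by_metric.items():
        accuracy = sum(scores) / len(scores)
        lines.append(f"{name} {metric} Accuracy: {accuracy:.2%}")
    return lines

# Hypothetical per-row scores for a five-example run.
summary = accuracy_summary("baseline", {
    "Labels": [1, 1, 1, 0, 0],
    "Priority": [1, 1, 1, 1, 0],
})
for line in summary:
    print(line)
```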

For deeper analysis, load the combined CSV into a DataFrame and compute per-example deltas to identify regressions and improvements.
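A minimal sketch of that delta computation with pandas follows. The column names match the combined CSV described above, but the data is hypothetical and loaded from an in-memory string; in practice the argument to pd.read_csv would be the path of the combined CSV, which is not specified here.

```python
import io

import pandas as pd

# Hypothetical slice of a combined CSV (assumed data, for illustration only).
COMBINED_CSV = """id,baseline_labels_score,improved_labels_score
1,0,1
2,1,1
3,1,0
4,0,1
"""

df = pd.read_csv(io.StringIO(COMBINED_CSV))

# Positive delta: the new prompt fixed the example; negative: it regressed.
df["labels_delta"] = df["improved_labels_score"] - df["baseline_labels_score"]

improvements = df.loc[df["labels_delta"] > 0, "id"].tolist()
regressions = df.loc[df["labels_delta"] < 0, "id"].tolist()

print("improved:", improvements)
print("regressed:", regressions)
```

The same pattern applies to the priority score columns; inspecting the text and response columns for the regressed ids is usually the quickest way to form the next hypothesis.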

Related Pages

Implementation:Explodinggradients_Ragas_Experiment_Comparison_Pattern
