Principle:Explodinggradients Ragas Prompt Iteration Comparison
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| examples/iterate_prompt/evals.py, examples/iterate_prompt/run_prompt.py | Prompt Engineering, A/B Testing, Experiment Comparison | 2026-02-10 |
Overview
Description
The Prompt Iteration Comparison principle establishes a systematic approach to A/B comparison of prompt versions using side-by-side experiment results. Rather than manually inspecting outputs or relying on subjective judgment, this principle requires that prompt improvement be driven by controlled experiments where identical datasets are processed through different prompt versions and the resulting scores are compared quantitatively. Alignment by a shared sample ID enables per-example analysis, revealing not just aggregate improvements but also which specific cases improved or regressed.
Usage
When iterating on a prompt, a practitioner:
- Defines a baseline prompt and runs it through the Ragas experiment framework against a fixed evaluation dataset
- Analyzes the baseline results to identify failure patterns and areas for improvement
- Modifies the prompt (adjusting instructions, examples, output format, etc.)
- Runs the modified prompt through the same experiment framework against the same dataset
- Uses the comparison function to merge the two experiment result CSVs by sample ID, producing a combined view with per-experiment response and score columns
- Examines aggregate accuracy changes and per-example score differences to determine whether the prompt change is an improvement
This workflow can be repeated iteratively, comparing each new version against the previous best, building a chain of evidence for prompt quality over time.
Theoretical Basis
Controlled Comparison
Prompt engineering is often treated as an art, but rigorous improvement requires the same controlled comparison methodology used in A/B testing:
- Fixed dataset: Both prompt versions are evaluated against exactly the same inputs. If the dataset changes between runs, score differences cannot be attributed to the prompt change alone.
- Fixed evaluation criteria: The same metrics (e.g., labels_exact_match, priority_accuracy) are applied to both versions. Changing the metric between runs invalidates the comparison.
- Sample-level alignment: By requiring an id column in the dataset and aligning results by this ID, the comparison reveals per-example behavior changes. An aggregate accuracy improvement of 5% could mask the fact that the new prompt fixed 10 cases but broke 5 others -- per-example alignment surfaces this.
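The "fixed 10 but broke 5" situation can be made concrete with a small sketch. It assumes each run has already been reduced to a mapping from sample ID to a boolean outcome; the function name `transition_counts` and the synthetic 100-sample data are illustrative, not part of the Ragas example code.

```python
from collections import Counter


def transition_counts(baseline, candidate):
    """Count per-example outcome transitions between two runs.

    Both arguments map a sample ID to True (correct) or False (incorrect).
    Only IDs present in both runs are compared.
    """
    counts = Counter()
    for sample_id in baseline.keys() & candidate.keys():
        before, after = baseline[sample_id], candidate[sample_id]
        if not before and after:
            counts["fixed"] += 1
        elif before and not after:
            counts["broken"] += 1
        elif before and after:
            counts["still_correct"] += 1
        else:
            counts["still_incorrect"] += 1
    return counts


# Synthetic data (assumption): 100 samples, 60 correct under the baseline.
baseline = {i: i < 60 for i in range(100)}
candidate = dict(baseline)
for i in range(60, 70):  # 10 previously wrong cases become correct
    candidate[i] = True
for i in range(0, 5):    # 5 previously correct cases become wrong
    candidate[i] = False

counts = transition_counts(baseline, candidate)
net_gain = (counts["fixed"] - counts["broken"]) / len(baseline)
print(counts["fixed"], counts["broken"], f"{net_gain:.0%}")  # 10 5 5%
```

The aggregate view reports only the 5% net gain; the transition counts show that 15 examples actually changed behavior.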
Why Per-Example Analysis Matters
Aggregate metrics (overall accuracy, mean score) hide important information:
- Regressions: A prompt change that improves average performance may introduce regressions on specific cases. Per-example comparison catches these.
- Failure clustering: By examining which specific examples changed from "incorrect" to "correct" (or vice versa), the practitioner can identify patterns -- perhaps the new prompt handles one category better but struggles with another.
- Statistical significance: With per-example paired data, statistical tests (e.g., McNemar's test for categorical outcomes) can determine whether the improvement is statistically meaningful or within noise.
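As a rough illustration of the paired test mentioned above, the following sketch computes an exact two-sided McNemar p-value from the two discordant counts using only the standard library. The function name `mcnemar_exact` and the example counts are assumptions for illustration; the Ragas example itself does not ship a significance test.

```python
from math import comb


def mcnemar_exact(fixed, broken):
    """Two-sided exact McNemar p-value from discordant pair counts.

    fixed:  examples wrong under the baseline, right under the candidate
    broken: examples right under the baseline, wrong under the candidate
    Under the null hypothesis each discordant pair is equally likely to go
    either way, so the smaller count follows Binomial(n, 0.5).
    """
    n = fixed + broken
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(fixed, broken)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(fixed=10, broken=5)
print(f"p = {p_value:.3f}")  # p = 0.302
```

With 10 fixed and 5 broken cases the p-value is about 0.30, so a change of that size on so few discordant examples is not distinguishable from noise at conventional thresholds.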
The Iteration Cycle
Prompt improvement follows a hypothesis-driven cycle:
- Observe: Examine baseline experiment failures to identify error patterns
- Hypothesize: Formulate a prompt change expected to address the observed failures
- Test: Run the modified prompt against the same dataset
- Compare: Use the comparison tool to quantify the change
- Decide: Accept the change if it improves results without unacceptable regressions
This cycle is directly supported by the Ragas experiment and comparison infrastructure. Each experiment run produces a CSV with per-row scores, and the comparison function merges multiple CSVs by ID to enable side-by-side analysis.
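The Decide step can be written down as an explicit rule so that acceptance criteria are agreed before the results are seen. The sketch below is one possible policy, not something the Ragas example prescribes; `accept_change`, `min_gain`, and `max_regressions` are hypothetical names and thresholds.

```python
def accept_change(baseline_acc, candidate_acc, regressed_ids,
                  min_gain=0.0, max_regressions=0):
    """Return True if the candidate prompt should replace the baseline.

    The candidate must beat the baseline accuracy by more than min_gain
    and must not regress on more than max_regressions examples.
    """
    gain = candidate_acc - baseline_acc
    return gain > min_gain and len(regressed_ids) <= max_regressions


# Hypothetical numbers: accuracy rises from 60% to 75% with one regression.
strict = accept_change(0.60, 0.75, regressed_ids=["t-17"])
lenient = accept_change(0.60, 0.75, regressed_ids=["t-17"], max_regressions=2)
print(strict, lenient)  # False True
```

Whether a single regression should block a change depends on the application; the point of the sketch is only that the tolerance is stated up front.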
Practical Guide
Running Experiments
Each prompt version is evaluated by an experiment function defined with the Ragas @experiment() decorator and invoked from the command line:
# Run baseline experiment
python evals.py run --prompt_file promptv1.txt --name baseline
# Analyze failures, modify the prompt, then run the improved version
python evals.py run --prompt_file promptv2.txt --name improved
Each run produces a CSV file in the experiments/ directory containing per-row results with columns such as id, response, labels_score, and priority_score.
Comparing Results
The comparison tool merges experiment CSVs by the shared id column:
python evals.py compare \
--inputs experiments/baseline.csv experiments/improved.csv
This produces a combined CSV with columns organized as:
| Column | Description |
|---|---|
| id | Shared sample identifier |
| text | Original input text |
| expected_labels | Ground truth labels |
| expected_priority | Ground truth priority |
| baseline_response | Response from prompt v1 |
| baseline_labels_score | Score from prompt v1 |
| baseline_priority_score | Priority score from prompt v1 |
| improved_response | Response from prompt v2 |
| improved_labels_score | Score from prompt v2 |
| improved_priority_score | Priority score from prompt v2 |
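To show how a column layout like this can arise, here is a minimal standard-library sketch of an ID-based merge. It is not the implementation in evals.py; `merge_runs`, the inline CSV contents, and the "correct"/"incorrect" score values are assumptions made for illustration.

```python
import csv
import io

SHARED = ["id", "text", "expected_labels", "expected_priority"]
PER_RUN = ["response", "labels_score", "priority_score"]


def merge_runs(runs):
    """Merge experiment rows by 'id'.

    runs maps an experiment name to a list of row dicts (as produced by
    csv.DictReader). Only IDs present in every run are kept, and per-run
    columns are prefixed with the experiment name.
    """
    names = list(runs)
    by_id = {name: {row["id"]: row for row in rows}
             for name, rows in runs.items()}
    merged = []
    for row in runs[names[0]]:
        sample_id = row["id"]
        if not all(sample_id in by_id[name] for name in names):
            continue  # skip samples missing from any run
        out = {col: row[col] for col in SHARED}
        for name in names:
            for col in PER_RUN:
                out[f"{name}_{col}"] = by_id[name][sample_id][col]
        merged.append(out)
    return merged


# Made-up experiment outputs standing in for the files under experiments/.
baseline_csv = """id,text,expected_labels,expected_priority,response,labels_score,priority_score
1,Login page crashes,bug,high,bug/high,correct,correct
2,Add dark mode,feature,low,bug/low,incorrect,correct
3,Update docs,docs,low,docs/low,correct,correct
"""
improved_csv = """id,text,expected_labels,expected_priority,response,labels_score,priority_score
1,Login page crashes,bug,high,bug/high,correct,correct
2,Add dark mode,feature,low,feature/low,correct,correct
"""

runs = {
    "baseline": list(csv.DictReader(io.StringIO(baseline_csv))),
    "improved": list(csv.DictReader(io.StringIO(improved_csv))),
}
merged = merge_runs(runs)
print(list(merged[0]))
print(len(merged))  # 2, because sample 3 is missing from the improved run
```

Dropping samples that are absent from any run keeps the comparison paired; a real tool might instead report them as errors.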
Interpreting Results
The comparison function also prints per-experiment accuracy summaries:
baseline Labels Accuracy: 60.00%
baseline Priority Accuracy: 70.00%
improved Labels Accuracy: 75.00%
improved Priority Accuracy: 80.00%
For deeper analysis, load the combined CSV into a DataFrame and compute per-example deltas to identify regressions and improvements.
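A dependency-free sketch of the per-example delta computation follows. It uses the standard-library csv module in place of a DataFrame, and it assumes score cells hold either "correct"/"incorrect" strings or numeric text; `score_value`, `per_example_deltas`, and the inline data are illustrative names and values.

```python
import csv
import io


def score_value(raw):
    """Map a score cell to a number (assumed formats: correct/incorrect or numeric)."""
    raw = raw.strip().lower()
    if raw in ("correct", "incorrect"):
        return 1.0 if raw == "correct" else 0.0
    return float(raw)


def per_example_deltas(rows, old="baseline", new="improved",
                       metric="labels_score"):
    """Return {id: new score minus old score} for one metric."""
    return {
        row["id"]: score_value(row[f"{new}_{metric}"])
        - score_value(row[f"{old}_{metric}"])
        for row in rows
    }


# Made-up slice of a combined CSV with only the columns needed here.
combined_csv = """id,baseline_labels_score,improved_labels_score
a1,incorrect,correct
a2,correct,correct
a3,correct,incorrect
"""

rows = list(csv.DictReader(io.StringIO(combined_csv)))
deltas = per_example_deltas(rows)
improvements = [i for i, d in deltas.items() if d > 0]
regressions = [i for i, d in deltas.items() if d < 0]
print(improvements)  # ['a1']
print(regressions)   # ['a3']
```

The regression list is the natural starting point for the next Observe step of the iteration cycle.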