Implementation:Explodinggradients Ragas Experiment Comparison Pattern
| Knowledge Sources | Type | Domains | Last Updated |
|---|---|---|---|
| examples/iterate_prompt/evals.py, examples/iterate_prompt/run_prompt.py | Pattern Doc (comparison workflow) | Prompt Engineering, A/B Testing, Experiment Comparison | 2026-02-10 |
Overview
Pattern for systematically comparing multiple prompt experiment results by merging DataFrames on a shared ID column. This pattern enables side-by-side analysis of how different prompt versions perform on identical inputs, producing a combined CSV with per-experiment response and score columns aligned by sample ID.
Description
The Experiment Comparison Pattern implements the complete workflow for prompt iteration:
- Run baseline experiment: Execute a prompt version against an evaluation dataset using the Ragas @experiment() framework, producing a CSV with per-row scores
- Analyze failures: Inspect baseline results to identify error patterns
- Modify prompt: Create a new prompt version addressing observed weaknesses
- Run comparison experiment: Execute the new prompt against the same dataset
- Merge by ID: Use the compare_inputs_to_output() function to align experiment CSVs by the shared id column, producing a combined view with per-experiment columns
- Compute score differences: Print per-experiment accuracy summaries and enable per-example delta analysis
The pattern enforces data integrity by validating that all input CSVs contain the same set of IDs (no missing or extra samples) and that no duplicate IDs exist within any single experiment.
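The two integrity checks described above can be tried on small in-memory DataFrames. This is a minimal sketch, not the code from evals.py: the helper name validate_id_alignment and the sample frames are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd


def validate_id_alignment(frames: list[pd.DataFrame]) -> None:
    """Hypothetical helper: raise ValueError on duplicate or mismatched ids."""
    key_sets = []
    for position, frame in enumerate(frames):
        keys = frame["id"].astype(str)
        # Check 1: an id may appear only once within a single experiment
        if keys.duplicated().any():
            raise ValueError(f"Input {position} contains duplicate id values")
        key_sets.append(set(keys))
    base = key_sets[0]
    # Check 2: every other experiment must cover exactly the same ids
    for position, keys in enumerate(key_sets[1:], start=1):
        if keys != base:
            raise ValueError(
                f"Input {position} differs from the first input: "
                f"missing={sorted(base - keys)}, extra={sorted(keys - base)}"
            )


baseline = pd.DataFrame({"id": [1, 2, 3], "labels_score": ["correct"] * 3})
improved = pd.DataFrame({"id": [1, 2, 4], "labels_score": ["correct"] * 3})

validate_id_alignment([baseline, baseline.copy()])  # identical ids: passes silently

try:
    validate_id_alignment([baseline, improved])
except ValueError as err:
    mismatch_message = str(err)
```

Casting ids to strings before comparing, as the example does, avoids false mismatches when one CSV is read with integer ids and another with string ids.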
Usage
The workflow is driven through a CLI with two subcommands:
# Step 1: Run baseline experiment
python evals.py run --prompt_file promptv1.txt --name baseline
# Step 2: Run improved experiment
python evals.py run --prompt_file promptv2.txt --name improved
# Step 3: Compare results
python evals.py compare \
--inputs experiments/baseline.csv experiments/improved.csv \
--output comparison.csv
Interface Specification
Experiment Function Interface
Each prompt version is evaluated through an @experiment()-decorated async function that accepts a dataset row and additional parameters:
from ragas import Dataset, experiment

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str):
    """
    Experiment function for evaluating a prompt version.
    Args:
        row: Dictionary from the Dataset containing at minimum:
            "id": str -- Unique sample identifier (required for comparison)
            "text": str -- Input text to evaluate
            ... -- Ground truth columns for metric scoring
        prompt_file: str -- Path to the prompt template file
        experiment_name: str -- Name identifying this experiment version
    Returns:
        Dictionary containing:
            "id": str -- Sample ID (preserved for alignment)
            "text": str -- Input text
            "response": str -- Model's raw response
            "experiment_name": str -- Name of this experiment
            "labels_score": str -- Per-metric score ("correct"/"incorrect")
            "priority_score": str -- Per-metric score ("correct"/"incorrect")
            ... -- Additional ground truth and prediction fields
    """
    ...
Comparison Function Interface
The comparison function merges multiple experiment CSVs:
from typing import List, Optional

def compare_inputs_to_output(
    inputs: List[str],
    output_path: Optional[str] = None,
) -> str:
    """
    Compare multiple experiment CSVs and write a combined CSV.
    Args:
        inputs: List of paths to experiment CSV files (minimum 2).
        output_path: Optional output path. Defaults to
            experiments/<timestamp>-comparison.csv.
    Returns:
        The full path to the written comparison CSV.
    Raises:
        ValueError: If fewer than 2 inputs, missing columns, duplicate IDs,
            or mismatched ID sets between inputs.
    """
    ...
Example Implementations
Metric Definitions
Source: examples/iterate_prompt/evals.py (lines 16-65)
The example defines two discrete metrics for evaluating a support ticket triage system:
import json

from ragas.metrics import MetricResult, discrete_metric

@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str):
    """Check if the predicted labels exactly match the expected labels."""
    try:
        parsed_json = json.loads(prediction)
        predicted_labels = parsed_json.get("labels", [])
        predicted_set = set(predicted_labels)
        expected_set = set(expected_labels.split(";")) if expected_labels else set()
        if predicted_set == expected_set:
            return MetricResult(
                value="correct",
                reason=f"Correctly predicted labels: {sorted(list(predicted_set))}",
            )
        else:
            return MetricResult(
                value="incorrect",
                reason=f"Expected labels: {sorted(list(expected_set))}; "
                f"Got labels: {sorted(list(predicted_set))}",
            )
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return MetricResult(
            value="incorrect",
            reason=f"Failed to parse labels from response: {str(e)}",
        )

@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str):
    """Check if the predicted priority matches the expected priority."""
    try:
        parsed_json = json.loads(prediction)
        predicted_priority = parsed_json.get("priority")
        if predicted_priority == expected_priority:
            return MetricResult(
                value="correct",
                reason=f"Correctly predicted priority: {expected_priority}",
            )
        else:
            return MetricResult(
                value="incorrect",
                reason=f"Expected priority: {expected_priority}; "
                f"Got priority: {predicted_priority}",
            )
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return MetricResult(
            value="incorrect",
            reason=f"Failed to parse priority from response: {str(e)}",
        )
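The set comparison at the core of labels_exact_match can be exercised without Ragas. The sketch below drops the decorator and the MetricResult wrapper and returns a bare string; the function name labels_match and the sample inputs are hypothetical and serve only to show how order, partial overlap, and malformed responses are treated.

```python
import json


def labels_match(prediction: str, expected_labels: str) -> str:
    """Hypothetical stand-in: compare label sets, ignoring order and duplicates."""
    try:
        predicted = set(json.loads(prediction).get("labels", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed JSON, a non-object payload, or unhashable labels count as wrong
        return "incorrect"
    # Expected labels arrive as a ";"-separated string; empty means no labels
    expected = set(expected_labels.split(";")) if expected_labels else set()
    return "correct" if predicted == expected else "incorrect"


order_insensitive = labels_match('{"labels": ["refund", "billing"]}', "billing;refund")
partial_overlap = labels_match('{"labels": ["billing"]}', "billing;refund")
not_json = labels_match("labels: billing", "billing")
empty_expected = labels_match('{"labels": []}', "")
```

One deliberate difference from the excerpt above: this sketch also catches AttributeError, which .get raises when the response is valid JSON but not an object (for example a bare list).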
Experiment Function
Source: examples/iterate_prompt/evals.py (lines 68-105)
@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str):
    """Experiment function for support triage evaluation."""
    # Get model response using the specified prompt file
    response = await run_prompt(row["text"], prompt_file=prompt_file)
    # Parse response to extract predicted values
    try:
        parsed_json = json.loads(response)
        predicted_labels = parsed_json.get("labels", [])
        predicted_priority = parsed_json.get("priority")
        predicted_labels_str = (
            ";".join(predicted_labels) if predicted_labels else ""
        )
    except Exception:
        predicted_labels_str = ""
        predicted_priority = None
    # Score the response using both metrics
    labels_score = labels_exact_match.score(
        prediction=response, expected_labels=row["labels"]
    )
    priority_score = priority_accuracy.score(
        prediction=response, expected_priority=row["priority"]
    )
    return {
        "id": row["id"],
        "text": row["text"],
        "response": response,
        "experiment_name": experiment_name,
        "expected_labels": row["labels"],
        "predicted_labels": predicted_labels_str,
        "expected_priority": row["priority"],
        "predicted_priority": predicted_priority,
        "labels_score": labels_score.value,
        "priority_score": priority_score.value,
    }
Prompt Execution (run_prompt)
Source: examples/iterate_prompt/run_prompt.py (lines 8-32)
The prompt is loaded from a file and executed via the OpenAI API with JSON output format:
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

def load_prompt(prompt_file: str) -> str:
    """Load prompt from a text file."""
    with open(prompt_file, "r") as f:
        return f.read().strip()

async def run_prompt(ticket_text: str, prompt_file: str = "promptv1.txt"):
    """Run the prompt against a customer support ticket."""
    system_prompt = load_prompt(prompt_file)
    user_message = f'Ticket: "{ticket_text}"'
    response = await client.chat.completions.create(
        model="gpt-5-mini-2025-08-07",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    # Fall back to an empty string when the model returns no content
    response = (
        response.choices[0].message.content.strip()
        if response.choices[0].message.content
        else ""
    )
    return response
Comparison Logic (Core Pattern)
Source: examples/iterate_prompt/evals.py (lines 142-247)
This is the central implementation that merges experiment results by ID:
# Imports this excerpt relies on
import datetime
import os
from typing import List, Optional

import pandas as pd

def compare_inputs_to_output(
    inputs: List[str], output_path: Optional[str] = None
) -> str:
    """Compare multiple experiment CSVs and write a combined CSV.
    - Requires 'id' column in all inputs; uses it as the alignment key
    - Builds output with id + canonical columns + per-experiment
      response/score columns
    - Returns the full output path
    """
    if not inputs or len(inputs) < 2:
        raise ValueError(
            "At least two input CSV files are required for comparison"
        )
    # Load all inputs and extract experiment names
    dataframes = []
    experiment_names = []
    for path in inputs:
        df = pd.read_csv(path)
        if "experiment_name" not in df.columns:
            raise ValueError(f"Missing 'experiment_name' column in {path}")
        exp_name = str(df["experiment_name"].iloc[0])
        experiment_names.append(exp_name)
        dataframes.append(df)
    canonical_cols = ["text", "expected_labels", "expected_priority"]
    base_df = dataframes[0]
    # Require 'id' in all inputs
    if not all("id" in df.columns for df in dataframes):
        raise ValueError(
            "All input CSVs must contain an 'id' column to align rows."
        )
    # Validate: no duplicate IDs within any input
    key_sets = []
    for idx, df in enumerate(dataframes):
        keys = df["id"].astype(str)
        if keys.duplicated().any():
            dupes = keys[keys.duplicated()].head(3).tolist()
            raise ValueError(
                f"Input {inputs[idx]} contains duplicate id values. "
                f"Examples: {dupes}"
            )
        key_sets.append(set(keys.tolist()))
    # Validate: all inputs have the same set of IDs
    base_keys = key_sets[0]
    for i, ks in enumerate(key_sets[1:], start=1):
        if ks != base_keys:
            missing_in_other = list(base_keys - ks)[:5]
            missing_in_base = list(ks - base_keys)[:5]
            raise ValueError(
                "Inputs do not contain the same set of IDs.\n"
                f"- Missing in file {i + 1}: {missing_in_other}\n"
                f"- Extra in file {i + 1}: {missing_in_base}"
            )
    # Build combined DataFrame using 'id' as alignment key
    base_ids_str = base_df["id"].astype(str)
    combined = base_df[["id"] + canonical_cols].copy()
    # Append per-experiment columns by aligned ID
    for df, exp_name in zip(dataframes, experiment_names):
        df = df.copy()
        df["id"] = df["id"].astype(str)
        df = df.set_index("id")
        for col in ["response", "labels_score", "priority_score"]:
            if col not in df.columns:
                raise ValueError(
                    f"Column '{col}' not found in one input."
                )
        combined[f"{exp_name}_response"] = base_ids_str.map(df["response"])
        combined[f"{exp_name}_labels_score"] = base_ids_str.map(
            df["labels_score"]
        )
        combined[f"{exp_name}_priority_score"] = base_ids_str.map(
            df["priority_score"]
        )
    # Write output
    if output_path is None or output_path.strip() == "":
        run_id = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        # experiments_dir is not defined in this excerpt; it is assumed to
        # hold the path of the experiments/ output directory
        output_path = os.path.join(experiments_dir, f"{run_id}-comparison.csv")
    combined = combined.sort_values(by="id").reset_index(drop=True)
    combined.to_csv(output_path, index=False)
    # Print per-experiment accuracy summary
    for df, exp_name in zip(dataframes, experiment_names):
        try:
            labels_acc = (df["labels_score"] == "correct").mean()
            priority_acc = (df["priority_score"] == "correct").mean()
            print(f"{exp_name} Labels Accuracy: {labels_acc:.2%}")
            print(f"{exp_name} Priority Accuracy: {priority_acc:.2%}")
        except Exception:
            pass
    return output_path
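The alignment step relies on pandas Series.map accepting another Series as its lookup: each value in the calling Series is matched against the index of the argument. A minimal sketch with made-up frames (the ids, texts, and responses are illustrative only) shows that rows line up by id even when the second CSV is stored in a different order:

```python
import pandas as pd

base = pd.DataFrame({"id": ["a", "b", "c"], "text": ["t1", "t2", "t3"]})
# Same ids, but stored in a different row order than the base frame
other = pd.DataFrame({"id": ["c", "a", "b"], "response": ["r-c", "r-a", "r-b"]})

combined = base[["id", "text"]].copy()
# Index the lookup frame by id, then map each base id to its response
lookup = other.set_index("id")["response"]
combined["improved_response"] = base["id"].map(lookup)

aligned = combined["improved_response"].tolist()
```

An id that is absent from the lookup would map to NaN instead of raising, which is why the ID-set validation runs before this step.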
CLI Integration
Source: examples/iterate_prompt/evals.py (lines 296-341)
The CLI provides two subcommands for running experiments and comparing results:
import argparse
import asyncio

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Support Triage Prompt Evaluation CLI"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)
    # run subcommand
    run_parser = subparsers.add_parser("run", help="Run a single experiment")
    run_parser.add_argument(
        "--prompt_file", type=str, required=True,
        help="Prompt file to evaluate",
    )
    run_parser.add_argument(
        "--name", type=str, default=None,
        help="Experiment name (defaults to prompt filename)",
    )
    # compare subcommand
    cmp_parser = subparsers.add_parser(
        "compare", help="Combine multiple experiment CSVs"
    )
    cmp_parser.add_argument(
        "--inputs", nargs="+", required=True,
        help="Input CSV files to compare",
    )
    cmp_parser.add_argument(
        "--output", type=str, default=None,
        help="Output CSV path (defaults to experiments/<timestamp>-comparison.csv)",
    )
    return parser

if __name__ == "__main__":
    parser = build_parser()
    args = parser.parse_args()
    # Dispatch to the handler for the chosen subcommand
    if args.command == "run":
        asyncio.run(run_command(prompt_file=args.prompt_file, name=args.name))
    elif args.command == "compare":
        compare_command(inputs=args.inputs, output=args.output)
Key Observations
- ID-based alignment is mandatory: The comparison function raises a ValueError if any input CSV lacks an id column. This enforces the principle that per-example alignment is essential for meaningful comparison.
- Strict ID set matching: All experiment CSVs must contain exactly the same set of IDs. This prevents misleading comparisons where experiments were run on different subsets of data.
- Duplicate detection: The function catches duplicate IDs within a single experiment, preventing data integrity issues that would silently corrupt the merged output.
- Experiment name extraction: The experiment_name column in each CSV is used to prefix output columns (e.g., baseline_labels_score, improved_labels_score), creating a self-documenting comparison table.
- Scalable to N experiments: While the typical use case is comparing two prompt versions, the compare_inputs_to_output() function accepts any number of input CSVs, enabling multi-version comparison across an entire prompt iteration history.
- Aggregate and per-example analysis: The function provides both a summary printout (per-experiment accuracy percentages) and the detailed combined CSV for per-example drill-down.
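For the per-example drill-down, one option is to load the combined CSV and flag rows whose score changed between two experiments. The sketch below assumes the experiments were named baseline and improved, so the columns are baseline_labels_score and improved_labels_score, and it builds the table in memory instead of reading comparison.csv; the rows are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_csv("comparison.csv"); values are illustrative
comparison = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "baseline_labels_score": ["correct", "incorrect", "incorrect", "correct"],
        "improved_labels_score": ["correct", "correct", "incorrect", "incorrect"],
    }
)

before = comparison["baseline_labels_score"] == "correct"
after = comparison["improved_labels_score"] == "correct"

# Rows the new prompt fixed (wrong -> right) and rows it broke (right -> wrong)
fixed_ids = comparison.loc[~before & after, "id"].tolist()
regressed_ids = comparison.loc[before & ~after, "id"].tolist()
# Net change in accuracy; it can be zero even when individual rows moved
accuracy_delta = after.mean() - before.mean()
```

In this made-up data the aggregate accuracy is unchanged while one example improved and another regressed, which is the kind of movement the accuracy summary alone would hide.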