Implementation:Explodinggradients Ragas Experiment Comparison Pattern

From Leeroopedia


Knowledge Sources: examples/iterate_prompt/evals.py, examples/iterate_prompt/run_prompt.py
Type: Pattern Doc (comparison workflow)
Domains: Prompt Engineering, A/B Testing, Experiment Comparison
Last Updated: 2026-02-10

Overview

Pattern for systematically comparing multiple prompt experiment results by merging DataFrames on a shared ID column. This pattern enables side-by-side analysis of how different prompt versions perform on identical inputs, producing a combined CSV with per-experiment response and score columns aligned by sample ID.

Description

The Experiment Comparison Pattern implements the complete workflow for prompt iteration:

  1. Run baseline experiment: Execute a prompt version against an evaluation dataset using the Ragas @experiment() framework, producing a CSV with per-row scores
  2. Analyze failures: Inspect baseline results to identify error patterns
  3. Modify prompt: Create a new prompt version addressing observed weaknesses
  4. Run comparison experiment: Execute the new prompt against the same dataset
  5. Merge by ID: Use the compare_inputs_to_output() function to align experiment CSVs by the shared id column, producing a combined view with per-experiment columns
  6. Summarize and inspect differences: Print per-experiment accuracy summaries; the combined CSV then supports per-example delta analysis, since each row holds every experiment's scores for the same sample

The pattern enforces data integrity by validating that all input CSVs contain the same set of IDs (no missing or extra samples) and that no duplicate IDs exist within any single experiment.
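The two integrity checks described above can be sketched without pandas. This is a minimal illustration, not the code from evals.py: the function name validate_ids and the dictionary-of-lists input are assumptions made for the example.

```python
from collections import Counter


def validate_ids(id_lists):
    """Raise ValueError on duplicate IDs or mismatched ID sets.

    `id_lists` maps a (hypothetical) file label to that file's list of IDs.
    """
    key_sets = {}
    for label, ids in id_lists.items():
        keys = [str(i) for i in ids]  # compare IDs as strings
        dupes = [k for k, n in Counter(keys).items() if n > 1]
        if dupes:
            raise ValueError(f"{label} contains duplicate id values: {dupes[:3]}")
        key_sets[label] = set(keys)

    labels = list(key_sets)
    base_label, base_keys = labels[0], key_sets[labels[0]]
    for label in labels[1:]:
        if key_sets[label] != base_keys:
            missing = sorted(base_keys - key_sets[label])[:5]
            extra = sorted(key_sets[label] - base_keys)[:5]
            raise ValueError(
                f"{label} does not match {base_label}: "
                f"missing={missing}, extra={extra}"
            )


# Same IDs in a different row order pass; a missing ID is rejected.
validate_ids({"baseline": [1, 2, 3], "improved": [3, 1, 2]})
try:
    validate_ids({"baseline": [1, 2, 3], "improved": [1, 2]})
except ValueError as err:
    print(err)
```

Row order is deliberately ignored here: only the set of IDs must agree, which matches the alignment-by-ID approach of the pattern.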

Usage

The workflow is driven through a CLI with two subcommands:

# Step 1: Run baseline experiment
python evals.py run --prompt_file promptv1.txt --name baseline

# Step 2: Run improved experiment
python evals.py run --prompt_file promptv2.txt --name improved

# Step 3: Compare results
python evals.py compare \
    --inputs experiments/baseline.csv experiments/improved.csv \
    --output comparison.csv

Interface Specification

Experiment Function Interface

Each prompt version is evaluated through an @experiment()-decorated async function that accepts a dataset row and additional parameters:

from ragas import Dataset, experiment

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str):
    """
    Experiment function for evaluating a prompt version.

    Args:
        row: Dictionary from the Dataset containing at minimum:
            "id": str       -- Unique sample identifier (required for comparison)
            "text": str     -- Input text to evaluate
            ...             -- Ground truth columns for metric scoring

        prompt_file: str    -- Path to the prompt template file
        experiment_name: str -- Name identifying this experiment version

    Returns:
        Dictionary containing:
            "id": str                   -- Sample ID (preserved for alignment)
            "text": str                 -- Input text
            "response": str             -- Model's raw response
            "experiment_name": str      -- Name of this experiment
            "labels_score": str         -- Per-metric score ("correct"/"incorrect")
            "priority_score": str       -- Per-metric score ("correct"/"incorrect")
            ...                         -- Additional ground truth and prediction fields
    """
    ...

Comparison Function Interface

The comparison function merges multiple experiment CSVs:

from typing import List, Optional

def compare_inputs_to_output(
    inputs: List[str],
    output_path: Optional[str] = None,
) -> str:
    """
    Compare multiple experiment CSVs and write a combined CSV.

    Args:
        inputs: List of paths to experiment CSV files (minimum 2).
        output_path: Optional output path. Defaults to
                     experiments/<timestamp>-comparison.csv.

    Returns:
        The full path to the written comparison CSV.

    Raises:
        ValueError: If fewer than 2 inputs, missing columns, duplicate IDs,
                    or mismatched ID sets between inputs.
    """
    ...

Example Implementations

Metric Definitions

Source: examples/iterate_prompt/evals.py (lines 16-65)

The example defines two discrete metrics for evaluating a support ticket triage system:

import json

from ragas.metrics import MetricResult, discrete_metric

@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str):
    """Check if the predicted labels exactly match the expected labels."""
    try:
        parsed_json = json.loads(prediction)
        predicted_labels = parsed_json.get("labels", [])

        predicted_set = set(predicted_labels)
        expected_set = set(expected_labels.split(";")) if expected_labels else set()

        if predicted_set == expected_set:
            return MetricResult(
                value="correct",
                reason=f"Correctly predicted labels: {sorted(list(predicted_set))}",
            )
        else:
            return MetricResult(
                value="incorrect",
                reason=f"Expected labels: {sorted(list(expected_set))}; "
                       f"Got labels: {sorted(list(predicted_set))}",
            )
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return MetricResult(
            value="incorrect",
            reason=f"Failed to parse labels from response: {str(e)}",
        )


@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str):
    """Check if the predicted priority matches the expected priority."""
    try:
        parsed_json = json.loads(prediction)
        predicted_priority = parsed_json.get("priority")

        if predicted_priority == expected_priority:
            return MetricResult(
                value="correct",
                reason=f"Correctly predicted priority: {expected_priority}",
            )
        else:
            return MetricResult(
                value="incorrect",
                reason=f"Expected priority: {expected_priority}; "
                       f"Got priority: {predicted_priority}",
            )
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return MetricResult(
            value="incorrect",
            reason=f"Failed to parse priority from response: {str(e)}",
        )
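The label comparison is set-based, so the order of predicted labels does not matter and the expected labels are read from a semicolon-separated string. The following dependency-free sketch isolates that logic; labels_match is a hypothetical helper that returns a plain boolean rather than a MetricResult, and it additionally treats a non-object JSON response as incorrect.

```python
import json


def labels_match(prediction: str, expected_labels: str) -> bool:
    """Return True when the predicted label set equals the expected set."""
    try:
        predicted = set(json.loads(prediction).get("labels", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Unparseable or non-object responses count as incorrect.
        return False
    expected = set(expected_labels.split(";")) if expected_labels else set()
    return predicted == expected


# Order is irrelevant; a missing label is a mismatch.
print(labels_match('{"labels": ["billing", "refund"]}', "refund;billing"))
print(labels_match('{"labels": ["billing"]}', "refund;billing"))
```

An empty expected_labels string maps to the empty set, so a response with an empty labels list is scored as a match in that case.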

Experiment Function

Source: examples/iterate_prompt/evals.py (lines 68-105)

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str):
    """Experiment function for support triage evaluation."""
    # Get model response using the specified prompt file
    response = await run_prompt(row["text"], prompt_file=prompt_file)

    # Parse response to extract predicted values
    try:
        parsed_json = json.loads(response)
        predicted_labels = parsed_json.get("labels", [])
        predicted_priority = parsed_json.get("priority")
        predicted_labels_str = (
            ";".join(predicted_labels) if predicted_labels else ""
        )
    except Exception:
        predicted_labels_str = ""
        predicted_priority = None

    # Score the response using both metrics
    labels_score = labels_exact_match.score(
        prediction=response, expected_labels=row["labels"]
    )
    priority_score = priority_accuracy.score(
        prediction=response, expected_priority=row["priority"]
    )

    return {
        "id": row["id"],
        "text": row["text"],
        "response": response,
        "experiment_name": experiment_name,
        "expected_labels": row["labels"],
        "predicted_labels": predicted_labels_str,
        "expected_priority": row["priority"],
        "predicted_priority": predicted_priority,
        "labels_score": labels_score.value,
        "priority_score": priority_score.value,
    }

Prompt Execution (run_prompt)

Source: examples/iterate_prompt/run_prompt.py (lines 8-32)

The prompt is loaded from a file and executed via the OpenAI API with JSON output format:

import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])


def load_prompt(prompt_file: str) -> str:
    """Load prompt from a text file."""
    with open(prompt_file, "r") as f:
        return f.read().strip()


async def run_prompt(ticket_text: str, prompt_file: str = "promptv1.txt"):
    """Run the prompt against a customer support ticket."""
    system_prompt = load_prompt(prompt_file)
    user_message = f'Ticket: "{ticket_text}"'

    response = await client.chat.completions.create(
        model="gpt-5-mini-2025-08-07",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    response = (
        response.choices[0].message.content.strip()
        if response.choices[0].message.content
        else ""
    )
    return response

Comparison Logic (Core Pattern)

Source: examples/iterate_prompt/evals.py (lines 142-247)

This is the central implementation that merges experiment results by ID:

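import datetime
import os
from typing import List, Optional

import pandas as pd

# Note: experiments_dir is defined outside this excerpt.


def compare_inputs_to_output(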
    inputs: List[str], output_path: Optional[str] = None
) -> str:
    """Compare multiple experiment CSVs and write a combined CSV.

    - Requires 'id' column in all inputs; uses it as the alignment key
    - Builds output with id + canonical columns + per-experiment
      response/score columns
    - Returns the full output path
    """
    if not inputs or len(inputs) < 2:
        raise ValueError(
            "At least two input CSV files are required for comparison"
        )

    # Load all inputs and extract experiment names
    dataframes = []
    experiment_names = []
    for path in inputs:
        df = pd.read_csv(path)
        if "experiment_name" not in df.columns:
            raise ValueError(f"Missing 'experiment_name' column in {path}")
        exp_name = str(df["experiment_name"].iloc[0])
        experiment_names.append(exp_name)
        dataframes.append(df)

    canonical_cols = ["text", "expected_labels", "expected_priority"]
    base_df = dataframes[0]

    # Require 'id' in all inputs
    if not all("id" in df.columns for df in dataframes):
        raise ValueError(
            "All input CSVs must contain an 'id' column to align rows."
        )

    # Validate: no duplicate IDs within any input
    key_sets = []
    for idx, df in enumerate(dataframes):
        keys = df["id"].astype(str)
        if keys.duplicated().any():
            dupes = keys[keys.duplicated()].head(3).tolist()
            raise ValueError(
                f"Input {inputs[idx]} contains duplicate id values. "
                f"Examples: {dupes}"
            )
        key_sets.append(set(keys.tolist()))

    # Validate: all inputs have the same set of IDs
    base_keys = key_sets[0]
    for i, ks in enumerate(key_sets[1:], start=1):
        if ks != base_keys:
            missing_in_other = list(base_keys - ks)[:5]
            missing_in_base = list(ks - base_keys)[:5]
            raise ValueError(
                "Inputs do not contain the same set of IDs.\n"
                f"- Missing in file {i + 1}: {missing_in_other}\n"
                f"- Extra in file {i + 1}: {missing_in_base}"
            )

    # Build combined DataFrame using 'id' as alignment key
    base_ids_str = base_df["id"].astype(str)
    combined = base_df[["id"] + canonical_cols].copy()

    # Append per-experiment columns by aligned ID
    for df, exp_name in zip(dataframes, experiment_names):
        df = df.copy()
        df["id"] = df["id"].astype(str)
        df = df.set_index("id")
        for col in ["response", "labels_score", "priority_score"]:
            if col not in df.columns:
                raise ValueError(
                    f"Column '{col}' not found in one input."
                )
        combined[f"{exp_name}_response"] = base_ids_str.map(df["response"])
        combined[f"{exp_name}_labels_score"] = base_ids_str.map(
            df["labels_score"]
        )
        combined[f"{exp_name}_priority_score"] = base_ids_str.map(
            df["priority_score"]
        )

    # Write output
    if output_path is None or output_path.strip() == "":
        run_id = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        output_path = os.path.join(experiments_dir, f"{run_id}-comparison.csv")

    combined = combined.sort_values(by="id").reset_index(drop=True)
    combined.to_csv(output_path, index=False)

    # Print per-experiment accuracy summary
    for df, exp_name in zip(dataframes, experiment_names):
        try:
            labels_acc = (df["labels_score"] == "correct").mean()
            priority_acc = (df["priority_score"] == "correct").mean()
            print(f"{exp_name} Labels Accuracy: {labels_acc:.2%}")
            print(f"{exp_name} Priority Accuracy: {priority_acc:.2%}")
        except Exception:
            pass

    return output_path
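The alignment step can be tried on small in-memory DataFrames. The sketch below is a reduced illustration of the Series.map lookup used above, with made-up data and only the labels_score column; it is not a replacement for the full function.

```python
import pandas as pd

baseline = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "experiment_name": ["baseline"] * 3,
        "labels_score": ["correct", "incorrect", "incorrect"],
    }
)
# Same IDs as the baseline, but in a different row order.
improved = pd.DataFrame(
    {
        "id": [3, 1, 2],
        "experiment_name": ["improved"] * 3,
        "labels_score": ["correct", "correct", "incorrect"],
    }
)

base_ids = baseline["id"].astype(str)
combined = baseline[["id"]].copy()

for df in (baseline, improved):
    name = str(df["experiment_name"].iloc[0])
    # Index each experiment by string ID, then look up scores in base order.
    by_id = df.assign(id=df["id"].astype(str)).set_index("id")
    combined[f"{name}_labels_score"] = base_ids.map(by_id["labels_score"])

print(combined)
```

Because every lookup goes through the ID, the row order of each input file has no effect on the combined table; only the first input determines the initial row order before sorting.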

CLI Integration

Source: examples/iterate_prompt/evals.py (lines 296-341)

The CLI provides two subcommands for running experiments and comparing results:

import argparse
import asyncio


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Support Triage Prompt Evaluation CLI"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # run subcommand
    run_parser = subparsers.add_parser("run", help="Run a single experiment")
    run_parser.add_argument(
        "--prompt_file", type=str, required=True,
        help="Prompt file to evaluate",
    )
    run_parser.add_argument(
        "--name", type=str, default=None,
        help="Experiment name (defaults to prompt filename)",
    )

    # compare subcommand
    cmp_parser = subparsers.add_parser(
        "compare", help="Combine multiple experiment CSVs"
    )
    cmp_parser.add_argument(
        "--inputs", nargs="+", required=True,
        help="Input CSV files to compare",
    )
    cmp_parser.add_argument(
        "--output", type=str, default=None,
        help="Output CSV path (defaults to experiments/<timestamp>-comparison.csv)",
    )

    return parser


if __name__ == "__main__":
    parser = build_parser()
    args = parser.parse_args()

    if args.command == "run":
        asyncio.run(run_command(prompt_file=args.prompt_file, name=args.name))
    elif args.command == "compare":
        compare_command(inputs=args.inputs, output=args.output)
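To see how the compare subcommand collects its arguments, the parser can be exercised with an explicit argument list instead of the command line. This is a trimmed sketch that rebuilds only the compare subparser; the file paths are the ones from the Usage section.

```python
import argparse

parser = argparse.ArgumentParser(description="compare subcommand sketch")
subparsers = parser.add_subparsers(dest="command", required=True)

cmp_parser = subparsers.add_parser("compare")
cmp_parser.add_argument("--inputs", nargs="+", required=True)
cmp_parser.add_argument("--output", type=str, default=None)

# Passing a list to parse_args() avoids reading sys.argv.
args = parser.parse_args(
    ["compare", "--inputs", "experiments/baseline.csv", "experiments/improved.csv"]
)
print(args.command, args.inputs, args.output)
```

Note that nargs="+" accepts a single file at the parsing stage; the two-file minimum is enforced later, inside compare_inputs_to_output().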

Key Observations

  • ID-based alignment is mandatory: The comparison function raises a ValueError if any input CSV lacks an id column. This enforces the principle that per-example alignment is essential for meaningful comparison.
  • Strict ID set matching: All experiment CSVs must contain exactly the same set of IDs. This prevents misleading comparisons where experiments were run on different subsets of data.
  • Duplicate detection: The function catches duplicate IDs within a single experiment, preventing data integrity issues that would silently corrupt the merged output.
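  • Experiment name extraction: The first value of the experiment_name column in each CSV is used to prefix output columns (e.g., baseline_labels_score, improved_labels_score), creating a self-documenting comparison table. Because the prefix determines the column names, two inputs that share an experiment name would write to the same output columns, so each experiment should be given a distinct name.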
  • Scalable to N experiments: While the typical use case is comparing two prompt versions, the compare_inputs_to_output() function accepts any number of input CSVs (two or more), enabling multi-version comparison across an entire prompt iteration history.
  • Aggregate and per-example analysis: The function provides both a summary printout (per-experiment accuracy percentages) and the detailed combined CSV for per-example drill-down.
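The per-example drill-down mentioned in the last point is not performed by the comparison function itself, but it follows directly from the combined layout. A minimal sketch, assuming two experiments named baseline and improved and using illustrative values in place of a real comparison.csv:

```python
import pandas as pd

# Stand-in for pd.read_csv("comparison.csv"); the values are illustrative only.
combined = pd.DataFrame(
    {
        "id": ["t1", "t2", "t3", "t4"],
        "baseline_labels_score": ["incorrect", "correct", "correct", "incorrect"],
        "improved_labels_score": ["correct", "correct", "incorrect", "correct"],
    }
)

before = combined["baseline_labels_score"] == "correct"
after = combined["improved_labels_score"] == "correct"

fixed = combined.loc[~before & after, "id"].tolist()      # newly correct
regressed = combined.loc[before & ~after, "id"].tolist()  # newly incorrect
delta = after.mean() - before.mean()                      # change in accuracy

print(f"fixed={fixed} regressed={regressed} delta={delta:+.2%}")
```

Listing regressions separately from fixes matters because a prompt change can raise overall accuracy while still breaking samples that previously passed.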
