Implementation:Vibrantlabsai Ragas Iterate Prompt Evals

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Prompt Iteration, Support Triage
Last Updated: 2026-02-12 00:00 GMT

Overview

This script is a command-line evaluation harness for iterating on prompt files. It runs support triage experiments scored with Ragas metrics and compares the results across multiple prompt variants.

Description

The evals.py module implements an end-to-end prompt evaluation workflow for a support triage use case. It defines two discrete metrics -- labels_exact_match and priority_accuracy -- that assess whether an LLM's JSON-structured predictions match expected label sets and priority values. The module leverages the Ragas @experiment decorator to orchestrate asynchronous evaluation runs over a dataset loaded from CSV.

The script supports two CLI subcommands:

  • run -- Executes a single experiment against a specified prompt file. It loads the support triage dataset, invokes the prompt via the imported run_prompt function, scores each prediction using the two discrete metrics, and saves per-row results (including response text, predicted/expected values, and metric scores) to a timestamped CSV file under an experiments/ directory.
  • compare -- Combines multiple experiment CSV files into a single comparison CSV. It aligns rows by id, validates that all inputs share the same ID set with no duplicates, and appends per-experiment columns for response text and metric scores. It also prints per-experiment accuracy summaries for both labels and priority.
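The alignment and validation steps of compare can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in evals.py: it reads CSV text from memory rather than from files, the helper names read_rows_by_id, combine_runs, and accuracy are hypothetical, and prefixing each column with the run name is an assumed naming scheme. The column names response, labels_score, and priority_score follow the experiment output described on this page.

```python
import csv
import io


def read_rows_by_id(csv_text: str) -> dict:
    """Index CSV rows by their "id" column, rejecting duplicate IDs."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["id"] in rows:
            raise ValueError(f"duplicate id: {row['id']}")
        rows[row["id"]] = row
    return rows


def combine_runs(runs: dict) -> list:
    """Merge {run_name: csv_text} into one row per id with per-run columns."""
    indexed = {name: read_rows_by_id(text) for name, text in runs.items()}
    names = list(indexed)
    base_ids = set(indexed[names[0]])
    for name in names[1:]:
        # Every run must cover exactly the same set of IDs.
        if set(indexed[name]) != base_ids:
            raise ValueError(f"id set of {name!r} differs from {names[0]!r}")
    combined = []
    for row_id in indexed[names[0]]:  # keep the first run's row order
        out = {"id": row_id}
        for name in names:
            row = indexed[name][row_id]
            # Hypothetical column naming: "<run name>_<field>".
            out[f"{name}_response"] = row["response"]
            out[f"{name}_labels_score"] = row["labels_score"]
            out[f"{name}_priority_score"] = row["priority_score"]
        combined.append(out)
    return combined


def accuracy(rows: list, column: str) -> float:
    """Fraction of rows whose score column equals "correct"."""
    if not rows:
        return 0.0
    return sum(r[column] == "correct" for r in rows) / len(rows)
```

Indexing each run by id before merging keeps the comparison independent of row order in the input files, which is why the ID-set check has to happen before any columns are appended.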

Key implementation details include order-independent label comparison via set equality, JSON parse error handling that defaults to "incorrect" scores, and timestamped file naming for both individual runs and comparison outputs.
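The scoring rules described above can be illustrated with a small, dependency-free sketch. It deliberately omits the Ragas @discrete_metric decorator and MetricResult type and returns a plain (value, reason) tuple instead. The function names parse_prediction, score_labels, and score_priority are hypothetical, and the whitespace stripping is an assumption rather than confirmed behavior of evals.py.

```python
import json
from typing import Optional


def parse_prediction(prediction: str) -> Optional[dict]:
    """Return the decoded JSON object, or None if it is not a JSON object."""
    try:
        data = json.loads(prediction)
    except (json.JSONDecodeError, TypeError):
        return None
    return data if isinstance(data, dict) else None


def score_labels(prediction: str, expected_labels: str) -> tuple:
    """Compare predicted and expected labels as sets, so order is ignored."""
    data = parse_prediction(prediction)
    if data is None or not isinstance(data.get("labels"), list):
        # Parse failures default to "incorrect" rather than raising.
        return "incorrect", "prediction has no JSON list under 'labels'"
    predicted = {str(item).strip() for item in data["labels"]}
    # Expected labels arrive as a semicolon-separated string.
    expected = {part.strip() for part in expected_labels.split(";") if part.strip()}
    if predicted == expected:
        return "correct", f"labels match: {sorted(expected)}"
    return "incorrect", f"expected {sorted(expected)}, got {sorted(predicted)}"


def score_priority(prediction: str, expected_priority: str) -> tuple:
    """Compare the predicted priority with the expected one."""
    data = parse_prediction(prediction)
    if data is None or "priority" not in data:
        return "incorrect", "prediction has no 'priority' key"
    predicted = str(data["priority"]).strip()
    if predicted == expected_priority.strip():
        return "correct", f"priority matches: {predicted}"
    return "incorrect", f"expected {expected_priority.strip()}, got {predicted}"
```

Treating a malformed response as "incorrect" keeps a single bad LLM output from aborting the whole run, at the cost of folding formatting failures and classification failures into the same score.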

Usage

Import this module or run it as a CLI script when you need to evaluate and compare different prompt files for a support triage classification task. Use the run subcommand to execute a single prompt evaluation, or compare to aggregate results from multiple runs into a side-by-side comparison.

Code Reference

Source Location

Signature

@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str) -> MetricResult

@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str) -> MetricResult

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str) -> dict

def load_dataset() -> Dataset

def compare_inputs_to_output(inputs: List[str], output_path: Optional[str] = None) -> str

async def run_command(prompt_file: str, name: Optional[str]) -> None

def compare_command(inputs: List[str], output: Optional[str]) -> None

def build_parser() -> argparse.ArgumentParser

Import

from examples.iterate_prompt.evals import (
    labels_exact_match,
    priority_accuracy,
    support_triage_experiment,
    load_dataset,
    compare_inputs_to_output,
)

I/O Contract

Inputs

labels_exact_match

Name | Type | Required | Description
prediction | str | Yes | JSON string containing a "labels" key with a list of predicted label strings
expected_labels | str | Yes | Semicolon-separated string of expected labels

priority_accuracy

Name | Type | Required | Description
prediction | str | Yes | JSON string containing a "priority" key with the predicted priority value
expected_priority | str | Yes | The expected priority string to compare against

support_triage_experiment

Name | Type | Required | Description
row | dict | Yes | A dataset row containing "id", "text", "labels", and "priority" fields
prompt_file | str | Yes | Path to the prompt file to use for the LLM call
experiment_name | str | Yes | Name identifier for the experiment run

compare_inputs_to_output

Name | Type | Required | Description
inputs | List[str] | Yes | List of at least two CSV file paths from experiment runs to compare
output_path | Optional[str] | No | Output CSV path; defaults to a timestamped file under experiments/
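The default for output_path can be pictured with the sketch below. The helper name default_output_path, the now parameter (added only to make the result reproducible), and the exact file name pattern are all assumptions; this page states only that the default is a timestamped file under experiments/.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_output_path(
    output_path: Optional[str] = None,
    directory: str = "experiments",
    now: Optional[datetime] = None,
) -> str:
    """Return the caller's path if given, else a timestamped CSV path."""
    if output_path:
        return output_path
    # Hypothetical naming pattern: "<YYYYmmdd-HHMMSS>-comparison.csv".
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return (Path(directory) / f"{stamp}-comparison.csv").as_posix()
```

Putting the timestamp first makes files sort chronologically in a directory listing, so successive comparison outputs do not overwrite one another.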

Outputs

labels_exact_match / priority_accuracy

Name | Type | Description
return | MetricResult | Object with value ("correct" or "incorrect") and a reason string

support_triage_experiment

Name | Type | Description
return | dict | Dictionary with id, text, response, experiment_name, expected/predicted labels and priority, plus labels_score and priority_score
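To make the shape of this dictionary concrete, here is a hedged sketch of how one result row might be assembled. It does not use Ragas or a real LLM: fake_run_prompt is a hypothetical stand-in for the imported run_prompt (whose real signature is not shown on this page), triage_row is a hypothetical name, and the keys expected_labels, predicted_labels, expected_priority, and predicted_priority are assumed spellings of the "expected/predicted" fields.

```python
import asyncio
import json


async def fake_run_prompt(text: str, prompt_file: str) -> str:
    """Hypothetical stand-in for run_prompt; returns a canned JSON reply."""
    return json.dumps({"labels": ["billing", "bug"], "priority": "P1"})


def _verdict(is_match: bool) -> str:
    return "correct" if is_match else "incorrect"


async def triage_row(row: dict, prompt_file: str, experiment_name: str) -> dict:
    """Build one result row: call the prompt, parse the reply, score it."""
    response = await fake_run_prompt(row["text"], prompt_file)
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        parsed = None
    parsed_ok = isinstance(parsed, dict)
    data = parsed if parsed_ok else {}

    raw_labels = data.get("labels")
    predicted_labels = [str(x) for x in raw_labels] if isinstance(raw_labels, list) else []
    expected_labels = [p.strip() for p in row["labels"].split(";") if p.strip()]
    predicted_priority = str(data.get("priority", ""))

    return {
        "id": row["id"],
        "text": row["text"],
        "response": response,
        "experiment_name": experiment_name,
        "expected_labels": row["labels"],
        "predicted_labels": ";".join(predicted_labels),
        "expected_priority": row["priority"],
        "predicted_priority": predicted_priority,
        # A reply that failed to parse is always scored "incorrect".
        "labels_score": _verdict(parsed_ok and set(predicted_labels) == set(expected_labels)),
        "priority_score": _verdict(parsed_ok and predicted_priority == row["priority"]),
    }
```

A row can be exercised on its own with asyncio.run(triage_row(row, "prompts/v1.txt", "baseline")), which is convenient for checking the output columns before running a full dataset.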

compare_inputs_to_output

Name | Type | Description
return | str | Full file path of the generated comparison CSV

Usage Examples

Running an Experiment via CLI

# Run a single prompt evaluation
python evals.py run --prompt_file prompts/v1.txt --name baseline

# Compare two experiment results
python evals.py compare --inputs experiments/run1.csv experiments/run2.csv --output comparison.csv
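The two invocations above imply an argparse layout along the following lines. This is a sketch inferred from the flags shown (--prompt_file, --name, --inputs, --output), not the actual build_parser from evals.py: the function name build_parser_sketch, the help strings, and which flags are required are assumptions.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Build a parser with "run" and "compare" subcommands."""
    parser = argparse.ArgumentParser(description="Support triage prompt evals")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # run: evaluate a single prompt file.
    run_parser = subparsers.add_parser("run", help="Run one experiment")
    run_parser.add_argument("--prompt_file", required=True, help="Prompt file to evaluate")
    run_parser.add_argument("--name", default=None, help="Optional experiment name")

    # compare: combine several experiment CSVs into one.
    compare_parser = subparsers.add_parser("compare", help="Combine experiment CSVs")
    compare_parser.add_argument("--inputs", nargs="+", required=True, help="Experiment CSV files")
    compare_parser.add_argument("--output", default=None, help="Optional output CSV path")
    return parser
```

With dest="command", the caller can branch on args.command to dispatch to the run or compare handler.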

Programmatic Usage

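import asyncio

# Assumes the working directory contains evals.py; from the repository root,
# import from examples.iterate_prompt.evals instead (see Import above).
from evals import load_dataset, support_triage_experiment, compare_inputs_to_output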

# Load the support triage dataset
dataset = load_dataset()

# Run an experiment asynchronously
results = asyncio.run(
    support_triage_experiment.arun(
        dataset,
        name="my-experiment",
        prompt_file="prompts/v1.txt",
        experiment_name="baseline",
    )
)

# Compare multiple experiment outputs
output_path = compare_inputs_to_output(
    inputs=["experiments/baseline.csv", "experiments/improved.csv"]
)
print(f"Comparison saved to: {output_path}")
