Overview
This script is a CLI evaluation harness for iterating on prompt files: it runs support triage experiments scored with Ragas metrics and compares the results across prompt variants.
Description
The evals.py module implements an end-to-end prompt evaluation workflow for a support triage use case. It defines two discrete metrics, labels_exact_match and priority_accuracy, which check whether the labels and priority in an LLM's JSON-structured prediction match the expected label set and priority value. The module uses the Ragas @experiment decorator to orchestrate asynchronous evaluation runs over a dataset loaded from CSV.
The script supports two CLI subcommands:
- run -- Executes a single experiment against a specified prompt file. It loads the support triage dataset, invokes the prompt via the imported run_prompt function, scores each prediction using the two discrete metrics, and saves per-row results (including response text, predicted/expected values, and metric scores) to a timestamped CSV file under an experiments/ directory.
- compare -- Combines multiple experiment CSV files into a single comparison CSV. It aligns rows by id, validates that all inputs share the same ID set with no duplicates, and appends per-experiment columns for response text and metric scores. It also prints per-experiment accuracy summaries for both labels and priority.
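The alignment and validation that compare performs can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the module's actual code: the helper names load_rows_by_id, check_same_ids, and accuracy are hypothetical, the CSV contents are made-up stand-ins for experiment files, and only the id and labels_score columns from the description above are used.

```python
import csv
import io


def load_rows_by_id(csv_text: str, source: str) -> dict:
    """Index CSV rows by their "id" column, rejecting duplicate ids."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["id"] in rows:
            raise ValueError(f"duplicate id {row['id']!r} in {source}")
        rows[row["id"]] = row
    return rows


def check_same_ids(experiments: dict) -> list:
    """Return the shared ids in sorted order, or raise if the inputs disagree."""
    names = list(experiments)
    reference = set(experiments[names[0]])
    for name in names[1:]:
        if set(experiments[name]) != reference:
            raise ValueError(f"{name} does not cover the same ids as {names[0]}")
    return sorted(reference)


def accuracy(rows: dict, column: str) -> float:
    """Fraction of rows whose score column equals "correct"."""
    return sum(r[column] == "correct" for r in rows.values()) / len(rows)


# Hypothetical in-memory stand-ins for two experiment CSV files.
baseline = "id,labels_score\n1,correct\n2,incorrect\n"
improved = "id,labels_score\n2,correct\n1,correct\n"

experiments = {
    "baseline": load_rows_by_id(baseline, "baseline"),
    "improved": load_rows_by_id(improved, "improved"),
}
shared_ids = check_same_ids(experiments)  # row order in each file does not matter
for name, rows in experiments.items():
    print(name, f"labels accuracy: {accuracy(rows, 'labels_score'):.2f}")
```

Indexing each file by id before merging means a reordered file still lines up, while a missing or duplicated id fails loudly instead of silently shifting rows.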
Key implementation details include order-independent label comparison via set equality, JSON parse error handling that defaults to "incorrect" scores, and timestamped file naming for both individual runs and comparison outputs.
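The scoring behaviour described here can be illustrated with a small sketch. It assumes plain functions that return a (value, reason) tuple instead of a Ragas MetricResult, so that it runs without Ragas installed; the names score_labels and score_priority, the whitespace stripping, and the sample labels and priorities are all hypothetical.

```python
import json


def score_labels(prediction: str, expected_labels: str) -> tuple:
    """Return ("correct" | "incorrect", reason) for a label prediction."""
    try:
        predicted = {str(label) for label in json.loads(prediction).get("labels", [])}
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Unparseable or non-object output counts as a miss, not a crash.
        return "incorrect", "prediction is not a JSON object with a list of labels"
    expected = {part.strip() for part in expected_labels.split(";") if part.strip()}
    # Set equality ignores ordering (and repeated labels).
    if predicted == expected:
        return "correct", f"labels match: {sorted(expected)}"
    return "incorrect", f"expected {sorted(expected)}, got {sorted(predicted)}"


def score_priority(prediction: str, expected_priority: str) -> tuple:
    """Return ("correct" | "incorrect", reason) for a priority prediction."""
    try:
        predicted = json.loads(prediction).get("priority")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return "incorrect", "prediction is not a JSON object"
    if predicted == expected_priority:
        return "correct", f"priority matches: {expected_priority}"
    return "incorrect", f"expected {expected_priority!r}, got {predicted!r}"


# Label order differs from the expected string, but the sets are equal.
print(score_labels('{"labels": ["billing", "refund"]}', "refund;billing"))
# Malformed JSON is scored as "incorrect" rather than raising.
print(score_labels("not json", "billing"))
print(score_priority('{"priority": "P1"}', "P1"))
```

Returning "incorrect" on a parse failure keeps one malformed response from aborting a whole run, at the cost of not distinguishing formatting errors from wrong answers in the score itself; the reason string carries that distinction.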
Usage
Import this module or run it as a CLI script when you need to evaluate and compare different prompt files for a support triage classification task. Use the run subcommand to evaluate a single prompt file, and the compare subcommand to aggregate the results of multiple runs into a side-by-side comparison.
Code Reference
Signature
@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str) -> MetricResult
@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str) -> MetricResult
@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str) -> dict
def load_dataset() -> Dataset
def compare_inputs_to_output(inputs: List[str], output_path: Optional[str] = None) -> str
async def run_command(prompt_file: str, name: Optional[str]) -> None
def compare_command(inputs: List[str], output: Optional[str]) -> None
def build_parser() -> argparse.ArgumentParser
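As a rough idea of how a parser with these two subcommands might be wired up with argparse, consider the sketch below. The flag names come from the CLI usage examples on this page; the function name build_parser_sketch, the required/default settings, and the help strings are assumptions rather than the module's actual definitions.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Build a parser with hypothetical 'run' and 'compare' subcommands."""
    parser = argparse.ArgumentParser(description="Support triage prompt evaluation")
    subparsers = parser.add_subparsers(dest="command", required=True)

    run = subparsers.add_parser("run", help="Run one experiment for a prompt file")
    run.add_argument("--prompt_file", required=True, help="Path to the prompt file")
    run.add_argument("--name", default=None, help="Optional experiment name")

    compare = subparsers.add_parser("compare", help="Combine experiment CSVs")
    # nargs="+" collects one or more paths into a list.
    compare.add_argument("--inputs", nargs="+", required=True, help="Experiment CSV paths")
    compare.add_argument("--output", default=None, help="Optional output CSV path")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser_sketch().parse_args(
    ["run", "--prompt_file", "prompts/v1.txt", "--name", "baseline"]
)
print(args.command, args.prompt_file, args.name)
```

Storing the chosen subcommand in args.command lets a main function dispatch to run_command or compare_command with a simple if/else.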
Import
from examples.iterate_prompt.evals import (
    labels_exact_match,
    priority_accuracy,
    support_triage_experiment,
    load_dataset,
    compare_inputs_to_output,
)
I/O Contract
Inputs
labels_exact_match
| Name | Type | Required | Description |
|------|------|----------|-------------|
| prediction | str | Yes | JSON string containing a "labels" key with a list of predicted label strings |
| expected_labels | str | Yes | Semicolon-separated string of expected labels |
priority_accuracy
| Name | Type | Required | Description |
|------|------|----------|-------------|
| prediction | str | Yes | JSON string containing a "priority" key with the predicted priority value |
| expected_priority | str | Yes | The expected priority string to compare against |
support_triage_experiment
| Name | Type | Required | Description |
|------|------|----------|-------------|
| row | dict | Yes | A dataset row containing "id", "text", "labels", and "priority" fields |
| prompt_file | str | Yes | Path to the prompt file to use for the LLM call |
| experiment_name | str | Yes | Name identifier for the experiment run |
compare_inputs_to_output
| Name | Type | Required | Description |
|------|------|----------|-------------|
| inputs | List[str] | Yes | List of at least two CSV file paths from experiment runs to compare |
| output_path | Optional[str] | No | Output CSV path; defaults to a timestamped file under experiments/ |
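The fallback to a timestamped file could look something like the sketch below. The helper name default_comparison_path, the timestamp format, and the "-comparison.csv" suffix are assumptions for illustration; the actual naming scheme used by the module is not documented here.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_comparison_path(
    output_path: Optional[str] = None, now: Optional[datetime] = None
) -> str:
    """Return output_path if given, else a timestamped path under experiments/."""
    if output_path is not None:
        return output_path
    # Assumed format: a sortable date-time stamp so later runs sort after earlier ones.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return (Path("experiments") / f"{stamp}-comparison.csv").as_posix()


# An explicit path is passed through unchanged.
print(default_comparison_path("comparison.csv"))
# With no path, the name is derived from the current time.
print(default_comparison_path())
```

Accepting an optional now argument is a design choice made for this sketch so the function can be tested with a fixed time.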
Outputs
labels_exact_match / priority_accuracy
| Name | Type | Description |
|------|------|-------------|
| return | MetricResult | Object with value ("correct" or "incorrect") and a reason string |
support_triage_experiment
| Name | Type | Description |
|------|------|-------------|
| return | dict | Dictionary with id, text, response, experiment_name, expected/predicted labels and priority, plus labels_score and priority_score |
compare_inputs_to_output
| Name | Type | Description |
|------|------|-------------|
| return | str | Full file path of the generated comparison CSV |
Usage Examples
Running an Experiment via CLI
# Run a single prompt evaluation
python evals.py run --prompt_file prompts/v1.txt --name baseline
# Compare two experiment results
python evals.py compare --inputs experiments/run1.csv experiments/run2.csv --output comparison.csv
Programmatic Usage
import asyncio
from evals import load_dataset, support_triage_experiment, compare_inputs_to_output

# Load the support triage dataset
dataset = load_dataset()

# Run an experiment asynchronously
results = asyncio.run(
    support_triage_experiment.arun(
        dataset,
        name="my-experiment",
        prompt_file="prompts/v1.txt",
        experiment_name="baseline",
    )
)

# Compare multiple experiment outputs
output_path = compare_inputs_to_output(
    inputs=["experiments/baseline.csv", "experiments/improved.csv"]
)
print(f"Comparison saved to: {output_path}")