Implementation:Vibrantlabsai Ragas Iterate Prompt Evals

Knowledge Sources	Vibrantlabsai_Ragas
Domains	LLM Evaluation, Prompt Iteration, Support Triage
Last Updated	2026-02-12 00:00 GMT

Overview

This script is a command-line evaluation harness for iterating on prompt files: it runs support triage experiments scored with Ragas metrics and compares the results across prompt variants.

Description

The evals.py module implements an end-to-end prompt evaluation workflow for a support triage use case. It defines two discrete metrics -- labels_exact_match and priority_accuracy -- that assess whether an LLM's JSON-structured predictions match expected label sets and priority values. The module leverages the Ragas @experiment decorator to orchestrate asynchronous evaluation runs over a dataset loaded from CSV.

The script supports two CLI subcommands:

run -- Executes a single experiment against a specified prompt file. It loads the support triage dataset, invokes the prompt via the imported run_prompt function, scores each prediction using the two discrete metrics, and saves per-row results (including response text, predicted/expected values, and metric scores) to a timestamped CSV file under an experiments/ directory.

compare -- Combines multiple experiment CSV files into a single comparison CSV. It aligns rows by id, validates that all inputs share the same ID set with no duplicates, and appends per-experiment columns for response text and metric scores. It also prints per-experiment accuracy summaries for both labels and priority.
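The alignment and validation that compare performs can be sketched with plain dictionaries. This is a minimal illustration, not the module's code: the helper name align_by_id and the per-experiment column naming (for example base_response) are assumptions made for the sketch. Only the id, response, labels_score, and priority_score fields come from the description above.

```python
from typing import Dict, List


def align_by_id(experiments: Dict[str, List[dict]]) -> List[dict]:
    """Merge per-experiment rows into one row per id (hypothetical helper)."""
    if not experiments:
        raise ValueError("at least one experiment is required")

    indexed = {}
    for name, rows in experiments.items():
        ids = [row["id"] for row in rows]
        # Reject inputs in which the same id appears more than once.
        if len(ids) != len(set(ids)):
            raise ValueError(f"duplicate ids in experiment {name!r}")
        indexed[name] = {row["id"]: row for row in rows}

    # Every experiment must cover exactly the same set of ids.
    id_sets = [set(by_id) for by_id in indexed.values()]
    if any(ids != id_sets[0] for ids in id_sets[1:]):
        raise ValueError("experiments do not share the same id set")

    combined = []
    for row_id in sorted(id_sets[0]):
        merged = {"id": row_id}
        for name, by_id in indexed.items():
            row = by_id[row_id]
            # Column naming below is an assumption made for this sketch.
            merged[f"{name}_response"] = row["response"]
            merged[f"{name}_labels_score"] = row["labels_score"]
            merged[f"{name}_priority_score"] = row["priority_score"]
        combined.append(merged)
    return combined
```

Failing fast on duplicate or mismatched ids keeps the side-by-side columns trustworthy, since every output row then describes the same dataset item in every experiment.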

Key implementation details include order-independent label comparison via set equality, JSON parse error handling that defaults to "incorrect" scores, and timestamped file naming for both individual runs and comparison outputs.
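The label comparison described above can be illustrated without the Ragas decorator. The sketch below is an assumption-laden stand-in: score_labels is a hypothetical name, it returns a plain (value, reason) tuple instead of a MetricResult, and stripping whitespace around each semicolon-separated label is a choice made here rather than something the source confirms.

```python
import json
from typing import Tuple


def score_labels(prediction: str, expected_labels: str) -> Tuple[str, str]:
    """Return (value, reason), mirroring the two MetricResult fields."""
    try:
        # Raises if the text is not JSON, is not an object, or has a
        # "labels" value that cannot be turned into a set.
        predicted = set(json.loads(prediction).get("labels", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        return "incorrect", "prediction is not a JSON object with a label list"

    # Expected labels arrive as one semicolon-separated string.
    expected = {part.strip() for part in expected_labels.split(";") if part.strip()}

    # Set equality ignores ordering and repeated labels.
    if predicted == expected:
        return "correct", f"labels match: {sorted(expected)}"
    return "incorrect", f"expected {sorted(expected)}, got {sorted(map(str, predicted))}"
```

Treating unparseable output as "incorrect" rather than raising keeps a single malformed response from aborting the whole experiment run.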

Usage

Import this module or run it as a CLI script when you need to evaluate and compare different prompt files for a support triage classification task. Use the run subcommand to execute a single prompt evaluation, or compare to aggregate results from multiple runs into a side-by-side comparison.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: examples/iterate_prompt/evals.py

Signature

@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str) -> MetricResult

@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str) -> MetricResult

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str) -> dict

def load_dataset() -> Dataset

def compare_inputs_to_output(inputs: List[str], output_path: Optional[str] = None) -> str

async def run_command(prompt_file: str, name: Optional[str]) -> None

def compare_command(inputs: List[str], output: Optional[str]) -> None

def build_parser() -> argparse.ArgumentParser

Import

from examples.iterate_prompt.evals import (
    labels_exact_match,
    priority_accuracy,
    support_triage_experiment,
    load_dataset,
    compare_inputs_to_output,
)

I/O Contract

Inputs

labels_exact_match

Name	Type	Required	Description
prediction	str	Yes	JSON string containing a "labels" key with a list of predicted label strings
expected_labels	str	Yes	Semicolon-separated string of expected labels

priority_accuracy

Name	Type	Required	Description
prediction	str	Yes	JSON string containing a "priority" key with the predicted priority value
expected_priority	str	Yes	The expected priority string to compare against
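A comparable sketch for the priority check, under the same caveats: score_priority is a hypothetical name, the result is a plain tuple rather than a MetricResult, and the comparison is assumed to be an exact string match with no case normalization, which the source does not specify.

```python
import json
from typing import Tuple


def score_priority(prediction: str, expected_priority: str) -> Tuple[str, str]:
    """Return (value, reason) for a JSON prediction with a "priority" key."""
    try:
        predicted = json.loads(prediction).get("priority")
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Unparseable or non-object output is scored as incorrect.
        return "incorrect", "prediction is not a valid JSON object"

    if predicted == expected_priority:
        return "correct", f"priority matches: {expected_priority}"
    return "incorrect", f"expected {expected_priority!r}, got {predicted!r}"
```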

support_triage_experiment

Name	Type	Required	Description
row	dict	Yes	A dataset row containing "id", "text", "labels", and "priority" fields
prompt_file	str	Yes	Path to the prompt file to use for the LLM call
experiment_name	str	Yes	Name identifier for the experiment run

compare_inputs_to_output

Name	Type	Required	Description
inputs	List[str]	Yes	List of at least two CSV file paths from experiment runs to compare
output_path	Optional[str]	No	Output CSV path; defaults to a timestamped file under experiments/
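The default for output_path can be sketched as follows. The helper name, the timestamp format, and the "-comparison" suffix are all assumptions for illustration; the source states only that the default is a timestamped file under experiments/.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_comparison_path(
    output_path: Optional[str] = None,
    base_dir: str = "experiments",
    now: Optional[datetime] = None,
) -> str:
    """Return the caller's path, or a timestamped one under base_dir."""
    if output_path:
        return output_path
    # The format string is hypothetical; any sortable timestamp would do.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return str(Path(base_dir) / f"{stamp}-comparison.csv")
```

Accepting now as a parameter is a testing convenience in this sketch: it lets the generated name be checked deterministically.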

Outputs

labels_exact_match / priority_accuracy

Name	Type	Description
return	MetricResult	Object with value ("correct" or "incorrect") and a reason string

support_triage_experiment

Name	Type	Description
return	dict	Dictionary with id, text, response, experiment_name, expected/predicted labels and priority, plus labels_score and priority_score
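Given rows that carry labels_score and priority_score as described above, a per-experiment accuracy summary can be computed along these lines. This is a sketch only: accuracy_summary and its output keys are hypothetical names, and returning 0.0 for an empty run is a choice made here.

```python
from typing import Dict, List


def accuracy_summary(rows: List[dict]) -> Dict[str, float]:
    """Fraction of rows scored "correct" for each metric."""
    total = len(rows)
    if total == 0:
        # Avoid division by zero for an empty experiment.
        return {"labels_accuracy": 0.0, "priority_accuracy": 0.0}

    labels_ok = sum(1 for row in rows if row["labels_score"] == "correct")
    priority_ok = sum(1 for row in rows if row["priority_score"] == "correct")
    return {
        "labels_accuracy": labels_ok / total,
        "priority_accuracy": priority_ok / total,
    }
```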

compare_inputs_to_output

Name	Type	Description
return	str	Full file path of the generated comparison CSV

Usage Examples

Running an Experiment via CLI

# Run a single prompt evaluation
python evals.py run --prompt_file prompts/v1.txt --name baseline

# Compare two experiment results
python evals.py compare --inputs experiments/run1.csv experiments/run2.csv --output comparison.csv
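A parser that accepts the two commands above might look like the following. The flag names are taken from the CLI examples; which flags are required, their defaults, and the help strings are assumptions, and build_parser_sketch is a hypothetical name rather than the module's build_parser.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Two-subcommand parser mirroring the CLI examples (illustrative only)."""
    parser = argparse.ArgumentParser(description="Prompt evaluation harness (sketch)")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # run: evaluate one prompt file.
    run = subparsers.add_parser("run", help="Run one experiment against a prompt file")
    run.add_argument("--prompt_file", required=True, help="Path to the prompt file")
    run.add_argument("--name", default=None, help="Optional experiment name")

    # compare: combine several experiment CSVs.
    compare = subparsers.add_parser("compare", help="Combine experiment CSVs")
    compare.add_argument("--inputs", nargs="+", required=True, help="Experiment CSV paths")
    compare.add_argument("--output", default=None, help="Optional output CSV path")
    return parser
```

With this sketch, `build_parser_sketch().parse_args(["run", "--prompt_file", "prompts/v1.txt"])` yields a namespace whose command attribute selects the handler to call.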

Programmatic Usage

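import asyncio

# Assumes the working directory is examples/iterate_prompt; from the repository
# root, import from examples.iterate_prompt.evals instead.
from evals import load_dataset, support_triage_experiment, compare_inputs_to_output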

# Load the support triage dataset
dataset = load_dataset()

# Run an experiment asynchronously
results = asyncio.run(
    support_triage_experiment.arun(
        dataset,
        name="my-experiment",
        prompt_file="prompts/v1.txt",
        experiment_name="baseline",
    )
)

# Compare multiple experiment outputs
output_path = compare_inputs_to_output(
    inputs=["experiments/baseline.csv", "experiments/improved.csv"]
)
print(f"Comparison saved to: {output_path}")

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
