Overview
This script demonstrates how to evaluate AG-UI protocol agents with Ragas metrics via the @experiment decorator pattern. It covers both single-turn factual Q&A and multi-turn tool-calling scenarios.
Description
The experiments.py module implements two end-to-end experiment scenarios for evaluating AG-UI compatible agents:
Scientist Biographies (Single-turn) -- Tests an agent's ability to provide factually correct and relevant information about scientists. This experiment loads a CSV dataset of scientist biographies and evaluates agent responses using three metrics:
- FactualCorrectness -- LLM-based metric measuring factual accuracy with configurable atomicity and coverage settings (both set to "high"), using F1 scoring mode.
- AnswerRelevancy -- LLM-and-embedding-based metric measuring response relevance to the user input, with strictness set to 2.
- DiscreteMetric (conciseness) -- A custom discrete metric that classifies responses as either "verbose" or "concise".
Weather Tool Usage (Multi-turn) -- Tests an agent's ability to correctly invoke weather tools and achieve user goals. This experiment loads a CSV dataset of weather tool call scenarios and evaluates using:
- ToolCallF1 -- Rule-based metric that computes F1 score for tool call accuracy by comparing predicted tool calls against reference tool calls parsed from JSON.
- AgentGoalAccuracyWithReference -- LLM-based metric assessing whether the agent achieved the user's goal given a reference answer.
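To make the rule-based tool-call comparison concrete, here is a minimal sketch of an F1 computation over tool calls. It is not the Ragas ToolCallF1 implementation. It assumes each reference is stored as a JSON array of objects with "name" and "args" keys and that a call only counts as a match when both name and arguments are identical. The helper names (parse_reference_tool_calls, tool_call_f1) and the get_weather tool are hypothetical and used only for illustration.

```python
import json
from collections import Counter


def parse_reference_tool_calls(raw: str) -> list[dict]:
    """Parse a JSON array of {"name": ..., "args": {...}} objects (assumed schema)."""
    calls = json.loads(raw)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON array of tool calls")
    return calls


def _call_key(call: dict) -> tuple[str, str]:
    # Canonical, hashable form: tool name plus the args serialized with sorted keys.
    return call["name"], json.dumps(call.get("args", {}), sort_keys=True)


def tool_call_f1(predicted: list[dict], reference: list[dict]) -> float:
    """F1 over exact (name, args) matches; returns 0.0 when nothing matches."""
    pred = Counter(_call_key(c) for c in predicted)
    ref = Counter(_call_key(c) for c in reference)
    # Multiset intersection counts each matched call at most once per side.
    true_positives = sum((pred & ref).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred.values())
    recall = true_positives / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical row: one reference call, two predicted calls (one of them spurious).
reference = parse_reference_tool_calls(
    '[{"name": "get_weather", "args": {"location": "Paris"}}]'
)
predicted = [
    {"name": "get_weather", "args": {"location": "Paris"}},
    {"name": "get_weather", "args": {"location": "London"}},
]
score = tool_call_f1(predicted, reference)
```

In this example precision is 1/2 and recall is 1/1, so the score is 2/3: the spurious call lowers precision without affecting recall.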
Both experiments use the run_ag_ui_row integration function to call the AG-UI endpoint and enrich dataset rows with agent responses. The module creates fresh evaluator LLM and embedding components for each experiment via create_evaluator_components, which instantiates AsyncOpenAI clients and wraps them with Ragas llm_factory and embedding_factory.
The CLI supports --endpoint-url for specifying the AG-UI agent endpoint, --evaluator-model for the OpenAI model used in evaluation (defaults to gpt-4o-mini), and --skip-factual / --skip-tool-experiment flags to selectively run experiments. An embeddings sanity check runs before experiments begin to verify OpenAI API connectivity.
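The flags described above could be wired up roughly as follows. This is a sketch reconstructed from the documented arguments and defaults, not the parser in experiments.py. The function names build_parser and experiments_to_run, the help strings, and the experiment labels are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Evaluate an AG-UI agent with Ragas metrics."
    )
    parser.add_argument(
        "--endpoint-url",
        default="http://localhost:8000",
        help="AG-UI endpoint URL",
    )
    parser.add_argument(
        "--evaluator-model",
        default="gpt-4o-mini",
        help="OpenAI model used by the evaluator",
    )
    parser.add_argument(
        "--skip-factual",
        action="store_true",
        help="Skip the scientist biographies experiment",
    )
    parser.add_argument(
        "--skip-tool-experiment",
        action="store_true",
        help="Skip the weather tool usage experiment",
    )
    return parser


def experiments_to_run(args: argparse.Namespace) -> list[str]:
    """Translate the skip flags into the list of experiments to execute."""
    selected = []
    if not args.skip_factual:
        selected.append("scientist_biographies")
    if not args.skip_tool_experiment:
        selected.append("weather_tool_usage")
    return selected
```

With no arguments both experiments are selected; passing both skip flags leaves nothing to run, which a real script would likely want to report to the user.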
Usage
Run this script when you need to evaluate an AG-UI-compatible agent for factual correctness, answer relevancy, tool call accuracy, or goal achievement. It requires an AG-UI agent running at a reachable endpoint and valid OpenAI API credentials.
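Before launching a run, it can help to check these prerequisites locally. The sketch below is a hypothetical helper, not part of experiments.py. It assumes credentials are supplied through the OPENAI_API_KEY environment variable (the variable the OpenAI Python client reads by default) and only validates the shape of the endpoint URL. It does not contact the agent or the OpenAI API.

```python
import os
from urllib.parse import urlparse


def missing_prerequisites(endpoint_url: str, env=None) -> list[str]:
    """Return human-readable problems; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: the evaluator's credentials come from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Shape check only: this does not verify that the agent is reachable.
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"endpoint URL is not a valid http(s) URL: {endpoint_url!r}")
    return problems
```

Passing a dictionary as env makes the check easy to exercise in tests without touching the real environment.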
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: examples/ragas_examples/ag_ui_agent_experiments/experiments.py
Signature
def load_scientist_dataset() -> Dataset
def load_weather_dataset() -> Dataset
def create_evaluator_components(model_name: str) -> tuple
async def run_scientist_experiment(endpoint_url: str, evaluator_model: str) -> tuple
async def run_tool_experiment(endpoint_url: str, evaluator_model: str) -> tuple
async def main() -> None
Import
from ragas.dataset import Dataset
from ragas.experiment import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)
I/O Contract
Inputs
run_scientist_experiment
| Name | Type | Required | Description |
|---|---|---|---|
| endpoint_url | str | Yes | The AG-UI endpoint URL where the agent is running |
| evaluator_model | str | Yes | OpenAI model name for the evaluator LLM (e.g., "gpt-4o-mini") |
run_tool_experiment
| Name | Type | Required | Description |
|---|---|---|---|
| endpoint_url | str | Yes | The AG-UI endpoint URL where the agent is running |
| evaluator_model | str | Yes | OpenAI model name for the evaluator LLM (e.g., "gpt-4o-mini") |
create_evaluator_components
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | OpenAI model name to use for the evaluator LLM |
CLI Arguments
| Name | Type | Required | Description |
|---|---|---|---|
| --endpoint-url | str | No | AG-UI endpoint URL (default: http://localhost:8000) |
| --evaluator-model | str | No | OpenAI model for evaluation (default: gpt-4o-mini) |
| --skip-factual | flag | No | Skip the scientist biographies experiment |
| --skip-tool-experiment | flag | No | Skip the weather tool usage experiment |
Outputs
run_scientist_experiment
| Name | Type | Description |
|---|---|---|
| return | tuple | Tuple of (Experiment result, pandas DataFrame) containing per-row factual_correctness, answer_relevancy, and conciseness scores |
run_tool_experiment
| Name | Type | Description |
|---|---|---|
| return | tuple | Tuple of (Experiment result, pandas DataFrame) containing per-row tool_call_f1 and agent_goal_accuracy scores |
create_evaluator_components
| Name | Type | Description |
|---|---|---|
| return | tuple | Tuple of (evaluator_llm, evaluator_embeddings) -- Ragas LLM wrapper and embedding wrapper |
Usage Examples
Running via CLI
# Run all experiments against a local agent
python experiments.py --endpoint-url http://localhost:8000/chat
# Run only the tool experiment
python experiments.py --endpoint-url http://localhost:8000/chat --skip-factual
# Run only the scientist experiment with a specific evaluator model
python experiments.py --endpoint-url http://localhost:8000 --skip-tool-experiment --evaluator-model gpt-4o
Programmatic Usage
import asyncio
from experiments import run_scientist_experiment, run_tool_experiment
# Run the scientist biographies experiment
result, df = asyncio.run(
    run_scientist_experiment(
        endpoint_url="http://localhost:8000/chat",
        evaluator_model="gpt-4o-mini",
    )
)
# Inspect results
print(df[["user_input", "factual_correctness", "answer_relevancy", "conciseness"]])
print(f"Average factual correctness: {df['factual_correctness'].mean():.4f}")