Implementation:Explodinggradients Ragas AG UI Experiments Module
| Field | Value |
|---|---|
| source | Explodinggradients_Ragas (https://github.com/explodinggradients/ragas) |
| domains | Examples, Agent_Evaluation |
| last_updated | 2026-02-10 00:00 GMT |
Overview
An example module that demonstrates how to run AG-UI agent evaluation experiments with the Ragas @experiment decorator pattern, scored with factual correctness, answer relevancy, conciseness, tool-call F1, and agent goal accuracy metrics.
Description
The experiments.py module defines two experiment scenarios for evaluating agents built with the AG-UI protocol. The Scientist Biographies experiment is a single-turn Q&A test that measures factual correctness, answer relevancy, and conciseness. The Weather Tool Usage experiment is a multi-turn test that measures tool-call F1 and agent goal accuracy against a reference. Each experiment is an evaluation function wrapped with the @experiment() decorator from ragas.experiment: it calls the AG-UI endpoint via run_ag_ui_row, scores the response with metrics from ragas.metrics.collections, and returns an enriched result dictionary. Datasets are loaded from CSV files using ragas.dataset.Dataset.load() with a local/csv backend. An argparse-based CLI sets the endpoint URL, the evaluator model, and which experiments to skip.
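The per-row flow described above (call the agent, score the response, return an enriched dictionary) can be sketched with the standard library alone. This is a minimal illustration of the pattern's shape, not the Ragas API: `fake_agent_call`, `exact_match`, `make_row_experiment`, and `run_over_dataset` are hypothetical stand-ins for run_ag_ui_row, the LLM-based metrics, and the @experiment() machinery, and the canned answers are invented for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Row = Dict[str, Any]

# Invented canned answers so the sketch runs without a live agent.
CANNED_ANSWERS = {"Who discovered polonium?": "Marie Curie"}


async def fake_agent_call(endpoint_url: str, row: Row) -> Row:
    """Hypothetical stand-in for the AG-UI call; endpoint_url is unused here."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    answer = CANNED_ANSWERS.get(row["user_input"], "I don't know")
    return {**row, "response": answer}


def exact_match(response: str, reference: str) -> float:
    """Toy metric standing in for an LLM-based scorer."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


def make_row_experiment(endpoint_url: str) -> Callable[[Row], Awaitable[Row]]:
    """Build a per-row evaluation coroutine bound to one endpoint."""

    async def evaluate_row(row: Row) -> Row:
        enriched = await fake_agent_call(endpoint_url, row)
        score = exact_match(enriched["response"], row["reference"])
        # Return the input row plus the response and the score.
        return {**enriched, "exact_match": score}

    return evaluate_row


async def run_over_dataset(rows: List[Row], endpoint_url: str) -> List[Row]:
    """Evaluate all rows concurrently and collect the enriched results."""
    evaluate_row = make_row_experiment(endpoint_url)
    return list(await asyncio.gather(*(evaluate_row(r) for r in rows)))


rows = [
    {"user_input": "Who discovered polonium?", "reference": "Marie Curie"},
    {"user_input": "Who proposed relativity?", "reference": "Albert Einstein"},
]
results = asyncio.run(run_over_dataset(rows, "http://localhost:8000/chat"))
```

In the real module, the decorated function plays the role of `evaluate_row`, and the metric objects take the place of `exact_match`.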
Usage
This module is run as a standalone script against a running AG-UI compatible agent endpoint. It requires an OpenAI API key for the evaluator LLM and embeddings.
python experiments.py --endpoint-url http://localhost:8000/chat
python experiments.py --endpoint-url http://localhost:8000/chat --skip-tool-experiment
python experiments.py --endpoint-url http://localhost:8000 --skip-factual
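A CLI that accepts the flags used in the commands above could be wired up roughly as follows. This is a sketch under assumptions: the `--evaluator-model` flag name, both default values, and the mapping of `--skip-factual` to the scientist experiment are guesses made for illustration, and `build_parser` and `planned_experiments` are hypothetical helpers rather than functions from the module.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run AG-UI agent experiments")
    # Default value is an assumption made for this sketch.
    parser.add_argument("--endpoint-url", default="http://localhost:8000/chat",
                        help="URL of the running AG-UI agent endpoint")
    # Hypothetical flag name and default; the page only says the model is configurable.
    parser.add_argument("--evaluator-model", default="gpt-4o-mini",
                        help="Model used by the evaluator LLM")
    parser.add_argument("--skip-factual", action="store_true",
                        help="Skip the scientist biographies experiment (assumed mapping)")
    parser.add_argument("--skip-tool-experiment", action="store_true",
                        help="Skip the weather tool usage experiment")
    return parser


def planned_experiments(argv: Optional[List[str]] = None) -> List[str]:
    """Return the names of the experiments that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    plan = []
    if not args.skip_factual:
        plan.append("scientist_biographies")
    if not args.skip_tool_experiment:
        plan.append("weather_tool_usage")
    return plan
```

Passing an explicit argument list, for example `planned_experiments(["--skip-factual"])`, makes the parsing logic easy to exercise without touching `sys.argv`.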
Code Reference
| Field | Value |
|---|---|
| Source Location | examples/ragas_examples/ag_ui_agent_experiments/experiments.py |
| File Size | 398 lines |
Function Signatures
def load_scientist_dataset() -> Dataset
def load_weather_dataset() -> Dataset
async def run_scientist_experiment(endpoint_url: str, evaluator_model: str) -> tuple
async def run_tool_experiment(endpoint_url: str, evaluator_model: str) -> tuple
Key Imports
from ragas.dataset import Dataset
from ragas.experiment import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)
I/O Contract
| Function | Input | Output |
|---|---|---|
| load_scientist_dataset | None (reads from test_data/scientist_biographies.csv) | Dataset with user_input and reference fields |
| load_weather_dataset | None (reads from test_data/weather_tool_calls.csv) | Dataset with user_input, reference, and reference_tool_calls fields |
| run_scientist_experiment | endpoint_url: str, evaluator_model: str | tuple(Experiment, DataFrame) with factual_correctness, answer_relevancy, conciseness columns |
| run_tool_experiment | endpoint_url: str, evaluator_model: str | tuple(Experiment, DataFrame) with tool_call_f1 and agent_goal_accuracy columns |
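To give a feel for what an F1 score over tool calls measures, here is a small worked sketch. It is not the ToolCallF1 implementation from Ragas: it assumes each tool call is a dictionary with `name` and `args` keys, treats two calls as equal only when both match exactly, and ignores call order and duplicates. The helper names and the sample calls are invented for the example.

```python
import json
from typing import Any, Dict, List, Tuple

ToolCall = Dict[str, Any]


def _key(call: ToolCall) -> Tuple[str, str]:
    """Canonical hashable form: tool name plus arguments as sorted JSON."""
    return call["name"], json.dumps(call.get("args", {}), sort_keys=True)


def tool_call_f1(predicted: List[ToolCall], reference: List[ToolCall]) -> float:
    """Set-based F1 between predicted and reference tool calls."""
    pred = {_key(c) for c in predicted}
    ref = {_key(c) for c in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        # Also covers empty inputs; returning 0.0 here is a choice of this sketch.
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


reference_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "get_weather", "args": {"city": "London"}},
]
predicted_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "get_forecast", "args": {"city": "Paris"}},
]
score = tool_call_f1(predicted_calls, reference_calls)
```

Here one of the two predicted calls matches one of the two reference calls, so precision and recall are both 0.5 and the F1 score is 0.5.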
Usage Examples
import asyncio

from ragas_examples.ag_ui_agent_experiments.experiments import (
    run_scientist_experiment,
    run_tool_experiment,
)

# Run the scientist biographies experiment
result, df = asyncio.run(
    run_scientist_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["factual_correctness", "answer_relevancy"]].mean())

# Run the weather tool experiment
result, df = asyncio.run(
    run_tool_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["tool_call_f1", "agent_goal_accuracy"]].mean())
Related Pages
- Explodinggradients_Ragas_Text2SQL_Data_Utils -- Another example evaluation pipeline for Text-to-SQL tasks
- Explodinggradients_Ragas_MkDocs_Configuration -- Documentation site with AG-UI integration guide