Implementation:Explodinggradients Ragas AG UI Experiments Module

Field	Value
source	Explodinggradients_Ragas (https://github.com/explodinggradients/ragas)
domains	Examples, Agent_Evaluation
last_updated	2026-02-10 00:00 GMT

Overview

An example module that shows how to run AG-UI agent evaluation experiments with the Ragas @experiment decorator pattern, scoring factual correctness, answer relevancy, conciseness, tool call F1, and agent goal accuracy.

Description

The experiments.py module defines two experiment scenarios for evaluating agents built with the AG-UI protocol. The Scientist Biographies experiment is a single-turn Q&A test that measures factual correctness, answer relevancy, and conciseness. The Weather Tool Usage experiment is a multi-turn test that measures tool call F1 and agent goal accuracy against a reference.

Both experiments use the @experiment() decorator from ragas.experiment to define evaluation functions. Each function calls an AG-UI endpoint via run_ag_ui_row, scores the response with metrics from ragas.metrics.collections, and returns an enriched result dictionary.

The module includes an argparse CLI for specifying the endpoint URL, the evaluator model, and which experiments to skip. Datasets are loaded from CSV files using ragas.dataset.Dataset.load() with a local/csv backend.
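To make the tool call F1 score concrete, the sketch below computes an F1 value over predicted and reference tool calls in plain Python. It is an illustration of the general precision/recall idea only, not the Ragas ToolCallF1 implementation; the function name tool_call_f1, the dictionary layout with "name" and "args" keys, and the sample calls are all assumptions made for this example.

```python
from collections import Counter


def tool_call_f1(predicted, reference):
    """Illustrative F1 over tool calls, treated as an unordered multiset.

    Hypothetical helper: each call is a dict with a "name" and an optional
    "args" dict whose values are hashable (strings, numbers).
    """

    def key(call):
        # Freeze the args so the call is hashable and argument order is ignored.
        return (call["name"], tuple(sorted(call.get("args", {}).items())))

    pred = Counter(key(c) for c in predicted)
    ref = Counter(key(c) for c in reference)

    # Calls that appear in both multisets count as true positives.
    true_pos = sum((pred & ref).values())
    if true_pos == 0:
        return 0.0

    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: the agent makes one correct call and one extra call.
reference = [{"name": "get_weather", "args": {"city": "Paris"}}]
predicted = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "get_weather", "args": {"city": "London"}},
]
score = tool_call_f1(predicted, reference)
print(round(score, 3))
```

Here precision is 1/2 and recall is 1/1, so the extra call lowers the score even though the expected call was made.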

Usage

This module runs as a standalone script against a running AG-UI-compatible agent endpoint. It requires an OpenAI API key for the evaluator LLM and embeddings.

python experiments.py --endpoint-url http://localhost:8000/chat
python experiments.py --endpoint-url http://localhost:8000/chat --skip-tool-experiment
python experiments.py --endpoint-url http://localhost:8000 --skip-factual
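The commands above imply an argparse interface along the following lines. This is a minimal sketch, not the module's actual parser: the --endpoint-url, --skip-factual, and --skip-tool-experiment flags are taken from the commands shown here, while the --evaluator-model flag name, its default, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the CLI, for illustration only."""
    parser = argparse.ArgumentParser(
        description="Run AG-UI agent evaluation experiments"
    )
    parser.add_argument(
        "--endpoint-url",
        required=True,
        help="URL of the running AG-UI agent endpoint",
    )
    # Assumed flag name and default; the source only says the evaluator
    # model is configurable from the command line.
    parser.add_argument(
        "--evaluator-model",
        default="gpt-4o-mini",
        help="Model used by the evaluator LLM",
    )
    parser.add_argument(
        "--skip-factual",
        action="store_true",
        help="Skip the scientist biographies experiment",
    )
    parser.add_argument(
        "--skip-tool-experiment",
        action="store_true",
        help="Skip the weather tool usage experiment",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run anywhere.
args = build_parser().parse_args(
    ["--endpoint-url", "http://localhost:8000/chat", "--skip-factual"]
)
print(args.endpoint_url, args.skip_factual, args.skip_tool_experiment)
```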

Code Reference

Field	Value
Source Location	`examples/ragas_examples/ag_ui_agent_experiments/experiments.py`
File Size	398 lines

Function Signatures

def load_scientist_dataset() -> Dataset

def load_weather_dataset() -> Dataset

async def run_scientist_experiment(endpoint_url: str, evaluator_model: str) -> tuple

async def run_tool_experiment(endpoint_url: str, evaluator_model: str) -> tuple

Key Imports

from ragas.dataset import Dataset
from ragas.experiment import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference, AnswerRelevancy,
    FactualCorrectness, ToolCallF1,
)

I/O Contract

Function	Input	Output
load_scientist_dataset	None (reads from `test_data/scientist_biographies.csv`)	`Dataset` with user_input and reference fields
load_weather_dataset	None (reads from `test_data/weather_tool_calls.csv`)	`Dataset` with user_input, reference, and reference_tool_calls fields
run_scientist_experiment	`endpoint_url: str`, `evaluator_model: str`	`tuple(Experiment, DataFrame)` with factual_correctness, answer_relevancy, conciseness columns
run_tool_experiment	`endpoint_url: str`, `evaluator_model: str`	`tuple(Experiment, DataFrame)` with tool_call_f1 and agent_goal_accuracy columns
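Because the loaders expect specific fields, it can help to check a CSV for the required columns before running an experiment. The sketch below does this with the standard library only and does not use the Ragas Dataset API; the check_columns helper and the sample row are hypothetical, and only the column names user_input and reference come from the table above.

```python
import csv
import io

# Column names taken from the I/O contract for the scientist dataset.
REQUIRED_COLUMNS = {"user_input", "reference"}


def check_columns(csv_text, required):
    """Return the parsed rows, or raise if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(required) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Made-up sample content standing in for a real CSV file.
sample = (
    "user_input,reference\n"
    "Who developed the theory of general relativity?,Albert Einstein\n"
)
rows = check_columns(sample, REQUIRED_COLUMNS)
print(rows[0]["user_input"])
```

For the weather dataset, the same check would also include reference_tool_calls in the required set.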

Usage Examples

import asyncio
from ragas_examples.ag_ui_agent_experiments.experiments import (
    run_scientist_experiment,
    run_tool_experiment,
)

# Run the scientist biographies experiment
result, df = asyncio.run(
    run_scientist_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["factual_correctness", "answer_relevancy"]].mean())

# Run the weather tool experiment
result, df = asyncio.run(
    run_tool_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["tool_call_f1", "agent_goal_accuracy"]].mean())

Related Pages

Explodinggradients_Ragas_Text2SQL_Data_Utils -- Another example evaluation pipeline for Text-to-SQL tasks
Explodinggradients_Ragas_MkDocs_Configuration -- Documentation site with AG-UI integration guide
