Implementation:Explodinggradients Ragas AG UI Experiments Module

From Leeroopedia


Field | Value
source | Explodinggradients_Ragas (https://github.com/explodinggradients/ragas)
domains | Examples, Agent_Evaluation
last_updated | 2026-02-10 00:00 GMT

Overview

An example module demonstrating how to run AG-UI agent evaluation experiments with the Ragas @experiment decorator pattern, scoring responses on factual correctness, answer relevancy, conciseness, tool call F1, and agent goal accuracy.

Description

The experiments.py module defines two experiment scenarios for evaluating agents built with the AG-UI protocol. The Scientist Biographies experiment is a single-turn Q&A test that measures factual correctness, answer relevancy, and conciseness. The Weather Tool Usage experiment is a multi-turn test that measures tool call F1 accuracy and agent goal achievement against a reference.

Both experiments use the @experiment() decorator from ragas.experiment to define evaluation functions. Each function calls an AG-UI endpoint via run_ag_ui_row, scores the response using metrics from ragas.metrics.collections, and returns an enriched result dictionary.

The module includes an argparse CLI for specifying the endpoint URL, the evaluator model, and which experiments to skip. Datasets are loaded from CSV files using ragas.dataset.Dataset.load() with the local/csv backend.
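The decorator pattern described above can be illustrated without Ragas installed. The following is a minimal sketch under stated assumptions, not the Ragas implementation: experiment here is a simplified stand-in for ragas.experiment.experiment, the arun runner name is an assumption, fake_agent is a hypothetical replacement for the AG-UI call made through run_ag_ui_row, and exact_match is a toy scorer used in place of the LLM-based metrics.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Row = Dict[str, Any]


def experiment() -> Callable:
    """Simplified stand-in for ragas.experiment.experiment (assumption)."""

    def decorator(func: Callable[[Row], Awaitable[Row]]) -> Callable:
        async def arun(dataset: List[Row]) -> List[Row]:
            # Evaluate every row concurrently; gather preserves input order.
            return list(await asyncio.gather(*(func(row) for row in dataset)))

        func.arun = arun  # attach a dataset-level runner to the row function
        return func

    return decorator


async def fake_agent(question: str) -> str:
    """Hypothetical agent; the real module calls an AG-UI endpoint."""
    await asyncio.sleep(0)
    return {"Who discovered radium?": "Marie Curie"}.get(question, "unknown")


def exact_match(response: str, reference: str) -> float:
    """Toy scorer standing in for an LLM-based metric."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


@experiment()
async def scientist_row(row: Row) -> Row:
    response = await fake_agent(row["user_input"])
    # Return the original row enriched with the response and its score.
    return {
        **row,
        "response": response,
        "exact_match": exact_match(response, row["reference"]),
    }


# Hypothetical sample rows using the user_input/reference fields.
dataset = [
    {"user_input": "Who discovered radium?", "reference": "Marie Curie"},
    {"user_input": "Who proposed general relativity?", "reference": "Albert Einstein"},
]
results = asyncio.run(scientist_row.arun(dataset))
```

Keeping the decorated function row-level means the same scoring logic can be applied to any dataset whose rows carry the expected fields.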

Usage

This module is run as a standalone script against a running AG-UI compatible agent endpoint. It requires an OpenAI API key for the evaluator LLM and embeddings.

python experiments.py --endpoint-url http://localhost:8000/chat
python experiments.py --endpoint-url http://localhost:8000/chat --skip-tool-experiment
python experiments.py --endpoint-url http://localhost:8000 --skip-factual
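The flags in the commands above could be wired up roughly as follows. This is a sketch, not the module's actual parser: the --evaluator-model flag name, both default values, and the mapping of each skip flag to an experiment (inferred from the flag names) are assumptions, and planned_experiments is a hypothetical helper.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run AG-UI agent experiments")
    parser.add_argument(
        "--endpoint-url",
        default="http://localhost:8000/chat",  # assumed default
        help="AG-UI endpoint to evaluate",
    )
    # Flag name and default are assumptions; the page only says the
    # evaluator model can be specified on the command line.
    parser.add_argument("--evaluator-model", default="gpt-4o-mini")
    parser.add_argument(
        "--skip-factual",
        action="store_true",
        help="Skip the Scientist Biographies experiment (assumed mapping)",
    )
    parser.add_argument(
        "--skip-tool-experiment",
        action="store_true",
        help="Skip the Weather Tool Usage experiment (assumed mapping)",
    )
    return parser


def planned_experiments(argv: Optional[List[str]] = None) -> List[str]:
    """Return the names of the experiments that would run for these arguments."""
    args = build_parser().parse_args(argv)
    plan = []
    if not args.skip_factual:
        plan.append("scientist_biographies")
    if not args.skip_tool_experiment:
        plan.append("weather_tool_usage")
    return plan
```

For example, planned_experiments(["--skip-factual"]) yields only the weather tool experiment under these assumptions.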

Code Reference

Field | Value
Source Location | examples/ragas_examples/ag_ui_agent_experiments/experiments.py
File Size | 398 lines

Function Signatures

def load_scientist_dataset() -> Dataset
def load_weather_dataset() -> Dataset
async def run_scientist_experiment(endpoint_url: str, evaluator_model: str) -> tuple
async def run_tool_experiment(endpoint_url: str, evaluator_model: str) -> tuple

Key Imports

from ragas.dataset import Dataset
from ragas.experiment import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference, AnswerRelevancy,
    FactualCorrectness, ToolCallF1,
)

I/O Contract

Function | Input | Output
load_scientist_dataset | None (reads from test_data/scientist_biographies.csv) | Dataset with user_input and reference fields
load_weather_dataset | None (reads from test_data/weather_tool_calls.csv) | Dataset with user_input, reference, and reference_tool_calls fields
run_scientist_experiment | endpoint_url: str, evaluator_model: str | tuple(Experiment, DataFrame) with factual_correctness, answer_relevancy, and conciseness columns
run_tool_experiment | endpoint_url: str, evaluator_model: str | tuple(Experiment, DataFrame) with tool_call_f1 and agent_goal_accuracy columns
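To make the tool_call_f1 column more concrete, here is a simplified illustration of an F1 score over predicted and reference tool calls. It is a sketch only and may not match how the Ragas ToolCallF1 metric compares calls: each call is assumed to be a dictionary with name and args keys, calls match only when name and arguments are identical, and the get_weather tool and city values are hypothetical.

```python
import json
from typing import Any, Dict, List, Set, Tuple

ToolCall = Dict[str, Any]


def _key(call: ToolCall) -> Tuple[str, str]:
    # Canonicalise arguments so that key order does not affect matching.
    return call["name"], json.dumps(call.get("args", {}), sort_keys=True)


def tool_call_f1(predicted: List[ToolCall], reference: List[ToolCall]) -> float:
    """F1 over exact (name, args) matches; returns 0.0 when nothing matches."""
    pred: Set[Tuple[str, str]] = {_key(c) for c in predicted}
    ref: Set[Tuple[str, str]] = {_key(c) for c in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent made one of the two expected calls.
reference = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "get_weather", "args": {"city": "Rome"}},
]
predicted = [{"name": "get_weather", "args": {"city": "Paris"}}]
score = tool_call_f1(predicted, reference)
```

Here precision is 1.0 and recall is 0.5, so the score is 2/3. Treating calls as a set ignores ordering and duplicates, which is a design choice of this sketch.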

Usage Examples

import asyncio
from ragas_examples.ag_ui_agent_experiments.experiments import (
    run_scientist_experiment,
    run_tool_experiment,
)

# Run the scientist biographies experiment
result, df = asyncio.run(
    run_scientist_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["factual_correctness", "answer_relevancy"]].mean())

# Run the weather tool experiment
result, df = asyncio.run(
    run_tool_experiment("http://localhost:8000/chat", "gpt-4o-mini")
)
print(df[["tool_call_f1", "agent_goal_accuracy"]].mean())
