Implementation:Vibrantlabsai Ragas AgUI Experiments

Knowledge Sources	Vibrantlabsai_Ragas
Domains	LLM Evaluation, Agent Evaluation, AG-UI Protocol
Last Updated	2026-02-12 00:00 GMT

Overview

This script demonstrates how to evaluate agents built with the AG-UI protocol using Ragas metrics through the @experiment decorator pattern, covering both single-turn factual Q&A and multi-turn tool-calling scenarios.

Description

The experiments.py module implements two end-to-end experiment scenarios for evaluating AG-UI compatible agents:

Scientist Biographies (Single-turn) -- Tests an agent's ability to provide factually correct and relevant information about scientists. This experiment loads a CSV dataset of scientist biographies and evaluates agent responses using three metrics:

FactualCorrectness -- LLM-based metric measuring factual accuracy with configurable atomicity and coverage settings (both set to "high"), using F1 scoring mode.
AnswerRelevancy -- LLM-and-embedding-based metric measuring response relevance to the user input, with strictness set to 2.
DiscreteMetric (conciseness) -- A custom discrete metric that classifies responses as either "verbose" or "concise".

Weather Tool Usage (Multi-turn) -- Tests an agent's ability to correctly invoke weather tools and achieve user goals. This experiment loads a CSV dataset of weather tool call scenarios and evaluates using:

ToolCallF1 -- Rule-based metric that computes F1 score for tool call accuracy by comparing predicted tool calls against reference tool calls parsed from JSON.
AgentGoalAccuracyWithReference -- LLM-based metric assessing whether the agent achieved the user's goal given a reference answer.
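
The rule-based comparison behind a tool-call F1 score can be illustrated in plain Python. This is a minimal sketch, not the Ragas implementation: the helper names (`_call_key`, `tool_call_f1`), the `name`/`args` dictionary layout, and the sample `get_weather` calls are assumptions made for illustration. It treats a predicted call as correct only when both its name and its arguments match a reference call.

```python
import json
from collections import Counter


def _call_key(call: dict) -> tuple:
    """Hypothetical helper: make a tool call hashable as (name, canonical JSON args)."""
    args = json.dumps(call.get("args", {}), sort_keys=True)
    return (call["name"], args)


def tool_call_f1(predicted: list[dict], reference: list[dict]) -> float:
    """Illustrative F1 over tool calls; assumes exact name-and-argument matching."""
    pred = Counter(_call_key(c) for c in predicted)
    ref = Counter(_call_key(c) for c in reference)
    # Multiset intersection counts calls present in both lists.
    true_pos = sum((pred & ref).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Reference calls arrive as a JSON string, as in a CSV column (sample data is made up).
reference = json.loads('[{"name": "get_weather", "args": {"location": "Paris"}}]')
predicted = [
    {"name": "get_weather", "args": {"location": "Paris"}},
    {"name": "get_weather", "args": {"location": "London"}},
]
score = tool_call_f1(predicted, reference)
print(f"{score:.4f}")
```

Here one of the two predicted calls matches the single reference call, so precision is 0.5 and recall is 1.0. An extra, unneeded tool call therefore lowers the score even though the required call was made.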

Both experiments use the run_ag_ui_row integration function to call the AG-UI endpoint and enrich dataset rows with agent responses. The module creates fresh evaluator LLM and embedding components for each experiment via create_evaluator_components, which instantiates AsyncOpenAI clients and wraps them with Ragas llm_factory and embedding_factory.

The CLI supports --endpoint-url for specifying the AG-UI agent endpoint, --evaluator-model for the OpenAI model used in evaluation (defaults to gpt-4o-mini), and --skip-factual / --skip-tool-experiment flags to selectively run experiments. An embeddings sanity check runs before experiments begin to verify OpenAI API connectivity.
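
The flags described above map naturally onto `argparse`. The following is a sketch of how such a parser could be declared, assuming the defaults listed in the CLI Arguments table; the function name `build_parser` and the help strings are hypothetical, and the actual script may structure its parsing differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Evaluate an AG-UI agent with Ragas metrics"
    )
    parser.add_argument(
        "--endpoint-url",
        default="http://localhost:8000",
        help="AG-UI endpoint URL",
    )
    parser.add_argument(
        "--evaluator-model",
        default="gpt-4o-mini",
        help="OpenAI model used by the evaluator",
    )
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--skip-factual", action="store_true")
    parser.add_argument("--skip-tool-experiment", action="store_true")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--skip-factual", "--evaluator-model", "gpt-4o"])
print(args.endpoint_url, args.evaluator_model, args.skip_factual, args.skip_tool_experiment)
```

With this argument list, only the weather tool experiment would run, using `gpt-4o` as the evaluator against the default endpoint.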

Usage

Run this script to evaluate an AG-UI-compatible agent for factual correctness, answer relevancy, tool call accuracy, or goal achievement. It requires an AG-UI agent reachable at a network endpoint and valid OpenAI API credentials.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: examples/ragas_examples/ag_ui_agent_experiments/experiments.py

Signature

def load_scientist_dataset() -> Dataset

def load_weather_dataset() -> Dataset

def create_evaluator_components(model_name: str) -> tuple

async def run_scientist_experiment(endpoint_url: str, evaluator_model: str) -> tuple

async def run_tool_experiment(endpoint_url: str, evaluator_model: str) -> tuple

async def main() -> None

Import

from ragas.dataset import Dataset
from ragas.experiment import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)

I/O Contract

Inputs

run_scientist_experiment

Name	Type	Required	Description
endpoint_url	str	Yes	The AG-UI endpoint URL where the agent is running
evaluator_model	str	Yes	OpenAI model name for the evaluator LLM (e.g., "gpt-4o-mini")

run_tool_experiment

Name	Type	Required	Description
endpoint_url	str	Yes	The AG-UI endpoint URL where the agent is running
evaluator_model	str	Yes	OpenAI model name for the evaluator LLM (e.g., "gpt-4o-mini")

create_evaluator_components

Name	Type	Required	Description
model_name	str	Yes	OpenAI model name to use for the evaluator LLM

CLI Arguments

Name	Type	Required	Description
--endpoint-url	str	No	AG-UI endpoint URL (default: http://localhost:8000)
--evaluator-model	str	No	OpenAI model for evaluation (default: gpt-4o-mini)
--skip-factual	flag	No	Skip the scientist biographies experiment
--skip-tool-experiment	flag	No	Skip the weather tool usage experiment

Outputs

run_scientist_experiment

Name	Type	Description
return	tuple	Tuple of (Experiment result, pandas DataFrame) containing per-row factual_correctness, answer_relevancy, and conciseness scores

run_tool_experiment

Name	Type	Description
return	tuple	Tuple of (Experiment result, pandas DataFrame) containing per-row tool_call_f1 and agent_goal_accuracy scores

create_evaluator_components

Name	Type	Description
return	tuple	Tuple of (evaluator_llm, evaluator_embeddings) -- Ragas LLM wrapper and embedding wrapper

Usage Examples

Running via CLI

# Run all experiments against a local agent
python experiments.py --endpoint-url http://localhost:8000/chat

# Run only the tool experiment
python experiments.py --endpoint-url http://localhost:8000/chat --skip-factual

# Run only the scientist experiment with a specific evaluator model
python experiments.py --endpoint-url http://localhost:8000 --skip-tool-experiment --evaluator-model gpt-4o

Programmatic Usage

import asyncio
from experiments import run_scientist_experiment, run_tool_experiment

# Run the scientist biographies experiment
result, df = asyncio.run(
    run_scientist_experiment(
        endpoint_url="http://localhost:8000/chat",
        evaluator_model="gpt-4o-mini",
    )
)

# Inspect results
print(df[["user_input", "factual_correctness", "answer_relevancy", "conciseness"]])
print(f"Average factual correctness: {df['factual_correctness'].mean():.4f}")

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
