Workflow:Explodinggradients Ragas Agent Evaluation

Knowledge Sources	Ragas Ragas Docs Agent Eval Tutorial
Domains	LLMs, Agents, Evaluation, Tool_Use
Last Updated	2026-02-10 06:00 GMT

Overview

End-to-end process for evaluating AI agents that use tool calling, multi-step reasoning, and multi-turn conversations using Ragas metrics and the experiment framework.

Description

This workflow covers the evaluation of LLM-powered agents that interact with tools, APIs, or external systems. It addresses both single-turn agents (one request, one response with tool calls) and multi-turn agents (conversational workflows with multiple interactions). The evaluation captures not just final answer correctness but also tool call accuracy, goal achievement, and topic adherence. Ragas provides specialized agent metrics (ToolCallAccuracy, ToolCallF1, AgentGoalAccuracy, TopicAdherenceScore) alongside general-purpose metrics, and supports integration with major agent frameworks (LangGraph, LlamaIndex, OpenAI Swarm, AG-UI).

Usage

Execute this workflow when you have an AI agent that uses function calling, tool invocation, or multi-step reasoning and need to evaluate its end-to-end performance. You should have test scenarios with expected outcomes (correct final answers, expected tool call sequences, or goal completion criteria). This is essential for agents in production that handle customer requests, data processing, or automated decision-making.

Execution Steps

Step 1: Define the Agent Under Test

Set up the agent that will be evaluated. This could be an OpenAI function-calling agent, a LangGraph ReAct agent, a LlamaIndex agent, or a custom implementation. The agent should accept inputs (questions, tasks) and produce outputs (responses, tool calls, conversation history). Ensure the agent can be invoked programmatically from the experiment function.

Key considerations:

The agent must expose a callable interface for batch evaluation
For multi-turn agents, capture the full conversation history
Tool definitions should be well-documented for accurate evaluation
Framework-specific integrations (LangGraph, LlamaIndex, Swarm) provide automatic message conversion

Step 2: Prepare the Agent Test Dataset

Create a dataset with test scenarios, expected outcomes, and evaluation criteria. For tool-calling agents, include expected tool call sequences. For goal-oriented agents, include goal descriptions and success criteria. For multi-turn agents, include conversation contexts or interaction scripts.

Key considerations:

Use SingleTurnSample for single-request agents, MultiTurnSample for conversational agents
Include reference tool calls for ToolCallAccuracy evaluation
Include goal descriptions for AgentGoalAccuracy evaluation
Cover diverse scenarios: happy paths, edge cases, ambiguous inputs, error recovery

Step 3: Select Agent-Specific Metrics

Choose evaluation metrics appropriate for the agent type. Ragas provides specialized agent metrics alongside general-purpose metrics. Combine multiple metrics to capture different quality dimensions of agent behavior.

Available agent metrics:

ToolCallAccuracy: Evaluates sequence alignment and argument matching of tool calls
ToolCallF1: Computes F1 score between predicted and reference tool calls
AgentGoalAccuracy: Measures whether the agent achieved the intended goal
TopicAdherenceScore: Evaluates whether the agent stays on-topic in multi-turn conversations

General metrics often used with agents:

Custom discrete metrics for final answer correctness
Custom numeric metrics for response quality scoring

Step 4: Run the Agent Evaluation Experiment

Use the @experiment decorator to wrap the agent evaluation function. For each test case, invoke the agent, capture its output (response, tool calls, conversation), and score it using the selected metrics. The experiment framework handles async execution and result persistence.

Key considerations:

Async agent invocation enables efficient batch evaluation
Capture both the final response and intermediate tool calls
Framework integration adapters convert native message formats to Ragas format
For LangGraph: use convert_to_ragas_messages()
For Swarm: use the swarm integration adapter

Step 5: Analyze Agent Performance and Iterate

Review experiment results to identify agent weaknesses. Examine tool call accuracy, goal completion rates, and per-scenario breakdowns. Use insights to improve agent prompts, tool definitions, or reasoning strategies. Re-run experiments to measure improvement.

Key considerations:

Analyze tool call sequences to identify common misrouting patterns
Check if the agent struggles with specific tool argument types
For multi-turn agents, look for conversation derailment patterns
Compare agent performance across different LLM backends

Execution Diagram

GitHub URL

Workflow Repository