Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Explodinggradients Ragas Agent Evaluation

From Leeroopedia


Knowledge Sources
Domains LLMs, Agents, Evaluation, Tool_Use
Last Updated 2026-02-10 06:00 GMT

Overview

End-to-end process for evaluating AI agents that use tool calling, multi-step reasoning, and multi-turn conversations using Ragas metrics and the experiment framework.

Description

This workflow covers the evaluation of LLM-powered agents that interact with tools, APIs, or external systems. It addresses both single-turn agents (one request, one response with tool calls) and multi-turn agents (conversational workflows with multiple interactions). The evaluation captures not just final answer correctness but also tool call accuracy, goal achievement, and topic adherence. Ragas provides specialized agent metrics (ToolCallAccuracy, ToolCallF1, AgentGoalAccuracy, TopicAdherenceScore) alongside general-purpose metrics, and supports integration with major agent frameworks (LangGraph, LlamaIndex, OpenAI Swarm, AG-UI).

Usage

Execute this workflow when you have an AI agent that uses function calling, tool invocation, or multi-step reasoning and need to evaluate its end-to-end performance. You should have test scenarios with expected outcomes (correct final answers, expected tool call sequences, or goal completion criteria). This is essential for agents in production that handle customer requests, data processing, or automated decision-making.

Execution Steps

Step 1: Define the Agent Under Test

Set up the agent that will be evaluated. This could be an OpenAI function-calling agent, a LangGraph ReAct agent, a LlamaIndex agent, or a custom implementation. The agent should accept inputs (questions, tasks) and produce outputs (responses, tool calls, conversation history). Ensure the agent can be invoked programmatically from the experiment function.

Key considerations:

  • The agent must expose a callable interface for batch evaluation
  • For multi-turn agents, capture the full conversation history
  • Tool definitions should be well-documented for accurate evaluation
  • Framework-specific integrations (LangGraph, LlamaIndex, Swarm) provide automatic message conversion

Step 2: Prepare the Agent Test Dataset

Create a dataset with test scenarios, expected outcomes, and evaluation criteria. For tool-calling agents, include expected tool call sequences. For goal-oriented agents, include goal descriptions and success criteria. For multi-turn agents, include conversation contexts or interaction scripts.

Key considerations:

  • Use SingleTurnSample for single-request agents, MultiTurnSample for conversational agents
  • Include reference tool calls for ToolCallAccuracy evaluation
  • Include goal descriptions for AgentGoalAccuracy evaluation
  • Cover diverse scenarios: happy paths, edge cases, ambiguous inputs, error recovery

Step 3: Select Agent-Specific Metrics

Choose evaluation metrics appropriate for the agent type. Ragas provides specialized agent metrics alongside general-purpose metrics. Combine multiple metrics to capture different quality dimensions of agent behavior.

Available agent metrics:

  • ToolCallAccuracy: Evaluates sequence alignment and argument matching of tool calls
  • ToolCallF1: Computes F1 score between predicted and reference tool calls
  • AgentGoalAccuracy: Measures whether the agent achieved the intended goal
  • TopicAdherenceScore: Evaluates whether the agent stays on-topic in multi-turn conversations

General metrics often used with agents:

  • Custom discrete metrics for final answer correctness
  • Custom numeric metrics for response quality scoring

Step 4: Run the Agent Evaluation Experiment

Use the @experiment decorator to wrap the agent evaluation function. For each test case, invoke the agent, capture its output (response, tool calls, conversation), and score it using the selected metrics. The experiment framework handles async execution and result persistence.

Key considerations:

  • Async agent invocation enables efficient batch evaluation
  • Capture both the final response and intermediate tool calls
  • Framework integration adapters convert native message formats to Ragas format
  • For LangGraph: use convert_to_ragas_messages()
  • For Swarm: use the swarm integration adapter

Step 5: Analyze Agent Performance and Iterate

Review experiment results to identify agent weaknesses. Examine tool call accuracy, goal completion rates, and per-scenario breakdowns. Use insights to improve agent prompts, tool definitions, or reasoning strategies. Re-run experiments to measure improvement.

Key considerations:

  • Analyze tool call sequences to identify common misrouting patterns
  • Check if the agent struggles with specific tool argument types
  • For multi-turn agents, look for conversation derailment patterns
  • Compare agent performance across different LLM backends

Execution Diagram

GitHub URL

Workflow Repository