Workflow:Explodinggradients Ragas Agent Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Agents, Evaluation, Tool_Use |
| Last Updated | 2026-02-10 06:00 GMT |
Overview
End-to-end process for evaluating AI agents that use tool calling, multi-step reasoning, and multi-turn conversations using Ragas metrics and the experiment framework.
Description
This workflow covers the evaluation of LLM-powered agents that interact with tools, APIs, or external systems. It addresses both single-turn agents (one request, one response with tool calls) and multi-turn agents (conversational workflows with multiple interactions). The evaluation captures not just final answer correctness but also tool call accuracy, goal achievement, and topic adherence. Ragas provides specialized agent metrics (ToolCallAccuracy, ToolCallF1, AgentGoalAccuracy, TopicAdherenceScore) alongside general-purpose metrics, and supports integration with major agent frameworks (LangGraph, LlamaIndex, OpenAI Swarm, AG-UI).
Usage
Execute this workflow when you have an AI agent that uses function calling, tool invocation, or multi-step reasoning and need to evaluate its end-to-end performance. You should have test scenarios with expected outcomes (correct final answers, expected tool call sequences, or goal completion criteria). This is essential for agents in production that handle customer requests, data processing, or automated decision-making.
Execution Steps
Step 1: Define the Agent Under Test
Set up the agent that will be evaluated. This could be an OpenAI function-calling agent, a LangGraph ReAct agent, a LlamaIndex agent, or a custom implementation. The agent should accept inputs (questions, tasks) and produce outputs (responses, tool calls, conversation history). Ensure the agent can be invoked programmatically from the experiment function.
Key considerations:
- The agent must expose a callable interface for batch evaluation
- For multi-turn agents, capture the full conversation history
- Tool definitions should be well-documented for accurate evaluation
- Framework-specific integrations (LangGraph, LlamaIndex, Swarm) provide automatic message conversion
Step 2: Prepare the Agent Test Dataset
Create a dataset with test scenarios, expected outcomes, and evaluation criteria. For tool-calling agents, include expected tool call sequences. For goal-oriented agents, include goal descriptions and success criteria. For multi-turn agents, include conversation contexts or interaction scripts.
Key considerations:
- Use SingleTurnSample for single-request agents, MultiTurnSample for conversational agents
- Include reference tool calls for ToolCallAccuracy evaluation
- Include goal descriptions for AgentGoalAccuracy evaluation
- Cover diverse scenarios: happy paths, edge cases, ambiguous inputs, error recovery
Step 3: Select Agent-Specific Metrics
Choose evaluation metrics appropriate for the agent type. Ragas provides specialized agent metrics alongside general-purpose metrics. Combine multiple metrics to capture different quality dimensions of agent behavior.
Available agent metrics:
- ToolCallAccuracy: Evaluates sequence alignment and argument matching of tool calls
- ToolCallF1: Computes F1 score between predicted and reference tool calls
- AgentGoalAccuracy: Measures whether the agent achieved the intended goal
- TopicAdherenceScore: Evaluates whether the agent stays on-topic in multi-turn conversations
General metrics often used with agents:
- Custom discrete metrics for final answer correctness
- Custom numeric metrics for response quality scoring
Step 4: Run the Agent Evaluation Experiment
Use the @experiment decorator to wrap the agent evaluation function. For each test case, invoke the agent, capture its output (response, tool calls, conversation), and score it using the selected metrics. The experiment framework handles async execution and result persistence.
Key considerations:
- Async agent invocation enables efficient batch evaluation
- Capture both the final response and intermediate tool calls
- Framework integration adapters convert native message formats to Ragas format
- For LangGraph: use convert_to_ragas_messages()
- For Swarm: use the swarm integration adapter
Step 5: Analyze Agent Performance and Iterate
Review experiment results to identify agent weaknesses. Examine tool call accuracy, goal completion rates, and per-scenario breakdowns. Use insights to improve agent prompts, tool definitions, or reasoning strategies. Re-run experiments to measure improvement.
Key considerations:
- Analyze tool call sequences to identify common misrouting patterns
- Check if the agent struggles with specific tool argument types
- For multi-turn agents, look for conversation derailment patterns
- Compare agent performance across different LLM backends