Workflow: Vibrantlabsai Ragas Agent Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, AI_Agents |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for evaluating AI agents that use tool calling, multi-turn conversations, and autonomous decision-making, scored with Ragas agent-specific metrics.
Description
This workflow covers the evaluation of AI agents (tool-calling LLMs, multi-step reasoning systems, agentic workflows) using specialized Ragas metrics. Unlike simple prompt or RAG evaluation, agent evaluation must assess tool call accuracy, goal completion, topic adherence across conversation turns, and the correctness of multi-step reasoning chains. Ragas provides metrics including ToolCallAccuracy, ToolCallF1, AgentGoalAccuracy, and TopicAdherenceScore that operate on multi-turn conversation samples with tool call sequences.
Key outputs:
- Per-sample scores for agent-specific metrics
- Tool call accuracy and F1 scores
- Goal completion rates across test scenarios
- Multi-turn conversation quality assessments
Usage
Execute this workflow when you have an AI agent that makes tool calls, conducts multi-turn conversations, or performs autonomous multi-step reasoning, and you need to measure how accurately it selects tools, passes arguments, achieves goals, and stays on topic. This is appropriate for evaluating function-calling agents, ReAct agents, LangGraph workflows, LlamaIndex agents, or OpenAI Swarm multi-agent systems.
Execution Steps
Step 1: Capture_Agent_Traces
Run the agent on test scenarios and capture execution traces including all LLM calls, tool selections, tool arguments, tool results, and final outputs. Each trace represents a complete multi-turn interaction. Traces are structured as message sequences with role, content, and tool_call fields. Framework-specific converters are available for LangGraph, LlamaIndex, Amazon Bedrock, and OpenAI Swarm.
Key considerations:
- Each scenario should have a defined goal and expected tool call sequence
- Capture both the agent's actions and the environment's responses
- Use framework-specific integration modules for automatic trace conversion
- Store traces with metadata for reproducibility
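Before framework conversion, a captured trace can be represented as a plain message sequence. The sketch below is illustrative only: the field names follow common OpenAI-style conventions rather than any specific framework's schema, and the scenario, goal, and tool names are assumptions.

```python
# Minimal sketch of one captured agent trace as a message sequence.
# Field names (role/content/tool_calls) follow OpenAI-style conventions;
# the scenario, goal, and tool names are illustrative assumptions.
trace = {
    "scenario_id": "weather-001",
    "goal": "Tell the user the current weather in Paris",
    "expected_tool_calls": [
        {"name": "get_weather", "args": {"city": "Paris"}},
    ],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "",
         "tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]},
        {"role": "tool", "content": '{"temp_c": 18, "condition": "cloudy"}'},
        {"role": "assistant", "content": "It is 18 degrees C and cloudy in Paris."},
    ],
}

# Both the agent's actions and the environment's (tool) responses are present,
# which Step 2 relies on when building multi-turn samples.
agent_turns = [m for m in trace["messages"] if m["role"] == "assistant"]
```

Storing the goal and expected tool calls alongside the messages keeps each trace self-describing for reproducibility.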
Step 2: Prepare_Multi_Turn_Samples
Convert captured traces into MultiTurnSample objects that Ragas metrics can process. Each sample contains the user_input (the full multi-turn message history as a list of HumanMessage, AIMessage, and ToolMessage objects), the reference (the expected outcome), and reference_tool_calls (the expected tool call sequence).
Key considerations:
- MultiTurnSample requires messages in Ragas message format
- reference_tool_calls define the expected tool call names and arguments
- Framework converters (convert_to_ragas_messages) handle format translation
- Build an EvaluationDataset from the multi-turn samples
Step 3: Select_Agent_Metrics
Choose which agent-specific metrics to evaluate. ToolCallAccuracy measures whether the agent called the right tools with correct arguments in the right order using sequence alignment. ToolCallF1 computes F1 between expected and actual tool call sets. AgentGoalAccuracy uses an LLM judge to assess whether the agent achieved its intended goal. TopicAdherenceScore measures whether the agent stayed within defined topic boundaries during the conversation.
Metric selection guide:
- ToolCallAccuracy: For agents with deterministic expected tool sequences
- ToolCallF1: For agents where tool call order does not matter
- AgentGoalAccuracy: For goal-oriented agents with defined success criteria
- TopicAdherenceScore: For conversational agents with topic constraints
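To make the distinction between the two tool-call metrics concrete, the sketch below shows what an order-insensitive tool-call F1 computes conceptually. This is a hand-rolled illustration of the idea, not the Ragas implementation; the tool names and arguments are made up.

```python
def tool_call_f1(expected, actual):
    """Conceptual order-insensitive F1 over tool calls.

    Each call is a (name, args_dict) pair; args are frozen into sorted
    tuples so calls can be compared as set members. Illustrative sketch
    only, not Ragas code.
    """
    def freeze(call):
        name, args = call
        return (name, tuple(sorted(args.items())))

    exp = {freeze(c) for c in expected}
    act = {freeze(c) for c in actual}
    if not exp or not act:
        return 0.0
    tp = len(exp & act)          # calls that match on name and arguments
    if tp == 0:
        return 0.0
    precision = tp / len(act)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)

expected = [("get_weather", {"city": "Paris"}), ("get_time", {"city": "Paris"})]
actual = [("get_weather", {"city": "Paris"})]
score = tool_call_f1(expected, actual)  # precision 1.0, recall 0.5 -> F1 = 2/3
```

Because the comparison is set-based, swapping the order of the expected calls leaves the score unchanged, which is exactly the property that distinguishes an F1-style metric from the sequence-aligned ToolCallAccuracy.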
Step 4: Run_Evaluation
Execute the evaluation using evaluate() with the multi-turn dataset and selected agent metrics. The evaluation handles multi-turn samples differently from single-turn samples, calling multi_turn_ascore() on each metric. The Executor manages concurrent evaluation with the same concurrency and retry infrastructure as single-turn evaluation.
What happens:
- Metrics process the full message history for each sample
- ToolCallAccuracy aligns predicted tool calls against reference using sequence matching
- AgentGoalAccuracy sends the full conversation to an LLM judge for assessment
- Results are aggregated into an EvaluationResult
Step 5: Analyze_Agent_Performance
Analyze results to identify specific failure patterns in agent behavior. Low ToolCallAccuracy may indicate incorrect tool selection or argument errors. Low AgentGoalAccuracy suggests the agent fails to complete its assigned tasks. Low TopicAdherenceScore indicates the agent drifts off-topic. Use per-sample analysis to identify which scenarios and tool call patterns cause failures.
Key considerations:
- Examine individual failed samples to understand error patterns
- Compare tool call sequences to identify systematic tool selection errors
- Track metrics across agent iterations to measure improvement
- Use the experiment framework for structured agent evaluation across versions
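Per-sample failure analysis can start from the scores table. The sketch below operates on a hypothetical list of per-sample rows (in Ragas, result.to_pandas() yields a similar per-sample table); the scenario names, scores, and threshold are all illustrative assumptions.

```python
# Hypothetical per-sample scores, shaped like rows of result.to_pandas().
rows = [
    {"scenario": "weather-001", "tool_call_accuracy": 1.0, "agent_goal_accuracy": 1.0},
    {"scenario": "booking-007", "tool_call_accuracy": 0.5, "agent_goal_accuracy": 0.0},
    {"scenario": "refund-013",  "tool_call_accuracy": 1.0, "agent_goal_accuracy": 0.0},
]

THRESHOLD = 0.8  # illustrative pass bar, not a Ragas default


def failures(rows, metric, threshold=THRESHOLD):
    """Scenarios scoring below the threshold on the given metric."""
    return [r["scenario"] for r in rows if r[metric] < threshold]


goal_fails = failures(rows, "agent_goal_accuracy")
tool_fails = failures(rows, "tool_call_accuracy")

# Goal failures with correct tool calls point at reasoning or answer-synthesis
# errors rather than tool selection errors.
reasoning_suspects = sorted(set(goal_fails) - set(tool_fails))
```

Cross-referencing metrics this way separates "wrong tool" failures from "right tools, wrong outcome" failures, which typically need different fixes.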