Workflow: Vibrantlabsai Ragas Agent Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, AI_Agents |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for evaluating AI agents that use tool calling, multi-turn conversations, and autonomous decision-making, scored with Ragas agent-specific metrics.
Description
This workflow covers the evaluation of AI agents (tool-calling LLMs, multi-step reasoning systems, agentic workflows) using specialized Ragas metrics. Unlike simple prompt or RAG evaluation, agent evaluation must assess tool call accuracy, goal completion, topic adherence across conversation turns, and the correctness of multi-step reasoning chains. Ragas provides metrics including ToolCallAccuracy, ToolCallF1, AgentGoalAccuracy, and TopicAdherenceScore that operate on multi-turn conversation samples with tool call sequences.
Key outputs:
- Per-sample scores for agent-specific metrics
- Tool call accuracy and F1 scores
- Goal completion rates across test scenarios
- Multi-turn conversation quality assessments
Usage
Execute this workflow when you have an AI agent that makes tool calls, conducts multi-turn conversations, or performs autonomous multi-step reasoning, and you need to measure how accurately it selects tools, passes arguments, achieves goals, and stays on topic. This is appropriate for evaluating function-calling agents, ReAct agents, LangGraph workflows, LlamaIndex agents, or OpenAI Swarm multi-agent systems.
Execution Steps
Step 1: Capture_Agent_Traces
Run the agent on test scenarios and capture execution traces including all LLM calls, tool selections, tool arguments, tool results, and final outputs. Each trace represents a complete multi-turn interaction. Traces are structured as message sequences with role, content, and tool_call fields. Framework-specific converters are available for LangGraph, LlamaIndex, Amazon Bedrock, and OpenAI Swarm.
Key considerations:
- Each scenario should have a defined goal and expected tool call sequence
- Capture both the agent's actions and the environment's responses
- Use framework-specific integration modules for automatic trace conversion
- Store traces with metadata for reproducibility
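Before framework conversion, a captured trace can be represented as a plain message sequence. The sketch below is illustrative only: the field names follow common OpenAI-style conventions rather than any specific framework's schema, and the scenario, goal, and tool names are assumptions.

```python
# Minimal sketch of one captured agent trace as a message sequence.
# Field names (role/content/tool_calls) follow OpenAI-style conventions;
# the scenario, goal, and tool names are illustrative assumptions.
trace = {
    "scenario_id": "weather-001",
    "goal": "Tell the user the current weather in Paris",
    "expected_tool_calls": [
        {"name": "get_weather", "args": {"city": "Paris"}},
    ],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "",
         "tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]},
        {"role": "tool", "content": '{"temp_c": 18, "condition": "cloudy"}'},
        {"role": "assistant", "content": "It is 18 degrees C and cloudy in Paris."},
    ],
}

# Both the agent's actions and the environment's (tool) responses are present,
# which Step 2 relies on when building multi-turn samples.
agent_turns = [m for m in trace["messages"] if m["role"] == "assistant"]
```

Storing the goal and expected tool calls alongside the messages keeps each trace self-describing for reproducibility.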
Step 2: Prepare_Multi_Turn_Samples
Convert captured traces into MultiTurnSample objects that Ragas metrics can process. Each sample contains the user_input (the full multi-turn message history as a list of HumanMessage, AIMessage, and ToolMessage objects), the reference (the expected outcome), and reference_tool_calls (the expected tool call sequence).
Key considerations:
- MultiTurnSample requires messages in Ragas message format
- reference_tool_calls define the expected tool call names and arguments
- Framework converters (convert_to_ragas_messages) handle format translation
- Build an EvaluationDataset from the multi-turn samples
Step 3: Select_Agent_Metrics
Choose which agent-specific metrics to evaluate. ToolCallAccuracy measures whether the agent called the right tools with correct arguments in the right order using sequence alignment. ToolCallF1 computes F1 between expected and actual tool call sets. AgentGoalAccuracy uses an LLM judge to assess whether the agent achieved its intended goal. TopicAdherenceScore measures whether the agent stayed within defined topic boundaries during the conversation.
Metric selection guide:
- ToolCallAccuracy: For agents with deterministic expected tool sequences
- ToolCallF1: For agents where tool call order does not matter
- AgentGoalAccuracy: For goal-oriented agents with defined success criteria
- TopicAdherenceScore: For conversational agents with topic constraints
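To make the distinction between the two tool-call metrics concrete, the sketch below shows what an order-insensitive tool-call F1 computes conceptually. This is a hand-rolled illustration of the idea, not the Ragas implementation; the tool names and arguments are made up.

```python
def tool_call_f1(expected, actual):
    """Conceptual order-insensitive F1 over tool calls.

    Each call is a (name, args_dict) pair; args are frozen into sorted
    tuples so calls can be compared as set members. Illustrative sketch
    only, not Ragas code.
    """
    def freeze(call):
        name, args = call
        return (name, tuple(sorted(args.items())))

    exp = {freeze(c) for c in expected}
    act = {freeze(c) for c in actual}
    if not exp or not act:
        return 0.0
    tp = len(exp & act)          # calls that match on name and arguments
    if tp == 0:
        return 0.0
    precision = tp / len(act)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)

expected = [("get_weather", {"city": "Paris"}), ("get_time", {"city": "Paris"})]
actual = [("get_weather", {"city": "Paris"})]
score = tool_call_f1(expected, actual)  # precision 1.0, recall 0.5 -> F1 = 2/3
```

Because the comparison is set-based, swapping the order of the expected calls leaves the score unchanged, which is exactly the property that distinguishes an F1-style metric from the sequence-aligned ToolCallAccuracy.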
Step 4: Run_Evaluation
Execute the evaluation using evaluate() with the multi-turn dataset and selected agent metrics. The evaluation handles multi-turn samples differently from single-turn samples, calling multi_turn_ascore() on each metric. The Executor manages concurrent evaluation with the same concurrency and retry infrastructure as single-turn evaluation.
What happens:
- Metrics process the full message history for each sample
- ToolCallAccuracy aligns predicted tool calls against reference using sequence matching
- AgentGoalAccuracy sends the full conversation to an LLM judge for assessment
- Results are aggregated into an EvaluationResult
Step 5: Analyze_Agent_Performance
Analyze results to identify specific failure patterns in agent behavior. Low ToolCallAccuracy may indicate incorrect tool selection or argument errors. Low AgentGoalAccuracy suggests the agent fails to complete its assigned tasks. Low TopicAdherenceScore indicates the agent drifts off-topic. Use per-sample analysis to identify which scenarios and tool call patterns cause failures.
Key considerations:
- Examine individual failed samples to understand error patterns
- Compare tool call sequences to identify systematic tool selection errors
- Track metrics across agent iterations to measure improvement
- Use the experiment framework for structured agent evaluation across versions
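Per-sample failure analysis can start from the scores table. The sketch below operates on a hypothetical list of per-sample rows (in Ragas, result.to_pandas() yields a similar per-sample table); the scenario names, scores, and threshold are all illustrative assumptions.

```python
# Hypothetical per-sample scores, shaped like rows of result.to_pandas().
rows = [
    {"scenario": "weather-001", "tool_call_accuracy": 1.0, "agent_goal_accuracy": 1.0},
    {"scenario": "booking-007", "tool_call_accuracy": 0.5, "agent_goal_accuracy": 0.0},
    {"scenario": "refund-013",  "tool_call_accuracy": 1.0, "agent_goal_accuracy": 0.0},
]

THRESHOLD = 0.8  # illustrative pass bar, not a Ragas default


def failures(rows, metric, threshold=THRESHOLD):
    """Scenarios scoring below the threshold on the given metric."""
    return [r["scenario"] for r in rows if r[metric] < threshold]


goal_fails = failures(rows, "agent_goal_accuracy")
tool_fails = failures(rows, "tool_call_accuracy")

# Goal failures with correct tool calls point at reasoning or answer-synthesis
# errors rather than tool selection errors.
reasoning_suspects = sorted(set(goal_fails) - set(tool_fails))
```

Cross-referencing metrics this way separates "wrong tool" failures from "right tools, wrong outcome" failures, which typically need different fixes.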