
Workflow:Confident AI DeepEval AI Agent Evaluation

From Leeroopedia
Knowledge Sources
Domains: LLM_Evaluation, AI_Agents, Integration_Testing
Last Updated: 2026-02-14 09:00 GMT

Overview

End-to-end process for evaluating AI agents built with popular frameworks (LangChain, LangGraph, CrewAI, PydanticAI, OpenAI Agents, LlamaIndex) using DeepEval's instrumentation and agent-specific metrics.

Description

This workflow covers the evaluation of AI agent systems that use tools, make multi-step decisions, and may involve multiple agents. DeepEval provides framework-specific instrumentators or callback handlers that automatically trace agent execution (LLM calls, tool invocations, retriever queries, agent handoffs) without modifying the agent code. Agent-specific metrics like TaskCompletionMetric, ToolUseMetric, StepEfficiencyMetric, and PlanQualityMetric evaluate whether the agent correctly completes tasks, uses tools appropriately, and operates efficiently. The workflow supports both end-to-end agent evaluation and component-level evaluation of individual agent steps.

Usage

Execute this workflow when you have an AI agent application built with a supported framework and need to evaluate its task completion, tool usage correctness, or step efficiency. This applies to agents that make tool calls, handle multi-turn conversations, delegate to sub-agents, or follow conditional routing logic.

Execution Steps

Step 1: Set Up Framework Instrumentation

Install the DeepEval integration for your agent framework and configure the instrumentation. Each framework has a specific integration mechanism: callback handlers for LangChain/LangGraph, instrumentators for CrewAI/PydanticAI/OpenAI Agents, or custom wrappers.

Framework-specific setup:

  • LangChain/LangGraph: Use CallbackHandler from deepeval integrations
  • CrewAI: Use ConfidentInstrumentationSettings with the Crew
  • PydanticAI: Use ConfidentInstrumentationSettings at agent level
  • OpenAI Agents: Install DeepEvalTracingProcessor as the trace processor
  • LlamaIndex: Use the deepeval callback handler

Step 2: Define Agent-Specific Metrics

Select metrics appropriate for evaluating agent behavior. Agent metrics evaluate higher-level concerns than standard LLM metrics, focusing on whether the agent achieves its goal and uses its tools correctly.

Agent metric categories:

  • Task completion: TaskCompletionMetric evaluates whether the agent fulfilled its assigned task
  • Tool correctness: ToolUseMetric and ArgumentCorrectnessMetric evaluate tool selection and parameter accuracy
  • Efficiency: StepEfficiencyMetric evaluates whether the agent used an optimal number of steps
  • Planning: PlanQualityMetric and PlanAdherenceMetric evaluate the agent's execution plan
  • Goal accuracy: GoalAccuracyMetric evaluates whether the agent reached the desired outcome
  • MCP metrics: MCPUseMetric and MCPTaskCompletionMetric for MCP server usage
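To make two of these categories concrete, here are toy scorers in the spirit of the efficiency and tool-correctness metrics. These formulas are illustrative only, not DeepEval's implementations:

```python
def step_efficiency_score(actual_steps: int, optimal_steps: int) -> float:
    """Toy efficiency score: 1.0 when the agent takes the optimal number of
    steps, decaying toward 0 as it takes more than needed."""
    if actual_steps <= 0 or optimal_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, optimal_steps / actual_steps)

def tool_selection_score(called_tools: list[str], expected_tools: list[str]) -> float:
    """Toy tool-correctness score: fraction of the expected tools that the
    agent actually invoked (ignores call order and arguments)."""
    expected = set(expected_tools)
    if not expected:
        return 1.0
    return len(expected & set(called_tools)) / len(expected)
```

Argument-level checks (as in ArgumentCorrectnessMetric) would additionally compare the parameters of each call, which these sketches deliberately leave out.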

Step 3: Prepare Evaluation Dataset

Create an EvaluationDataset with Golden objects representing test scenarios for the agent. Each golden contains an input query and optionally an expected_output that describes the desired agent behavior or result.

Dataset construction:

  • Define diverse test scenarios covering different agent capabilities
  • Include edge cases (ambiguous queries, multi-step tasks, error scenarios)
  • Set expected_output to describe successful task completion criteria
  • For conversational agents, use ConversationalGolden with multi-turn scenarios

Step 4: Execute Agent Evaluation

Run the instrumented agent against the evaluation dataset. The instrumentation automatically captures traces of all agent steps (LLM calls, tool invocations, retriever queries). Metrics are evaluated either at the trace level (end-to-end) or at individual span levels (component-level).

Execution patterns:

  • Dataset iterator: Loop through dataset.evals_iterator() and invoke the agent
  • Metric collections: Assign metric_collection names to map metrics to specific components
  • Async support: Use asyncio.create_task for async agent invocations
  • Online evaluation: Use named metric collections for production monitoring

Step 5: Analyze Agent Performance

Review agent evaluation results including task completion rates, tool usage accuracy, and step efficiency scores. The trace tree shows the complete execution flow with each agent decision, tool call, and LLM invocation as separate spans with their own metrics.

Analysis dimensions:

  • Overall task completion rate across test scenarios
  • Tool usage patterns and argument correctness per tool
  • Step count efficiency compared to optimal paths
  • Plan quality and adherence scores
  • Framework-specific span details (agent handoffs, retriever results, etc.)
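The analysis dimensions above reduce to simple aggregations over per-scenario results. A sketch with an invented result schema (the field names are assumptions, not DeepEval's actual output format):

```python
from collections import defaultdict

# Hypothetical per-scenario results extracted from evaluation traces.
runs = [
    {"task_completed": True,  "tool_calls": [("search", True), ("book", True)],
     "steps": 4, "optimal_steps": 4},
    {"task_completed": False, "tool_calls": [("search", False)],
     "steps": 7, "optimal_steps": 3},
]

# Overall task completion rate across test scenarios.
completion_rate = sum(r["task_completed"] for r in runs) / len(runs)

# Argument correctness per tool: fraction of calls with correct arguments.
per_tool = defaultdict(lambda: [0, 0])  # tool -> [correct, total]
for r in runs:
    for tool, args_ok in r["tool_calls"]:
        per_tool[tool][1] += 1
        per_tool[tool][0] += int(args_ok)
tool_accuracy = {tool: correct / total for tool, (correct, total) in per_tool.items()}

# Step efficiency versus the optimal path, averaged across scenarios.
avg_efficiency = sum(
    min(1.0, r["optimal_steps"] / r["steps"]) for r in runs
) / len(runs)
```

In practice these numbers come with the trace tree attached, so a low per-tool accuracy can be drilled into span by span.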

Execution Diagram

GitHub URL

Workflow Repository