
Workflow:Confident AI DeepEval AI Agent Evaluation

From Leeroopedia
Knowledge Sources
Domains: LLM_Evaluation, AI_Agents, Integration_Testing
Last Updated: 2026-02-14 09:00 GMT

Overview

End-to-end process for evaluating AI agents built with popular frameworks (LangChain, LangGraph, CrewAI, PydanticAI, OpenAI Agents, LlamaIndex) using DeepEval's instrumentation and agent-specific metrics.

Description

This workflow covers the evaluation of AI agent systems that use tools, make multi-step decisions, and may involve multiple agents. DeepEval provides framework-specific instrumentators or callback handlers that automatically trace agent execution (LLM calls, tool invocations, retriever queries, agent handoffs) without modifying the agent code. Agent-specific metrics like TaskCompletionMetric, ToolUseMetric, StepEfficiencyMetric, and PlanQualityMetric evaluate whether the agent correctly completes tasks, uses tools appropriately, and operates efficiently. The workflow supports both end-to-end agent evaluation and component-level evaluation of individual agent steps.

Usage

Execute this workflow when you have an AI agent application built with a supported framework and need to evaluate its task completion, tool usage correctness, or step efficiency. This applies to agents that make tool calls, handle multi-turn conversations, delegate to sub-agents, or follow conditional routing logic.

Execution Steps

Step 1: Set Up Framework Instrumentation

Install the DeepEval integration for your agent framework and configure the instrumentation. Each framework has a specific integration mechanism: callback handlers for LangChain/LangGraph, instrumentators for CrewAI/PydanticAI/OpenAI Agents, or custom wrappers.

Framework-specific setup:

  • LangChain/LangGraph: Use CallbackHandler from deepeval integrations
  • CrewAI: Use ConfidentInstrumentationSettings with the Crew
  • PydanticAI: Use ConfidentInstrumentationSettings at agent level
  • OpenAI Agents: Install DeepEvalTracingProcessor as the trace processor
  • LlamaIndex: Use the deepeval callback handler

Step 2: Define Agent-Specific Metrics

Select metrics appropriate for evaluating agent behavior. Agent metrics evaluate higher-level concerns than standard LLM metrics, focusing on whether the agent achieves its goal and uses its tools correctly.

Agent metric categories:

  • Task completion: TaskCompletionMetric evaluates whether the agent fulfilled its assigned task
  • Tool correctness: ToolUseMetric and ArgumentCorrectnessMetric evaluate tool selection and parameter accuracy
  • Efficiency: StepEfficiencyMetric evaluates whether the agent used an optimal number of steps
  • Planning: PlanQualityMetric and PlanAdherenceMetric evaluate the agent's execution plan
  • Goal accuracy: GoalAccuracyMetric evaluates whether the agent reached the desired outcome
  • MCP metrics: MCPUseMetric and MCPTaskCompletionMetric for MCP server usage
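To make two of these categories concrete, here are toy scorers in the spirit of the efficiency and tool-correctness metrics. These formulas are illustrative only, not DeepEval's implementations:

```python
def step_efficiency_score(actual_steps: int, optimal_steps: int) -> float:
    """Toy efficiency score: 1.0 when the agent takes the optimal number of
    steps, decaying toward 0 as it takes more than needed."""
    if actual_steps <= 0 or optimal_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, optimal_steps / actual_steps)

def tool_selection_score(called_tools: list[str], expected_tools: list[str]) -> float:
    """Toy tool-correctness score: fraction of the expected tools that the
    agent actually invoked (ignores call order and arguments)."""
    expected = set(expected_tools)
    if not expected:
        return 1.0
    return len(expected & set(called_tools)) / len(expected)
```

Argument-level checks (as in ArgumentCorrectnessMetric) would additionally compare the parameters of each call, which these sketches deliberately leave out.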

Step 3: Prepare Evaluation Dataset

Create an EvaluationDataset with Golden objects representing test scenarios for the agent. Each golden contains an input query and optionally an expected_output that describes the desired agent behavior or result.

Dataset construction:

  • Define diverse test scenarios covering different agent capabilities
  • Include edge cases (ambiguous queries, multi-step tasks, error scenarios)
  • Set expected_output to describe successful task completion criteria
  • For conversational agents, use ConversationalGolden with multi-turn scenarios

Step 4: Execute Agent Evaluation

Run the instrumented agent against the evaluation dataset. The instrumentation automatically captures traces of all agent steps (LLM calls, tool invocations, retriever queries). Metrics are evaluated either at the trace level (end-to-end) or at individual span levels (component-level).

Execution patterns:

  • Dataset iterator: Loop through dataset.evals_iterator() and invoke the agent
  • Metric collections: Assign metric_collection names to map metrics to specific components
  • Async support: Use asyncio.create_task for async agent invocations
  • Online evaluation: Use named metric collections for production monitoring

Step 5: Analyze Agent Performance

Review agent evaluation results including task completion rates, tool usage accuracy, and step efficiency scores. The trace tree shows the complete execution flow with each agent decision, tool call, and LLM invocation as separate spans with their own metrics.

Analysis dimensions:

  • Overall task completion rate across test scenarios
  • Tool usage patterns and argument correctness per tool
  • Step count efficiency compared to optimal paths
  • Plan quality and adherence scores
  • Framework-specific span details (agent handoffs, retriever results, etc.)
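The analysis dimensions above reduce to simple aggregations over per-scenario results. A sketch with an invented result schema (the field names are assumptions, not DeepEval's actual output format):

```python
from collections import defaultdict

# Hypothetical per-scenario results extracted from evaluation traces.
runs = [
    {"task_completed": True,  "tool_calls": [("search", True), ("book", True)],
     "steps": 4, "optimal_steps": 4},
    {"task_completed": False, "tool_calls": [("search", False)],
     "steps": 7, "optimal_steps": 3},
]

# Overall task completion rate across test scenarios.
completion_rate = sum(r["task_completed"] for r in runs) / len(runs)

# Argument correctness per tool: fraction of calls with correct arguments.
per_tool = defaultdict(lambda: [0, 0])  # tool -> [correct, total]
for r in runs:
    for tool, args_ok in r["tool_calls"]:
        per_tool[tool][1] += 1
        per_tool[tool][0] += int(args_ok)
tool_accuracy = {tool: correct / total for tool, (correct, total) in per_tool.items()}

# Step efficiency versus the optimal path, averaged across scenarios.
avg_efficiency = sum(
    min(1.0, r["optimal_steps"] / r["steps"]) for r in runs
) / len(runs)
```

In practice these numbers come with the trace tree attached, so a low per-tool accuracy can be drilled into span by span.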

Execution Diagram

GitHub URL

Workflow Repository