Implementation:Confident ai Deepeval ToolUseMetric

**Metadata**
Knowledge Sources	DeepEval
Domains	LLM_Evaluation AI_Agents
Last Updated	2026-02-14 09:00 GMT

Overview

Concrete evaluation metric class that measures whether an AI agent selects and uses appropriate tools during conversational interactions. The ToolUseMetric evaluates tool selection correctness and argument accuracy against a provided list of available tools, using an LLM-as-judge approach.

Description

The ToolUseMetric is a conversational metric that evaluates tool use across one or more turns of agent interaction. It requires a list of available tools (defined as ToolCall objects) that represents the ground truth set of tools the agent has access to. The metric then assesses whether the agent's actual tool invocations were appropriate for the given user inputs.

Key capabilities:

Tool selection assessment -- evaluates whether the agent chose the correct tool(s) from the available set.
Argument accuracy evaluation -- assesses whether tool call arguments were correct and well-formed.
Conversational context -- as a BaseConversationalMetric, it evaluates tool use across the full conversation rather than individual turns.
Reason generation -- produces human-readable explanations of tool use quality.
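
The tool selection idea above can be approximated with a cheap deterministic pre-check: confirm that every tool the agent invoked actually exists in the available set. The sketch below is not how ToolUseMetric scores a conversation (the metric relies on an LLM judge). `ToolSpec`, `AgentTurn`, and `unknown_tool_calls` are hypothetical stand-ins written for illustration and are not part of DeepEval.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToolSpec:
    """Hypothetical stand-in for a tool definition (name plus description)."""
    name: str
    description: str = ""


@dataclass
class AgentTurn:
    """Hypothetical stand-in for one assistant turn and the tool names it invoked."""
    content: str
    tools_called: List[str] = field(default_factory=list)


def unknown_tool_calls(available: List[ToolSpec], turns: List[AgentTurn]) -> List[str]:
    """Return invoked tool names that are not in the available set, in call order."""
    allowed: Set[str] = {tool.name for tool in available}
    return [name for turn in turns for name in turn.tools_called if name not in allowed]


available = [
    ToolSpec("search", "Search the web"),
    ToolSpec("calculator", "Perform arithmetic calculations"),
]
turns = [
    AgentTurn("Looking that up.", tools_called=["search"]),
    AgentTurn("Sending the summary.", tools_called=["send_email"]),
]

# "send_email" was invoked but is not in the available set.
print(unknown_tool_calls(available, turns))  # prints ['send_email']
```

A check like this only catches calls to tools outside the available set. Whether an in-set tool was the right choice for the user's request, and whether its arguments were correct, still requires the judgment the metric delegates to the LLM.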

Usage

Import the metric together with ToolCall, which is used to describe the available tools:

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall

Code Reference

Source Location

Repository: confident-ai/deepeval
File: deepeval/metrics/tool_use/tool_use.py (lines 31--437)

Signature

class ToolUseMetric(BaseConversationalMetric):
    def __init__(
        self,
        available_tools: List[ToolCall],
        threshold: float = 0.5,
        model: Optional[str] = None,
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        ...

Import

from deepeval.metrics import ToolUseMetric

Parent Class

BaseConversationalMetric

I/O Contract

Inputs (Constructor Parameters)

**Input Contract**
Name	Type	Default	Description
`available_tools`	List[ToolCall]	REQUIRED	List of available tools the agent can use. Each `ToolCall` includes a name and description. The list serves as ground truth for tool selection evaluation.
`threshold`	float	`0.5`	Minimum score (0--1) for the evaluation to pass.
`model`	Optional[str]	`None`	LLM model to use as the evaluation judge.
`include_reason`	bool	`True`	Whether to generate a human-readable reason for the score.
`async_mode`	bool	`True`	Whether to run evaluation asynchronously.
`strict_mode`	bool	`False`	When enabled, the score is binary (1 for a perfect result, 0 otherwise) and the threshold is overridden to 1.
`verbose_mode`	bool	`False`	When enabled, prints detailed evaluation information during execution.

Outputs

**Output Contract**
Name	Type	Description
score	float	A value between 0 and 1 indicating tool use quality (selection correctness and argument accuracy).
reason	Optional[str]	Human-readable explanation of the score (when `include_reason=True`).
success	bool	Whether the score meets or exceeds the threshold.
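
The relationship between score, threshold, and success in the table above can be sketched in plain Python. This is a minimal illustration, not DeepEval's code: `finalize` is a hypothetical helper, and the strict-mode behavior (threshold raised to 1, any lower score reported as 0) is an assumption based on how DeepEval documents strict mode for its metrics in general.

```python
from typing import Tuple


def finalize(raw_score: float, threshold: float = 0.5, strict_mode: bool = False) -> Tuple[float, bool]:
    """Return (score, success) for a raw judge score between 0 and 1.

    Hypothetical helper: in strict mode the threshold is assumed to become 1,
    and any score below it is reported as 0.
    """
    effective_threshold = 1.0 if strict_mode else threshold
    score = raw_score
    if strict_mode and raw_score < effective_threshold:
        score = 0.0
    # Success means the reported score meets or exceeds the effective threshold.
    return score, score >= effective_threshold


print(finalize(0.8, threshold=0.7))                    # passes: 0.8 >= 0.7
print(finalize(0.6, threshold=0.7))                    # fails: 0.6 < 0.7
print(finalize(0.8, threshold=0.7, strict_mode=True))  # fails: not a perfect score
```

Under these assumptions, a threshold such as 0.7 only matters when strict mode is off. With strict mode on, nothing short of a perfect score passes.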

Usage Examples

Example 1: Basic Tool Use Evaluation

Create a metric with available tools for evaluating agent tool selection.

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall

tools = [
    ToolCall(name="search", description="Search the web"),
    ToolCall(name="calculator", description="Perform arithmetic calculations"),
    ToolCall(name="weather", description="Get current weather for a location"),
]
metric = ToolUseMetric(available_tools=tools, threshold=0.7)

The available_tools parameter defines the ground truth set of tools the agent has access to.
The metric will evaluate whether the agent selected appropriate tools from this set.

Example 2: Integration with Framework Instrumentation

Use with a LangChain callback handler for automatic tool use evaluation.

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
from deepeval.integrations.langchain import CallbackHandler

tools = [ToolCall(name="search", description="Search the web")]
metric = ToolUseMetric(available_tools=tools)
handler = CallbackHandler(metrics=[metric], name="search-agent")
# `agent` is assumed to be a LangChain agent or runnable constructed elsewhere.
agent.invoke({"input": "Find information about Python"}, config={"callbacks": [handler]})

Related Pages

Principle:Confident_ai_Deepeval_Tool_Use_Evaluation
