Implementation:Confident ai Deepeval ToolUseMetric
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
Concrete evaluation metric class that measures whether an AI agent selects and uses appropriate tools during conversational interactions. The ToolUseMetric evaluates tool selection correctness and argument accuracy against a provided list of available tools, using an LLM-as-judge approach.
Description
The ToolUseMetric is a conversational metric that evaluates tool use across one or more turns of agent interaction. It requires a list of available tools (defined as ToolCall objects) that represents the ground truth set of tools the agent has access to. The metric then assesses whether the agent's actual tool invocations were appropriate for the given user inputs.
Key capabilities:
- Tool selection assessment -- evaluates whether the agent chose the correct tool(s) from the available set.
- Argument accuracy evaluation -- assesses whether tool call arguments were correct and well-formed.
- Conversational context -- as a BaseConversationalMetric, it evaluates tool use across the full conversation rather than individual turns.
- Reason generation -- produces human-readable explanations of tool use quality.
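The metric delegates the actual judgement to an LLM, but the role of the available-tools list can be illustrated with a deterministic check. The sketch below is not part of deepeval and does not reproduce the metric's scoring; `ToolSpec`, `Invocation`, and `unknown_tools` are hypothetical names used only to show what "selected from the available set" means.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for an available tool (name plus description)."""
    name: str
    description: str


@dataclass
class Invocation:
    """Hypothetical record of one tool call made by the agent."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def unknown_tools(available: List[ToolSpec], calls: List[Invocation]) -> List[str]:
    """Return the names of invoked tools that are not in the available set, in call order."""
    allowed = {tool.name for tool in available}
    return [call.name for call in calls if call.name not in allowed]


available = [
    ToolSpec("search", "Search the web"),
    ToolSpec("calculator", "Perform arithmetic calculations"),
]
calls = [
    Invocation("calculator", {"expression": "2 + 2"}),
    Invocation("translate", {"text": "hello"}),  # not an available tool
]
print(unknown_tools(available, calls))  # ['translate']
```

A check like this only catches calls to tools outside the set. Whether an in-set tool was the right choice for the user's request, and whether its arguments were sensible, is the part the metric leaves to the LLM judge.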
Usage
Import the metric together with ToolCall, which is used to describe the available tools:
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
Code Reference
Source Location
- Repository: confident-ai/deepeval
- File: deepeval/metrics/tool_use/tool_use.py (lines 31--437)
Signature
class ToolUseMetric(BaseConversationalMetric):
    def __init__(
        self,
        available_tools: List[ToolCall],
        threshold: float = 0.5,
        model: Optional[str] = None,
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        ...
Import
from deepeval.metrics import ToolUseMetric
Parent Class
BaseConversationalMetric
I/O Contract
Inputs (Constructor Parameters)
| Name | Type | Default | Description |
|---|---|---|---|
| available_tools | List[ToolCall] | REQUIRED | List of available tools the agent can use. Each ToolCall includes a name and description. This serves as ground truth for tool selection evaluation. |
| threshold | float | 0.5 | Minimum score (0--1) for the evaluation to pass. |
| model | Optional[str] | None | LLM model to use as the evaluation judge. |
| include_reason | bool | True | Whether to generate a human-readable reason for the score. |
| async_mode | bool | True | Whether to run evaluation asynchronously. |
| strict_mode | bool | False | When enabled, scores are binarized to 0 or 1 based on the threshold. |
| verbose_mode | bool | False | When enabled, prints detailed evaluation information during execution. |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A value between 0 and 1 indicating tool use quality (selection correctness and argument accuracy). |
| reason | Optional[str] | Human-readable explanation of the score (when include_reason=True). |
| success | bool | Whether the score meets or exceeds the threshold. |
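How score, threshold, and strict_mode relate can be sketched in plain Python. This follows the parameter descriptions in the tables above rather than the library's source, so the exact strict-mode behaviour in deepeval may differ; `MetricOutcome` and `finalize` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricOutcome:
    """Hypothetical container mirroring the three documented outputs."""
    score: float
    reason: Optional[str]
    success: bool


def finalize(
    raw_score: float,
    threshold: float = 0.5,
    strict_mode: bool = False,
    include_reason: bool = True,
    reason: Optional[str] = None,
) -> MetricOutcome:
    """Apply the documented threshold and strict-mode rules to a raw judge score."""
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("raw_score must lie in [0, 1]")
    score = raw_score
    if strict_mode:
        # Assumption taken from the table: binarize against the threshold.
        score = 1.0 if raw_score >= threshold else 0.0
    return MetricOutcome(
        score=score,
        reason=reason if include_reason else None,
        success=score >= threshold,
    )


print(finalize(0.8, threshold=0.7))
print(finalize(0.6, threshold=0.7, strict_mode=True))
```

In this sketch a raw score of 0.8 passes a 0.7 threshold unchanged, while 0.6 under strict mode collapses to 0.0 and fails.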
Usage Examples
Example 1: Basic Tool Use Evaluation
Create a metric with available tools for evaluating agent tool selection.
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
tools = [
    ToolCall(name="search", description="Search the web"),
    ToolCall(name="calculator", description="Perform arithmetic calculations"),
    ToolCall(name="weather", description="Get current weather for a location"),
]
metric = ToolUseMetric(available_tools=tools, threshold=0.7)
- The available_tools parameter defines the ground truth set of tools the agent has access to.
- The metric will evaluate whether the agent selected appropriate tools from this set.
Example 2: Integration with Framework Instrumentation
Use with a LangChain callback handler for automatic tool use evaluation.
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
from deepeval.integrations.langchain import CallbackHandler
tools = [ToolCall(name="search", description="Search the web")]
metric = ToolUseMetric(available_tools=tools)
handler = CallbackHandler(metrics=[metric], name="search-agent")
# `agent` is assumed to be a LangChain agent or runnable constructed elsewhere.
agent.invoke({"input": "Find information about Python"}, config={"callbacks": [handler]})