
Implementation:Confident ai Deepeval ToolUseMetric

From Leeroopedia
Metadata
Last Updated 2026-02-14 09:00 GMT

Overview

Concrete evaluation metric class that measures whether an AI agent selects and uses appropriate tools during conversational interactions. The ToolUseMetric evaluates tool selection correctness and argument accuracy against a provided list of available tools, using an LLM-as-judge approach.

Description

The ToolUseMetric is a conversational metric that evaluates tool use across one or more turns of agent interaction. It requires a list of available tools (defined as ToolCall objects) that represents the ground truth set of tools the agent has access to. The metric then assesses whether the agent's actual tool invocations were appropriate for the given user inputs.

Key capabilities:

  • Tool selection assessment: evaluates whether the agent chose the correct tool(s) from the available set.
  • Argument accuracy evaluation: assesses whether tool call arguments were correct and well-formed.
  • Conversational context: as a BaseConversationalMetric, it evaluates tool use across the full conversation rather than individual turns.
  • Reason generation: produces human-readable explanations of tool use quality.
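
The metric itself relies on an LLM judge. As a rough complement, a deterministic pre-check can flag calls to tools that are not in the available set before any judge runs. The sketch below is not part of deepeval: `ToolSpec`, `ToolInvocation`, and `unknown_tool_calls` are hypothetical names used only for illustration, and the check covers set membership only, not argument quality.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolSpec:
    """Hypothetical stand-in for an available tool (name plus description)."""
    name: str
    description: str


@dataclass
class ToolInvocation:
    """Hypothetical stand-in for one tool call made by the agent."""
    name: str
    arguments: Dict[str, object] = field(default_factory=dict)


def unknown_tool_calls(
    available: List[ToolSpec], invoked: List[ToolInvocation]
) -> List[str]:
    """Return the names of invoked tools missing from the available set, in call order."""
    known = {tool.name for tool in available}
    return [call.name for call in invoked if call.name not in known]


available = [
    ToolSpec("search", "Search the web"),
    ToolSpec("calculator", "Perform arithmetic calculations"),
]
invoked = [
    ToolInvocation("search", {"query": "python"}),
    ToolInvocation("translate", {"text": "hola"}),
]
print(unknown_tool_calls(available, invoked))  # ['translate']
```

A check like this catches only calls outside the declared tool set; judging whether an in-set tool was the right choice for the user's request is what the LLM-as-judge step is for.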

Usage

Import and instantiate with available tools:

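from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
# Declare the tools the agent may call, then build the metric from them.
metric = ToolUseMetric(available_tools=[ToolCall(name="search", description="Search the web")])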

Code Reference

Source Location

  • Repository: confident-ai/deepeval
  • File: deepeval/metrics/tool_use/tool_use.py (lines 31-437)

Signature

class ToolUseMetric(BaseConversationalMetric):
    def __init__(
        self,
        available_tools: List[ToolCall],
        threshold: float = 0.5,
        model: Optional[str] = None,
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        ...

Import

from deepeval.metrics import ToolUseMetric

Parent Class

  • BaseConversationalMetric

I/O Contract

Inputs (Constructor Parameters)

Input Contract
| Name | Type | Default | Description |
|---|---|---|---|
| available_tools | List[ToolCall] | REQUIRED | List of available tools the agent can use. Each ToolCall includes a name and description. This serves as ground truth for tool selection evaluation. |
| threshold | float | 0.5 | Minimum score (0 to 1) for the evaluation to pass. |
| model | Optional[str] | None | Model to use as the LLM evaluation judge. |
| include_reason | bool | True | Whether to generate a human-readable reason for the score. |
| async_mode | bool | True | Whether to run evaluation asynchronously. |
| strict_mode | bool | False | When enabled, the score is binary (1 for a perfect result, 0 otherwise) and the threshold is overridden to 1. |
| verbose_mode | bool | False | When enabled, prints detailed evaluation information during execution. |

Outputs

Output Contract
| Name | Type | Description |
|---|---|---|
| score | float | A value between 0 and 1 indicating tool use quality (selection correctness and argument accuracy). |
| reason | Optional[str] | Human-readable explanation of the score (when include_reason=True). |
| success | bool | Whether the score meets or exceeds the threshold. |
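
The relationship between the three outputs and the constructor flags can be sketched in plain Python. This is not deepeval's implementation: `ToolUseResult` and `build_result` are hypothetical names, and the strict-mode branch assumes the convention deepeval documents for its metrics (threshold raised to 1, score collapsed to 0 or 1).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolUseResult:
    """Hypothetical container mirroring the score, reason, and success outputs."""
    score: float
    reason: Optional[str]
    success: bool


def build_result(
    score: float,
    threshold: float = 0.5,
    strict_mode: bool = False,
    reason: Optional[str] = None,
    include_reason: bool = True,
) -> ToolUseResult:
    """Derive the output fields from a raw judge score in the range 0 to 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    if strict_mode:
        # Assumption: strict mode overrides the threshold with 1 and
        # turns any score below a perfect 1 into 0.
        threshold = 1.0
        score = 1.0 if score >= threshold else 0.0
    return ToolUseResult(
        score=score,
        reason=reason if include_reason else None,  # dropped when not requested
        success=score >= threshold,  # passes when the score meets the threshold
    )


lenient = build_result(0.8, threshold=0.7, reason="Correct tool, minor argument issue")
strict = build_result(0.8, threshold=0.7, strict_mode=True)
print(lenient.success, strict.score, strict.success)  # True 0.0 False
```

Under these assumptions, the same raw score of 0.8 passes a 0.7 threshold in the default mode but fails in strict mode, where only a perfect score passes.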

Usage Examples

Example 1: Basic Tool Use Evaluation

Create a metric with available tools for evaluating agent tool selection.

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall

tools = [
    ToolCall(name="search", description="Search the web"),
    ToolCall(name="calculator", description="Perform arithmetic calculations"),
    ToolCall(name="weather", description="Get current weather for a location"),
]
metric = ToolUseMetric(available_tools=tools, threshold=0.7)

  • The available_tools parameter defines the ground truth set of tools the agent has access to.
  • The metric will evaluate whether the agent selected appropriate tools from this set.

Example 2: Integration with Framework Instrumentation

Use with a LangChain callback handler for automatic tool use evaluation.

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall
from deepeval.integrations.langchain import CallbackHandler

tools = [ToolCall(name="search", description="Search the web")]
metric = ToolUseMetric(available_tools=tools)
handler = CallbackHandler(metrics=[metric], name="search-agent")
# `agent` is assumed to be an existing LangChain agent or runnable built elsewhere.
agent.invoke({"input": "Find information about Python"}, config={"callbacks": [handler]})
