Implementation:Arize ai Phoenix ToolSelectionEvaluator

Overview

ToolSelectionEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that determines whether an AI agent selected the correct tool for a given context. It extends ClassificationEvaluator and uses a judge LLM to assess the appropriateness of tool selection from a set of available tools.

Description

The ToolSelectionEvaluator focuses on the tool choice aspect of agent evaluation. It does not evaluate whether the tool was invoked correctly (see ToolInvocationEvaluator) or how the tool's response was handled (see ToolResponseHandlingEvaluator). Instead, it assesses whether the agent chose the right tool(s) from the available options based on the conversation context.

Key evaluation aspects:

Was the selected tool appropriate for the user's query?
Did the agent select the most relevant tool when multiple options were available?
If multiple tools were selected, were all selections justified by the context?

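The evaluator loads its configuration (name, prompt messages, classification choices, and optimization direction) from TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG. The constructor takes a single parameter: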

Parameter	Type	Description
`llm`	`LLM`	The LLM instance to use as the judge for evaluation. Must support tool calling or structured output.

Usage

from phoenix.evals.metrics import ToolSelectionEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = ToolSelectionEvaluator(llm=llm)

Code Reference

Property	Value
Source File	packages/phoenix-evals/src/phoenix/evals/metrics/tool_selection.py
Module	`phoenix.evals.metrics.tool_selection`
Class	`ToolSelectionEvaluator(ClassificationEvaluator)`
Lines	~72
Kind	`"llm"`
Direction	Loaded from config (maximize)
Domain	LLM Evaluation, Metrics, Agent Evaluation

Class Attributes

Attribute	Description
`NAME`	The evaluator name, loaded from `TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.name`.
`PROMPT`	A `PromptTemplate` built from the config's messages.
`CHOICES`	Classification labels (correct, incorrect) from the config.
`DIRECTION`	Optimization direction from the config.

Input Schema

Defined by the inner class ToolSelectionInputSchema(BaseModel):

Field	Type	Description
`input`	`str`	The input query or conversation.
`available_tools`	`str`	The tools the agent could choose from, rendered as a single string (for example, one tool name and description per line).
`tool_selection`	`str`	The tool or tools selected by the agent.
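
Because `available_tools` and `tool_selection` are both plain strings, structured tool definitions have to be flattened before they reach the evaluator. The following is a minimal sketch of one way to do that, assuming tools are kept as a name-to-description mapping. The helpers `format_available_tools` and `format_tool_selection` are hypothetical and not part of Phoenix; the output format simply mirrors the "Name: description" lines used in the usage examples on this page.

```python
from typing import Mapping, Optional


def format_available_tools(tools: Mapping[str, str]) -> str:
    """Render each tool as one 'Name: description' line (hypothetical helper)."""
    return "\n".join(f"{name}: {description}" for name, description in tools.items())


def format_tool_selection(
    name: str, arguments: Optional[Mapping[str, object]] = None
) -> str:
    """Render a selection as 'Name' or 'Name(arg=value, ...)' (hypothetical helper)."""
    if not arguments:
        return name  # arguments are optional, so the bare tool name is enough
    rendered = ", ".join(f"{key}={value!r}" for key, value in arguments.items())
    return f"{name}({rendered})"


tools = {
    "WeatherTool": "Get the current weather for a location.",
    "MusicTool": "Create playlists, search for music, and check trends.",
}

# All three fields end up as strings, matching ToolSelectionInputSchema.
eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": format_available_tools(tools),
    "tool_selection": format_tool_selection("WeatherTool", {"location": "San Francisco"}),
}
```

The resulting `eval_input` dictionary has the same shape as the dictionaries passed to `evaluate` elsewhere on this page.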

I/O Contract

Input

Field	Type	Required	Description
`input`	`str`	Yes	The user query or conversation context.
`available_tools`	`str`	Yes	A description of available tools (names and descriptions).
`tool_selection`	`str`	Yes	The tool(s) selected by the agent, either a single tool or several. Call arguments may be included but are not required.

Output

Returns a list containing one Score object with the following fields:

Field	Description
`name`	The evaluator name (e.g., `"tool_selection"`).
`score`	`1.0` if correct, `0.0` if incorrect.
`label`	The classification label (`"correct"` or `"incorrect"`).
`explanation`	An explanation from the LLM judge.
`metadata`	Dictionary containing the model name used for evaluation.
`kind`	`"llm"`
`direction`	The optimization direction (maximize).
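
When many evaluations are run, the per-call scores are usually rolled up into a single figure. The sketch below shows one way to do that. It uses `ScoreStandIn`, a hypothetical dataclass that only mirrors the fields in the table above so the example runs without Phoenix installed; it is not the library's `Score` class, and `selection_accuracy` is likewise an illustrative helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ScoreStandIn:
    """Hypothetical stand-in that mirrors the Score fields listed above."""

    name: str
    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "maximize"


def selection_accuracy(scores: List[ScoreStandIn]) -> float:
    """Mean score over results that carry a score.

    With 1.0/0.0 scores this equals the fraction judged correct.
    Returns 0.0 when nothing was scored.
    """
    scored = [s.score for s in scores if s.score is not None]
    if not scored:
        return 0.0
    return sum(scored) / len(scored)


results = [
    ScoreStandIn(name="tool_selection", score=1.0, label="correct"),
    ScoreStandIn(name="tool_selection", score=0.0, label="incorrect"),
    ScoreStandIn(name="tool_selection", score=1.0, label="correct"),
]
accuracy = selection_accuracy(results)  # two of three selections judged correct
```

Since the direction is maximize, a higher mean indicates better tool selection across the evaluated set.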

Usage Examples

Correct Tool Selection

from phoenix.evals.metrics.tool_selection import ToolSelectionEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
tool_selection_eval = ToolSelectionEvaluator(llm=llm)

eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "NewsTool: Stay connected to global events with up-to-date news.\n"
        "MusicTool: Create playlists, search for music, and check trends."
    ),
    "tool_selection": "WeatherTool(location='San Francisco')",
}
scores = tool_selection_eval.evaluate(eval_input)
print(scores)
# Expected: score=1.0, label='correct'

Incorrect Tool Selection

eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "NewsTool: Stay connected to global events with up-to-date news.\n"
        "MusicTool: Create playlists, search for music, and check trends."
    ),
    "tool_selection": "NewsTool(query='San Francisco weather')",
}
scores = tool_selection_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect'

Tool Selection Without Arguments

eval_input = {
    "input": "User: Play some jazz music",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "MusicTool: Create playlists, search for music, and check trends."
    ),
    "tool_selection": "MusicTool",  # input arguments are optional
}
scores = tool_selection_eval.evaluate(eval_input)
# Expected: score=1.0, label='correct'
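
Checking Expected Labels Across Cases

The examples above each state an expected label in a comment. A small harness can make that check explicit by comparing the judge's label against an expected label per case. This is a sketch under stated assumptions: `find_mismatches` and `CannedEvaluator` are hypothetical names, and `CannedEvaluator` is an offline stand-in that returns pre-set labels so the example runs without an API key. In real use, a `ToolSelectionEvaluator` instance would be passed in its place, relying only on `evaluate` returning a list with one score object that has a `label` field, as described in the I/O contract.

```python
from types import SimpleNamespace


class CannedEvaluator:
    """Hypothetical offline stand-in: returns a pre-set label per tool selection."""

    def __init__(self, labels_by_selection):
        self._labels = labels_by_selection

    def evaluate(self, eval_input):
        label = self._labels[eval_input["tool_selection"]]
        score = 1.0 if label == "correct" else 0.0
        return [SimpleNamespace(name="tool_selection", score=score, label=label)]


def find_mismatches(evaluator, cases):
    """Return (input, expected_label, actual_label) for every disagreeing case."""
    mismatches = []
    for eval_input, expected_label in cases:
        score = evaluator.evaluate(eval_input)[0]  # one score per evaluation
        if score.label != expected_label:
            mismatches.append((eval_input["input"], expected_label, score.label))
    return mismatches


available = (
    "WeatherTool: Get the current weather for a location.\n"
    "NewsTool: Stay connected to global events with up-to-date news."
)
question = "User: What is the weather in San Francisco?"
cases = [
    ({"input": question, "available_tools": available,
      "tool_selection": "WeatherTool(location='San Francisco')"}, "correct"),
    ({"input": question, "available_tools": available,
      "tool_selection": "NewsTool(query='San Francisco weather')"}, "incorrect"),
]

# The canned labels deliberately disagree on the second case to show a mismatch.
stub = CannedEvaluator({
    "WeatherTool(location='San Francisco')": "correct",
    "NewsTool(query='San Francisco weather')": "correct",
})
mismatches = find_mismatches(stub, cases)
```

Because an LLM judge is not guaranteed to return the same label on every run, mismatches reported by a harness like this are best treated as cases to review rather than as hard failures.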

Related Pages

Arize_ai_Phoenix_ToolInvocationEvaluator -- Evaluates the correctness of tool invocation arguments and formatting.
Arize_ai_Phoenix_ToolResponseHandlingEvaluator -- Evaluates how the agent handled the tool's response.
Arize_ai_Phoenix_CorrectnessEvaluator -- General-purpose LLM-based correctness evaluation.
Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.
