Implementation:Arize ai Phoenix ToolInvocationEvaluator
Overview
ToolInvocationEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that determines whether an AI agent's tool invocation was performed correctly. It extends ClassificationEvaluator and uses a judge LLM to assess the correctness of tool arguments, formatting, and safety -- distinct from whether the correct tool was selected.
Description
The ToolInvocationEvaluator focuses specifically on the invocation quality of tool calls made by AI agents. It does not evaluate whether the right tool was chosen (see ToolSelectionEvaluator) or how the tool's response was handled (see ToolResponseHandlingEvaluator). Instead, it assesses:
Criteria for Correct Invocation:
- JSON is properly structured (if applicable).
- All required fields/parameters are present.
- No hallucinated or nonexistent fields (all fields exist in the tool schema).
- Argument values match the user query and schema expectations.
- No unsafe content (e.g., PII) in arguments.
Criteria for Incorrect Invocation:
- Hallucinated or nonexistent fields not in the schema.
- Missing required fields/parameters.
- Improperly formatted or malformed JSON.
- Incorrect, hallucinated, or mismatched argument values.
- Unsafe content (e.g., PII, sensitive data) in arguments.
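The structural criteria above (well-formed JSON, required fields present, no fields outside the schema) can be illustrated with a small deterministic check. This is only a sketch for intuition: the evaluator itself delegates the judgment to an LLM, and the function name `structural_issues` and the sample `BOOK_FLIGHT_SCHEMA` are hypothetical, not part of arize-phoenix-evals. Value correctness and unsafe content are not covered here, since those need the judge.

```python
import json

# Hypothetical tool schema used only for this illustration.
BOOK_FLIGHT_SCHEMA = {
    "name": "book_flight",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "destination", "date"],
    },
}


def structural_issues(tool_schema: dict, arguments_json: str) -> list:
    """Return structural problems in a tool call's arguments (empty list if none)."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    params = tool_schema.get("parameters", {})
    known = set(params.get("properties", {}))
    required = set(params.get("required", []))
    provided = set(args)

    # Missing required fields first, then fields that do not exist in the schema.
    issues = [f"missing required field: {name}" for name in sorted(required - provided)]
    issues += [f"unknown field: {name}" for name in sorted(provided - known)]
    return issues


print(structural_issues(BOOK_FLIGHT_SCHEMA, '{"origin": "NYC", "seat": "12A"}'))
```

A check like this could run as a cheap pre-filter before the LLM judge, but it is no substitute for the semantic criteria.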
The evaluator loads its configuration (name, prompt messages, choices, and direction) from TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG. Its constructor takes one parameter:
| Parameter | Type | Description |
|---|---|---|
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |
Usage
from phoenix.evals.metrics import ToolInvocationEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = ToolInvocationEvaluator(llm=llm)
Code Reference
| Property | Value |
|---|---|
| Source File | packages/phoenix-evals/src/phoenix/evals/metrics/tool_invocation.py |
| Module | phoenix.evals.metrics.tool_invocation |
| Class | ToolInvocationEvaluator(ClassificationEvaluator) |
| Lines | ~121 |
| Kind | "llm" |
| Direction | Loaded from config (maximize) |
| Domain | LLM Evaluation, Metrics, Agent Evaluation |
Class Attributes
| Attribute | Description |
|---|---|
| NAME | The evaluator name, loaded from TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels (correct, incorrect) from the config. |
| DIRECTION | Optimization direction from the config. |
Input Schema
Defined by the inner class ToolInvocationInputSchema(BaseModel):
| Field | Type | Description |
|---|---|---|
| input | str | The input query or conversation context. |
| available_tools | str | The available tool schemas, either as JSON schema or human-readable format. |
| tool_selection | str | The tool invocation(s) made by the LLM, including arguments. |
I/O Contract
Input
| Field | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | The user query or conversation context that triggered the tool call. |
| available_tools | str | Yes | Description of available tools and their schemas. Accepts JSON schema format or human-readable descriptions. |
| tool_selection | str | Yes | The tool invocation(s) made by the agent, including the function name and arguments. |
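Because all three input fields are plain strings, structured tool definitions and tool calls have to be serialized before evaluation. The following is a minimal sketch of one way to do that with the standard library; the helpers `format_tool_call` and `build_eval_input` and the sample `TOOLS` list are hypothetical names introduced here, not part of the package.

```python
import json


def format_tool_call(tool_name: str, arguments: dict) -> str:
    """Render a tool call as a call-style string, e.g. f(a="x", b=1)."""
    rendered = ", ".join(f"{key}={json.dumps(value)}" for key, value in arguments.items())
    return f"{tool_name}({rendered})"


def build_eval_input(user_query: str, tools: list, tool_name: str, arguments: dict) -> dict:
    """Assemble the three string fields described in the input table."""
    return {
        "input": f"User: {user_query}",
        "available_tools": json.dumps(tools, indent=2),  # JSON schema format
        "tool_selection": format_tool_call(tool_name, arguments),
    }


# Hypothetical tool definition used only for this illustration.
TOOLS = [
    {
        "name": "book_flight",
        "description": "Book a flight between two cities",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["origin", "destination", "date"],
        },
    }
]

eval_input = build_eval_input(
    "Book a flight from NYC to LA for tomorrow",
    TOOLS,
    "book_flight",
    {"origin": "NYC", "destination": "LA", "date": "2024-01-15"},
)
print(eval_input["tool_selection"])
```

The resulting dictionary has the same shape as the one passed to evaluate() in the usage examples.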
Output
Returns a list containing one Score object with the following fields:
| Field | Description |
|---|---|
| name | The evaluator name (e.g., "tool_invocation"). |
| score | 1.0 if correct, 0.0 if incorrect. |
| label | The classification label ("correct" or "incorrect"). |
| explanation | An explanation from the LLM judge. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | The optimization direction (maximize). |
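To show how the fields above might be consumed, here is a small sketch that formats a score into a one-line summary. `StandInScore` is a hypothetical dataclass that only mirrors the field names in the table so the example runs without an LLM call; the real Score class comes from the package and may differ in details such as optional fields.

```python
from dataclasses import dataclass, field


@dataclass
class StandInScore:
    """Hypothetical stand-in mirroring the documented Score fields."""
    name: str
    score: float
    label: str
    explanation: str
    metadata: dict = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "maximize"


def summarize(score) -> str:
    """Build a one-line, human-readable summary from a score-like object."""
    return f"{score.name}: {score.label} ({score.score:.1f}) - {score.explanation}"


example = StandInScore(
    name="tool_invocation",
    score=1.0,
    label="correct",
    explanation="All required fields are present and match the query.",
)
print(summarize(example))
```

With a real result, the same function would presumably be called as summarize(scores[0]), since the evaluator returns a list containing one Score.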
Usage Examples
JSON Schema Format for Available Tools
from phoenix.evals.metrics.tool_invocation import ToolInvocationEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)
eval_input = {
    "input": "User: Book a flight from NYC to LA for tomorrow",
    "available_tools": '''
    {
        "name": "book_flight",
        "description": "Book a flight between two cities",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city code"},
                "destination": {"type": "string", "description": "Arrival city code"},
                "date": {"type": "string", "description": "Flight date in YYYY-MM-DD"}
            },
            "required": ["origin", "destination", "date"]
        }
    }
    ''',
    "tool_selection": 'book_flight(origin="NYC", destination="LA", date="2024-01-15")',
}
scores = tool_invocation_eval.evaluate(eval_input)
print(scores)
Human-Readable Format for Available Tools
eval_input = {
    "input": "User: What's the weather in San Francisco?",
    "available_tools": '''
    WeatherTool:
        Description: Get the current weather for a location
        Parameters:
            - location (required): The city name or coordinates
            - units (optional): Temperature units (celsius or fahrenheit)
    ''',
    "tool_selection": "WeatherTool(location='San Francisco', units='fahrenheit')",
}
scores = tool_invocation_eval.evaluate(eval_input)
print(scores)
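When several tool calls need to be checked, the single-input pattern above can be wrapped in a loop. The sketch below computes the fraction of calls judged correct. It relies only on the evaluate() call and the score field shown on this page; `pass_rate` and `StubEvaluator` are hypothetical names, and the stub exists solely so the example runs without an API key.

```python
from types import SimpleNamespace


def pass_rate(evaluator, eval_inputs) -> float:
    """Fraction of inputs whose first returned score equals 1.0."""
    if not eval_inputs:
        return 0.0
    passed = 0
    for item in eval_inputs:
        scores = evaluator.evaluate(item)  # list containing one score object
        if scores and scores[0].score == 1.0:
            passed += 1
    return passed / len(eval_inputs)


class StubEvaluator:
    """Hypothetical stand-in: treats any non-empty tool_selection as correct."""

    def evaluate(self, eval_input):
        ok = bool(eval_input["tool_selection"].strip())
        return [SimpleNamespace(score=1.0 if ok else 0.0,
                                label="correct" if ok else "incorrect")]


batch = [
    {"input": "q1", "available_tools": "...", "tool_selection": "WeatherTool(location='Paris')"},
    {"input": "q2", "available_tools": "...", "tool_selection": ""},
]
print(pass_rate(StubEvaluator(), batch))
```

In practice the stub would be replaced by a ToolInvocationEvaluator instance, at the cost of one judge call per input.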
Related Pages
- Arize_ai_Phoenix_ToolSelectionEvaluator -- Evaluates whether the correct tool was selected (complementary to invocation evaluation).
- Arize_ai_Phoenix_ToolResponseHandlingEvaluator -- Evaluates how the agent handled the tool's response.
- Arize_ai_Phoenix_CorrectnessEvaluator -- General-purpose LLM-based correctness evaluation.
- Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.