Implementation:Arize ai Phoenix ToolResponseHandlingEvaluator
Overview
ToolResponseHandlingEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that determines whether an AI agent properly handled a tool's response. It extends ClassificationEvaluator and evaluates the agent's post-tool-call behavior, including error handling, data extraction, transformation, and safe information disclosure.
Description
The ToolResponseHandlingEvaluator focuses on what happens after a tool returns its result. It does not evaluate whether the correct tool was selected (see ToolSelectionEvaluator) or whether the tool was invoked correctly (see ToolInvocationEvaluator). Instead, it assesses how the agent processed the tool's output to produce its final response.
Key aspects evaluated include:
- Data extraction -- Did the agent correctly extract information from the tool result?
- Data transformation -- Did the agent properly format or transform the data for the user?
- Error handling -- Did the agent handle tool errors or edge cases appropriately?
- Information safety -- Did the agent avoid disclosing sensitive information?
- Hallucination avoidance -- Did the agent refrain from adding information not present in the tool result?
The evaluator loads its configuration from TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.
| Parameter | Type | Description |
|---|---|---|
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |
Usage
from phoenix.evals.metrics import ToolResponseHandlingEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = ToolResponseHandlingEvaluator(llm=llm)
Code Reference
| Property | Value |
|---|---|
| Source File | packages/phoenix-evals/src/phoenix/evals/metrics/tool_response_handling.py |
| Module | phoenix.evals.metrics.tool_response_handling |
| Class | ToolResponseHandlingEvaluator(ClassificationEvaluator) |
| Lines | ~101 |
| Kind | "llm" |
| Direction | Loaded from config (maximize) |
| Domain | LLM Evaluation, Metrics, Agent Evaluation |
Class Attributes
| Attribute | Description |
|---|---|
| NAME | The evaluator name, loaded from TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels (correct, incorrect) from the config. |
| DIRECTION | Optimization direction from the config. |
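The relationship between CHOICES and the numeric score can be sketched as follows. This is a minimal illustration, assuming the labels map to the scores listed in the Output section (correct to 1.0, incorrect to 0.0). The names CHOICES_SKETCH and label_to_score are hypothetical and not part of the library, and the real config may store its labels in a different shape.

```python
# Hypothetical stand-in for the label-to-score mapping described on this page.
# The real evaluator loads its choices from its config object.
CHOICES_SKETCH = {"correct": 1.0, "incorrect": 0.0}


def label_to_score(label: str) -> float:
    """Map a classification label to its numeric score."""
    normalized = label.strip().lower()
    if normalized not in CHOICES_SKETCH:
        raise ValueError(f"Unexpected label: {label!r}")
    return CHOICES_SKETCH[normalized]


print(label_to_score("correct"))
print(label_to_score("incorrect"))
```

Rejecting unknown labels, rather than defaulting them to 0.0, keeps a malformed judge response from being counted silently as a failure.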
Input Schema
Defined by the inner class ToolResponseHandlingInputSchema(BaseModel):
| Field | Type | Description |
|---|---|---|
| input | str | The user query or conversation context. |
| tool_call | str | The tool invocation(s) made by the agent, including arguments. |
| tool_result | str | The tool's response (data, errors, or partial results). |
| output | str | The agent's handling after receiving the tool result (may include retries, follow-ups, or final response). |
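Because all four fields are plain strings, a record can be checked before it is sent to the judge. The sketch below is a standard-library approximation of that check, not the library's own validation (which the page attributes to a BaseModel subclass). REQUIRED_FIELDS and find_input_problems are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List

# Field names taken from the Input Schema table above.
REQUIRED_FIELDS = ("input", "tool_call", "tool_result", "output")


def find_input_problems(eval_input: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks valid."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        if field_name not in eval_input:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(eval_input[field_name], str):
            type_name = type(eval_input[field_name]).__name__
            problems.append(f"field {field_name} must be str, got {type_name}")
    return problems


# A record with a dict-valued tool_result and no output field.
candidate = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": {"temperature": 58, "conditions": "cloudy"},
}
print(find_input_problems(candidate))
```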
I/O Contract
Input
| Field | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | The user query or conversation context that triggered the tool call. |
| tool_call | str | Yes | The tool invocation(s) made by the agent, including arguments. |
| tool_result | str | Yes | The raw response from the tool (data, errors, or partial results). |
| output | str | Yes | The agent's final output or action after processing the tool result. |
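Agent traces often hold tool calls and results as structured objects rather than strings. The sketch below shows one way to flatten them into the four string fields. The build_eval_input helper and the JSON layout it uses for tool_call are assumptions made for this illustration. The examples on this page use a call-style string such as get_weather(location="Seattle") instead.

```python
import json
from typing import Any, Dict


def build_eval_input(
    user_query: str,
    tool_name: str,
    tool_args: Dict[str, Any],
    tool_result: Any,
    agent_output: str,
) -> Dict[str, str]:
    """Serialize a structured tool interaction into the four string fields."""
    # Keep a tool result that is already text unchanged; serialize anything else as JSON.
    if isinstance(tool_result, str):
        result_text = tool_result
    else:
        result_text = json.dumps(tool_result, sort_keys=True)
    return {
        "input": user_query,
        "tool_call": json.dumps({"name": tool_name, "arguments": tool_args}, sort_keys=True),
        "tool_result": result_text,
        "output": agent_output,
    }


eval_input = build_eval_input(
    user_query="What's the weather in Seattle?",
    tool_name="get_weather",
    tool_args={"location": "Seattle"},
    tool_result={"temperature": 58, "conditions": "cloudy"},
    agent_output="Seattle is currently 58 degrees F and cloudy.",
)
print(eval_input["tool_call"])
print(eval_input["tool_result"])
```

Sorting keys keeps the serialized strings stable across runs, which helps when evaluation inputs are cached or compared.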
Output
Returns a list containing one Score object with the following fields:
| Field | Description |
|---|---|
| name | The evaluator name (e.g., "tool_response_handling"). |
| score | 1.0 if correct, 0.0 if incorrect. |
| label | The classification label ("correct" or "incorrect"). |
| explanation | An explanation from the LLM judge. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | The optimization direction (maximize). |
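Since score is 1.0 for correct and 0.0 for incorrect, averaging scores over many evaluations gives the fraction of tool responses handled correctly. The sketch below illustrates that aggregation with ScoreSketch, a hypothetical dataclass that mirrors the fields listed above. It is a stand-in for illustration and not the library's Score class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScoreSketch:
    """Hypothetical stand-in mirroring the documented output fields."""
    name: str
    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "maximize"


def pass_rate(scores: List[ScoreSketch]) -> float:
    """Mean score over results that have a score; 0.0 when none do."""
    scored = [s.score for s in scores if s.score is not None]
    if not scored:
        return 0.0
    return sum(scored) / len(scored)


results = [
    ScoreSketch("tool_response_handling", 1.0, "correct"),
    ScoreSketch("tool_response_handling", 0.0, "incorrect"),
    ScoreSketch("tool_response_handling", 1.0, "correct"),
]
print(round(pass_rate(results), 2))
```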
Usage Examples
Correct Data Extraction
from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)
eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58 degrees F and cloudy.",
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)
# Expected: score=1.0, label='correct'
Detecting Hallucinated Data
eval_input = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect' -- Mario's Italian was hallucinated
Error Handling Evaluation
eval_input = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0 -- agent fabricated order data from an error response
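Examples like these can be collected into a small regression set and checked against expected labels. The sketch below assumes only what the Output section states: evaluate returns a list whose first element carries a label. The find_label_mismatches helper and the fake_evaluate stand-in are hypothetical. The stand-in exists so the sketch runs without an API key, and with the real evaluator one would pass tool_response_eval.evaluate instead. LLM judges can vary between runs, so occasional mismatches do not necessarily indicate a bug.

```python
from types import SimpleNamespace
from typing import Callable, Dict, List, Sequence, Tuple


def find_label_mismatches(
    evaluate: Callable[[Dict[str, str]], Sequence],
    cases: List[Tuple[Dict[str, str], str]],
) -> List[Tuple[int, str, str]]:
    """Return (case index, expected label, actual label) for each disagreement."""
    mismatches = []
    for index, (eval_input, expected_label) in enumerate(cases):
        scores = evaluate(eval_input)
        actual_label = scores[0].label  # first result of the call, per the Output section
        if actual_label != expected_label:
            mismatches.append((index, expected_label, actual_label))
    return mismatches


def fake_evaluate(eval_input: Dict[str, str]) -> list:
    """Offline stand-in for the judge: flags any record whose tool result is an error."""
    label = "incorrect" if '"error"' in eval_input["tool_result"] else "correct"
    return [SimpleNamespace(label=label)]


cases = [
    (
        {
            "input": "What's the weather in Seattle?",
            "tool_call": 'get_weather(location="Seattle")',
            "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
            "output": "Seattle is currently 58 degrees F and cloudy.",
        },
        "correct",
    ),
    (
        {
            "input": "Find my recent orders",
            "tool_call": "get_orders(user_id='123')",
            "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
            "output": "[Retried] Your order (ORD-001) has shipped.",
        },
        "incorrect",
    ),
]
print(find_label_mismatches(fake_evaluate, cases))
```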
Related Pages
- Arize_ai_Phoenix_ToolSelectionEvaluator -- Evaluates whether the correct tool was selected.
- Arize_ai_Phoenix_ToolInvocationEvaluator -- Evaluates whether a tool was invoked correctly.
- Arize_ai_Phoenix_FaithfulnessEvaluator -- Related concept of checking faithfulness to source material.
- Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.