Implementation:Arize ai Phoenix ToolResponseHandlingEvaluator

Overview

ToolResponseHandlingEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that determines whether an AI agent properly handled a tool's response. It extends ClassificationEvaluator and evaluates the agent's post-tool-call behavior, including error handling, data extraction, transformation, and safe information disclosure.

Description

The ToolResponseHandlingEvaluator focuses on what happens after a tool returns its result. It does not evaluate whether the correct tool was selected (see ToolSelectionEvaluator) or whether the tool was invoked correctly (see ToolInvocationEvaluator). Instead, it assesses how the agent processed the tool's output to produce its final response.

Key aspects evaluated include:

Data extraction -- Did the agent correctly extract information from the tool result?
Data transformation -- Did the agent properly format or transform the data for the user?
Error handling -- Did the agent handle tool errors or edge cases appropriately?
Information safety -- Did the agent avoid disclosing sensitive information?
Hallucination avoidance -- Did the agent refrain from adding information not present in the tool result?

The evaluator loads its configuration from TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG. Its constructor takes the following parameter:

Parameter	Type	Description
`llm`	`LLM`	The LLM instance to use as the judge for evaluation. Must support tool calling or structured output.

Usage

from phoenix.evals.metrics import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = ToolResponseHandlingEvaluator(llm=llm)

Code Reference

Property	Value
Source File	packages/phoenix-evals/src/phoenix/evals/metrics/tool_response_handling.py
Module	`phoenix.evals.metrics.tool_response_handling`
Class	`ToolResponseHandlingEvaluator(ClassificationEvaluator)`
Lines	~101
Kind	`"llm"`
Direction	Loaded from config (maximize)
Domain	LLM Evaluation, Metrics, Agent Evaluation

Class Attributes

Attribute	Description
`NAME`	The evaluator name, loaded from `TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.name`.
`PROMPT`	A `PromptTemplate` built from the config's messages.
`CHOICES`	Classification labels (correct, incorrect) from the config.
`DIRECTION`	Optimization direction from the config.

Input Schema

Defined by the inner class ToolResponseHandlingInputSchema(BaseModel):

Field	Type	Description
`input`	`str`	The user query or conversation context.
`tool_call`	`str`	The tool invocation(s) made by the agent, including arguments.
`tool_result`	`str`	The tool's response (data, errors, or partial results).
`output`	`str`	The agent's handling after receiving the tool result (may include retries, follow-ups, or final response).
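
Because every field in the schema is typed as `str`, structured tool calls and tool results have to be serialized before they are passed to the evaluator. The sketch below shows one possible way to do that. `build_eval_input` and `_to_text` are hypothetical helpers, not part of phoenix.evals, and serializing with `json.dumps` is an assumption rather than a documented requirement.

```python
import json
from typing import Any


def _to_text(value: Any) -> str:
    """Return strings unchanged; serialize anything else as JSON."""
    if isinstance(value, str):
        return value
    return json.dumps(value, sort_keys=True, default=str)


def build_eval_input(
    user_query: str, tool_call: Any, tool_result: Any, agent_output: str
) -> dict:
    """Hypothetical helper: map raw trace data onto the four schema fields."""
    return {
        "input": user_query,
        "tool_call": _to_text(tool_call),
        "tool_result": _to_text(tool_result),
        "output": agent_output,
    }


eval_input = build_eval_input(
    user_query="What's the weather in Seattle?",
    tool_call={"name": "get_weather", "arguments": {"location": "Seattle"}},
    tool_result={"temperature": 58, "conditions": "cloudy"},
    agent_output="Seattle is currently 58 degrees F and cloudy.",
)
```

The resulting dictionary has the same shape as the `eval_input` dictionaries in the usage examples on this page.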

I/O Contract

Input

Field	Type	Required	Description
`input`	`str`	Yes	The user query or conversation context that triggered the tool call.
`tool_call`	`str`	Yes	The tool invocation(s) made by the agent, including arguments.
`tool_result`	`str`	Yes	The raw response from the tool (data, errors, or partial results).
`output`	`str`	Yes	The agent's final output or action after processing the tool result.
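
Since all four fields are required strings, it can be useful to check a record before spending an LLM call on it. The following is a minimal sketch of such a pre-check. `find_contract_violations` is a hypothetical helper written for this page, and the evaluator may well perform its own validation through its input schema.

```python
REQUIRED_FIELDS = ("input", "tool_call", "tool_result", "output")


def find_contract_violations(eval_input: dict) -> list:
    """Return one message per missing or non-string required field."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in eval_input:
            problems.append(f"missing required field: {field}")
        elif not isinstance(eval_input[field], str):
            kind = type(eval_input[field]).__name__
            problems.append(f"field {field} must be str, got {kind}")
    return problems


complete = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "The order service is rate limited; please try again shortly.",
}
# This record omits two fields and passes the tool result as a dict, not a str.
incomplete = {
    "input": "Find my recent orders",
    "tool_result": {"error": "rate_limit_exceeded"},
}

for record in (complete, incomplete):
    print(find_contract_violations(record))
```

An empty list means the record satisfies the contract described in the table above.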

Output

Returns a list containing one Score object with the following fields:

Field	Description
`name`	The evaluator name (e.g., `"tool_response_handling"`).
`score`	`1.0` if correct, `0.0` if incorrect.
`label`	The classification label (`"correct"` or `"incorrect"`).
`explanation`	An explanation from the LLM judge.
`metadata`	Dictionary containing the model name used for evaluation.
`kind`	`"llm"`
`direction`	The optimization direction (maximize).
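
When several records are evaluated, the individual scores are usually aggregated. The sketch below shows one way to do that using only the `score` and `label` fields listed above. `summarize` is a hypothetical helper, and `SimpleNamespace` objects stand in for real Score objects so that the example runs without an LLM call.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(scores):
    """Aggregate objects exposing `.score` and `.label`; skip missing scores."""
    numeric = [s.score for s in scores if s.score is not None]
    return {
        "count": len(scores),
        "mean_score": sum(numeric) / len(numeric) if numeric else None,
        "labels": dict(Counter(s.label for s in scores)),
    }


# Stand-ins for Score objects (assumption: only these attributes are needed).
collected = [
    SimpleNamespace(name="tool_response_handling", score=1.0, label="correct"),
    SimpleNamespace(name="tool_response_handling", score=0.0, label="incorrect"),
    SimpleNamespace(name="tool_response_handling", score=1.0, label="correct"),
    SimpleNamespace(name="tool_response_handling", score=0.0, label="incorrect"),
]
summary = summarize(collected)
print(summary)
```

Because the direction is maximize, a higher mean score indicates better tool response handling across the evaluated records.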

Usage Examples

Correct Data Extraction

from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)

eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58 degrees F and cloudy.",
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)
# Expected: score=1.0, label='correct'

Detecting Hallucinated Data

# Reuses tool_response_eval from the "Correct Data Extraction" example.
eval_input = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect' -- Mario's Italian was hallucinated

Error Handling Evaluation

# Reuses tool_response_eval from the "Correct Data Extraction" example.
eval_input = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect' -- agent fabricated order data from an error response

Related Pages

Arize_ai_Phoenix_ToolSelectionEvaluator -- Evaluates whether the correct tool was selected.
Arize_ai_Phoenix_ToolInvocationEvaluator -- Evaluates whether a tool was invoked correctly.
Arize_ai_Phoenix_FaithfulnessEvaluator -- Related concept of checking faithfulness to source material.
Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.
