Implementation:Arize ai Phoenix ToolResponseHandlingEvaluator

From Leeroopedia
Revision as of 12:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Arize_ai_Phoenix_ToolResponseHandlingEvaluator.md)

Overview

ToolResponseHandlingEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that determines whether an AI agent properly handled a tool's response. It extends ClassificationEvaluator and evaluates the agent's post-tool-call behavior, including error handling, data extraction, transformation, and safe information disclosure.

Description

The ToolResponseHandlingEvaluator focuses on what happens after a tool returns its result. It does not evaluate whether the correct tool was selected (see ToolSelectionEvaluator) or whether the tool was invoked correctly (see ToolInvocationEvaluator). Instead, it assesses how the agent processed the tool's output to produce its final response.

Key aspects evaluated include:

  • Data extraction -- Did the agent correctly extract information from the tool result?
  • Data transformation -- Did the agent properly format or transform the data for the user?
  • Error handling -- Did the agent handle tool errors or edge cases appropriately?
  • Information safety -- Did the agent avoid disclosing sensitive information?
  • Hallucination avoidance -- Did the agent refrain from adding information not present in the tool result?

The evaluator loads its configuration from TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.

| Parameter | Type | Description |
| --- | --- | --- |
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |

Usage

from phoenix.evals.metrics import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = ToolResponseHandlingEvaluator(llm=llm)

Code Reference

| Property | Value |
| --- | --- |
| Source File | packages/phoenix-evals/src/phoenix/evals/metrics/tool_response_handling.py |
| Module | phoenix.evals.metrics.tool_response_handling |
| Class | ToolResponseHandlingEvaluator(ClassificationEvaluator) |
| Lines | ~101 |
| Kind | "llm" |
| Direction | Loaded from config (maximize) |
| Domain | LLM Evaluation, Metrics, Agent Evaluation |

Class Attributes

| Attribute | Description |
| --- | --- |
| NAME | The evaluator name, loaded from TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels (correct, incorrect) from the config. |
| DIRECTION | Optimization direction from the config. |

Input Schema

Defined by the inner class ToolResponseHandlingInputSchema(BaseModel):

| Field | Type | Description |
| --- | --- | --- |
| input | str | The user query or conversation context. |
| tool_call | str | The tool invocation(s) made by the agent, including arguments. |
| tool_result | str | The tool's response (data, errors, or partial results). |
| output | str | The agent's handling after receiving the tool result (may include retries, follow-ups, or final response). |

I/O Contract

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input | str | Yes | The user query or conversation context that triggered the tool call. |
| tool_call | str | Yes | The tool invocation(s) made by the agent, including arguments. |
| tool_result | str | Yes | The raw response from the tool (data, errors, or partial results). |
| output | str | Yes | The agent's final output or action after processing the tool result. |
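
Because all four input fields are typed as str, structured tool calls and tool results have to be serialized before they are passed to the evaluator. The sketch below shows one way to do that with the standard library only. The helper name build_eval_input and the choice of JSON with sorted keys are assumptions made for illustration; they are not part of arize-phoenix-evals.

```python
import json
from typing import Any, Dict

# Field names taken from the Input table above.
REQUIRED_FIELDS = ("input", "tool_call", "tool_result", "output")


def _as_text(value: Any) -> str:
    """Leave strings untouched; serialize anything else as JSON (an assumption)."""
    return value if isinstance(value, str) else json.dumps(value, sort_keys=True)


def build_eval_input(
    user_query: str, tool_call: Any, tool_result: Any, agent_output: str
) -> Dict[str, str]:
    """Hypothetical helper: assemble the four string fields the evaluator expects."""
    record = {
        "input": user_query,
        "tool_call": _as_text(tool_call),
        "tool_result": _as_text(tool_result),
        "output": agent_output,
    }
    # All four fields are required, so reject empty or whitespace-only values early.
    missing = [name for name in REQUIRED_FIELDS if not record[name].strip()]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return record


eval_input = build_eval_input(
    "What's the weather in Seattle?",
    'get_weather(location="Seattle")',
    {"temperature": 58, "conditions": "cloudy"},
    "Seattle is currently 58 degrees F and cloudy.",
)
print(eval_input["tool_result"])
```

The resulting dictionary has the same shape as the eval_input dictionaries used in the usage examples on this page, so it can be passed to evaluate() unchanged.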

Output

Returns a list containing one Score object with the following fields:

| Field | Description |
| --- | --- |
| name | The evaluator name (e.g., "tool_response_handling"). |
| score | 1.0 if correct, 0.0 if incorrect. |
| label | The classification label ("correct" or "incorrect"). |
| explanation | An explanation from the LLM judge. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | The optimization direction (maximize). |
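
Since score is 1.0 for correct and 0.0 for incorrect, the mean score over a set of evaluated examples is the fraction of tool responses that were handled correctly. The following is a minimal sketch of that aggregation. ScoreLike is a stand-in dataclass that mirrors only some of the fields listed above; it is not the library's Score class, and pass_rate is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScoreLike:
    """Stand-in with a subset of the documented fields; NOT the library's Score class."""
    name: str
    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str] = None


def pass_rate(scores: Iterable[ScoreLike]) -> float:
    """Hypothetical helper: fraction of examples labeled correct (mean of score)."""
    values = [s.score for s in scores if s.score is not None]
    if not values:
        raise ValueError("no numeric scores to aggregate")
    return sum(values) / len(values)


results = [
    ScoreLike("tool_response_handling", 1.0, "correct"),
    ScoreLike("tool_response_handling", 0.0, "incorrect", "Added a restaurant not in the tool result."),
    ScoreLike("tool_response_handling", 1.0, "correct"),
    ScoreLike("tool_response_handling", 1.0, "correct"),
]
print(f"pass rate: {pass_rate(results):.2f}")
```

A higher pass rate is better, which matches the maximize direction noted in the table.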

Usage Examples

Correct Data Extraction

from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)

eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58 degrees F and cloudy.",
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)
# Expected: score=1.0, label='correct'

Detecting Hallucinated Data

# Reuses tool_response_eval from the "Correct Data Extraction" example above.
eval_input = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect' -- Mario's Italian was hallucinated

Error Handling Evaluation

# Reuses tool_response_eval from the "Correct Data Extraction" example above.
eval_input = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped.",
}
scores = tool_response_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect' -- agent fabricated order data from an error response
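
When several cases like the ones above are evaluated together, it is often useful to collect the failing ones along with the judge's explanation. The sketch below assumes only what this page documents: evaluate() takes one input dictionary and returns a list whose first element has label and explanation fields. The collect_failures helper is hypothetical, and StubEvaluator is an offline stand-in with a deliberately crude rule so the snippet runs without an LLM; it does not reflect how the real judge decides.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def collect_failures(
    evaluator: Any, cases: List[Dict[str, str]]
) -> List[Tuple[Dict[str, str], str]]:
    """Hypothetical helper: return (case, explanation) for every case not labeled correct."""
    failures = []
    for case in cases:
        first = evaluator.evaluate(case)[0]  # one score object per call, per the Output section
        if first.label != "correct":
            failures.append((case, first.explanation or ""))
    return failures


class StubEvaluator:
    """Offline stand-in for demonstration only; swap in ToolResponseHandlingEvaluator."""

    def evaluate(self, eval_input: Dict[str, str]) -> list:
        # Crude placeholder rule: treat any tool_result with an "error" key as mishandled.
        failed = '"error"' in eval_input["tool_result"]
        return [SimpleNamespace(
            label="incorrect" if failed else "correct",
            score=0.0 if failed else 1.0,
            explanation="tool returned an error" if failed else "ok",
        )]


cases = [
    {
        "input": "What's the weather in Seattle?",
        "tool_call": 'get_weather(location="Seattle")',
        "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
        "output": "Seattle is currently 58 degrees F and cloudy.",
    },
    {
        "input": "Find my recent orders",
        "tool_call": "get_orders(user_id='123')",
        "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
        "output": "[Retried] Your order (ORD-001) has shipped.",
    },
]

failures = collect_failures(StubEvaluator(), cases)
for case, why in failures:
    print(case["input"], "->", why)
```

Because collect_failures relies only on the evaluate() method and the label and explanation fields, the stub can be replaced by a real evaluator instance without changing the helper.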
