Implementation: Arize AI Phoenix HallucinationEvaluator
Overview
HallucinationEvaluator is a deprecated LLM-based classification evaluator in the arize-phoenix-evals package that detects hallucinations in grounded LLM responses. It extends ClassificationEvaluator and classifies responses as factual or hallucinated relative to a reference context. Users should migrate to FaithfulnessEvaluator, which uses updated terminology and scoring conventions.
Description
The HallucinationEvaluator is maintained for backwards compatibility. It uses an LLM judge to evaluate whether a model's output contains hallucinated information not supported by the provided context.
Deprecation Notice: This evaluator emits a DeprecationWarning upon instantiation. The warning message directs users to use FaithfulnessEvaluator instead, which differs in the following ways:
| Aspect | HallucinationEvaluator (deprecated) | FaithfulnessEvaluator (recommended) |
|---|---|---|
| Labels | "factual" / "hallucinated" | "faithful" / "unfaithful" |
| Score semantics | 1.0 = hallucinated, 0.0 = factual | 1.0 = faithful, 0.0 = unfaithful |
| Direction | "minimize" | "maximize" |
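Because the two evaluators use inverted score semantics, any downstream code that thresholds on the old scores must be adjusted when migrating. A minimal sketch of the conversion (the function name is illustrative, not part of the Phoenix API):

```python
def hallucination_to_faithfulness(score: float) -> float:
    """Map a 'minimize' hallucination score (1.0 = hallucinated)
    onto a 'maximize' faithfulness score (1.0 = faithful)."""
    return 1.0 - score
```

A factual response that scored 0.0 under the old convention scores 1.0 under the new one.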
The evaluator loads its configuration from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG, including the prompt template, classification choices, and optimization direction.
Parameters
| Parameter | Type | Description |
|---|---|---|
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |
Usage
```python
from phoenix.evals.metrics import HallucinationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Warning: emits DeprecationWarning on instantiation
evaluator = HallucinationEvaluator(llm=llm)
```
Code Reference
| Property | Value |
|---|---|
| Source File | packages/phoenix-evals/src/phoenix/evals/metrics/hallucination.py |
| Module | phoenix.evals.metrics.hallucination |
| Class | HallucinationEvaluator(ClassificationEvaluator) |
| Lines | ~90 |
| Kind | "llm" |
| Direction | "minimize" |
| Status | Deprecated; use FaithfulnessEvaluator instead |
| Domain | LLM Evaluation, Metrics |
Class Attributes
| Attribute | Description |
|---|---|
| NAME | The evaluator name, loaded from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels (factual, hallucinated) from the config. |
| DIRECTION | Optimization direction from the config ("minimize"). |
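The CHOICES attribute ties each classification label to a numeric score; under the "minimize" direction, lower scores are better. A stand-in sketch of that mapping (the real values come from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG, so this is illustrative only):

```python
# Illustrative stand-in for how a classification evaluator turns a
# predicted label into a numeric score under the "minimize" direction.
CHOICES = {"factual": 0.0, "hallucinated": 1.0}
DIRECTION = "minimize"

def score_for(label: str) -> float:
    """Look up the numeric score for a classification label."""
    return CHOICES[label]  # raises KeyError on unexpected labels
```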
Input Schema
Defined by the inner class HallucinationInputSchema(BaseModel):
| Field | Type | Description |
|---|---|---|
| input | str | The input query. |
| output | str | The response to the query. |
| context | str | The context or reference text. |
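The real schema is a pydantic BaseModel; the same contract can be sketched with a stdlib dataclass (hypothetical stand-in, shown only to illustrate the required fields):

```python
from dataclasses import dataclass, fields

@dataclass
class HallucinationInputSketch:
    """Stdlib stand-in for the pydantic HallucinationInputSchema."""
    input: str    # the input query
    output: str   # the response to the query
    context: str  # the context or reference text

    def __post_init__(self) -> None:
        # Mirror pydantic's validation: all three fields are required strings.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), str):
                raise TypeError(f"{f.name} must be a str")
```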
Constructor Behavior
The __init__ method issues a DeprecationWarning via the warnings module before delegating to the parent ClassificationEvaluator.__init__:
```python
warnings.warn(
    "HallucinationEvaluator is deprecated and will be removed in a future version. "
    "Please use FaithfulnessEvaluator instead. The new evaluator uses "
    "'faithful'/'unfaithful' labels and maximizes score (1.0=faithful) instead of "
    "minimizing it (0.0=factual).",
    DeprecationWarning,
    stacklevel=2,
)
```
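To verify (or deliberately silence) this warning in tests, the standard warnings machinery applies. A sketch using a stand-in class, since the real evaluator requires an LLM instance:

```python
import warnings

class DeprecatedEvaluatorSketch:
    """Stand-in mimicking HallucinationEvaluator's __init__ warning."""
    def __init__(self) -> None:
        warnings.warn(
            "HallucinationEvaluator is deprecated; use FaithfulnessEvaluator.",
            DeprecationWarning,
            stacklevel=2,
        )

# Capture the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    DeprecatedEvaluatorSketch()
```

The same `catch_warnings` pattern works against the real class; `warnings.simplefilter("ignore", DeprecationWarning)` suppresses it instead.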
I/O Contract
Input
| Field | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | The original query or question. |
| output | str | Yes | The model's response to be evaluated. |
| context | str | Yes | The reference context that the response should be grounded in. |
Output
Returns a list containing one Score object with the following fields:
| Field | Description |
|---|---|
| name | "hallucination" |
| score | 0.0 if factual, 1.0 if hallucinated. |
| label | The classification label ("factual" or "hallucinated"). |
| explanation | An explanation from the LLM judge. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | "minimize" |
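The shape of the returned Score can be sketched as a plain dataclass (illustrative stand-in; the real Score class lives in phoenix.evals and may differ in detail):

```python
from dataclasses import dataclass, field

@dataclass
class ScoreSketch:
    """Illustrative stand-in for the Score object returned by evaluate()."""
    name: str = "hallucination"
    score: float = 0.0              # 0.0 = factual, 1.0 = hallucinated
    label: str = "factual"
    explanation: str = ""
    metadata: dict = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "minimize"

# evaluate() returns a one-element list of such objects.
result = [ScoreSketch(score=0.0, label="factual",
                      explanation="Supported by context",
                      metadata={"model": "gpt-4o-mini"})]
```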
Usage Examples
Detecting a Factual Response
```python
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
hallucination_eval = HallucinationEvaluator(llm=llm)  # emits DeprecationWarning

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France.",
}

scores = hallucination_eval.evaluate(eval_input)
print(scores)
# [Score(name='hallucination', score=0.0, label='factual',
#        explanation='Information is supported by context',
#        metadata={'model': 'gpt-4o-mini'},
#        kind='llm', direction='minimize')]
```
Migration to FaithfulnessEvaluator
```python
# Before (deprecated):
from phoenix.evals.metrics import HallucinationEvaluator
evaluator = HallucinationEvaluator(llm=llm)

# After (recommended):
from phoenix.evals.metrics import FaithfulnessEvaluator
evaluator = FaithfulnessEvaluator(llm=llm)
```
Related Pages
- Principle:Arize_ai_Phoenix_Evaluator_Design
- Heuristic:Arize_ai_Phoenix_Warning_Deprecated_HallucinationEvaluator
- Arize_ai_Phoenix_FaithfulnessEvaluator -- The recommended replacement evaluator.
- Arize_ai_Phoenix_CorrectnessEvaluator -- LLM-based correctness evaluation.
- Arize_ai_Phoenix_DocumentRelevanceEvaluator -- LLM-based document relevance evaluation.
- Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.