Implementation: Arize AI Phoenix HallucinationEvaluator
Overview
HallucinationEvaluator is a deprecated LLM-based classification evaluator in the arize-phoenix-evals package that detects hallucinations in grounded LLM responses. It extends ClassificationEvaluator and classifies responses as factual or hallucinated relative to a reference context. Users should migrate to FaithfulnessEvaluator, which uses updated terminology and scoring conventions.
Description
The HallucinationEvaluator is maintained for backwards compatibility. It uses an LLM judge to evaluate whether a model's output contains hallucinated information not supported by the provided context.
Deprecation Notice: This evaluator emits a DeprecationWarning upon instantiation. The warning message directs users to use FaithfulnessEvaluator instead, which differs in the following ways:
| Aspect | HallucinationEvaluator (deprecated) | FaithfulnessEvaluator (recommended) |
|---|---|---|
| Labels | "factual" / "hallucinated" | "faithful" / "unfaithful" |
| Score semantics | 1.0 = hallucinated, 0.0 = factual | 1.0 = faithful, 0.0 = unfaithful |
| Direction | "minimize" | "maximize" |
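Because the two evaluators use inverted score semantics, any downstream code that thresholds on the old scores must be adjusted when migrating. A minimal sketch of the conversion (the function name is illustrative, not part of the Phoenix API):

```python
def hallucination_to_faithfulness(score: float) -> float:
    """Map a 'minimize' hallucination score (1.0 = hallucinated)
    onto a 'maximize' faithfulness score (1.0 = faithful)."""
    return 1.0 - score
```

A factual response that scored 0.0 under the old convention scores 1.0 under the new one.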
The evaluator loads its configuration from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG, including the prompt template, classification choices, and optimization direction.
Parameters
| Parameter | Type | Description |
|---|---|---|
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |
Usage
```python
from phoenix.evals.metrics import HallucinationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Warning: emits DeprecationWarning on instantiation
evaluator = HallucinationEvaluator(llm=llm)
```
Code Reference
| Property | Value |
|---|---|
| Source File | packages/phoenix-evals/src/phoenix/evals/metrics/hallucination.py |
| Module | phoenix.evals.metrics.hallucination |
| Class | HallucinationEvaluator(ClassificationEvaluator) |
| Lines | ~90 |
| Kind | "llm" |
| Direction | "minimize" |
| Status | Deprecated; use FaithfulnessEvaluator instead |
| Domain | LLM Evaluation, Metrics |
Class Attributes
| Attribute | Description |
|---|---|
| NAME | The evaluator name, loaded from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels (factual, hallucinated) from the config. |
| DIRECTION | Optimization direction from the config ("minimize"). |
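The CHOICES attribute ties each classification label to a numeric score; under the "minimize" direction, lower scores are better. A stand-in sketch of that mapping (the real values come from HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG, so this is illustrative only):

```python
# Illustrative stand-in for how a classification evaluator turns a
# predicted label into a numeric score under the "minimize" direction.
CHOICES = {"factual": 0.0, "hallucinated": 1.0}
DIRECTION = "minimize"

def score_for(label: str) -> float:
    """Look up the numeric score for a classification label."""
    return CHOICES[label]  # raises KeyError on unexpected labels
```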
Input Schema
Defined by the inner class HallucinationInputSchema(BaseModel):
| Field | Type | Description |
|---|---|---|
| input | str | The input query. |
| output | str | The response to the query. |
| context | str | The context or reference text. |
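The real schema is a pydantic BaseModel; the same contract can be sketched with a stdlib dataclass (hypothetical stand-in, shown only to illustrate the required fields):

```python
from dataclasses import dataclass, fields

@dataclass
class HallucinationInputSketch:
    """Stdlib stand-in for the pydantic HallucinationInputSchema."""
    input: str    # the input query
    output: str   # the response to the query
    context: str  # the context or reference text

    def __post_init__(self) -> None:
        # Mirror pydantic's validation: all three fields are required strings.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), str):
                raise TypeError(f"{f.name} must be a str")
```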
Constructor Behavior
The __init__ method issues a DeprecationWarning via the warnings module before delegating to the parent ClassificationEvaluator.__init__:
```python
warnings.warn(
    "HallucinationEvaluator is deprecated and will be removed in a future version. "
    "Please use FaithfulnessEvaluator instead. The new evaluator uses "
    "'faithful'/'unfaithful' labels and maximizes score (1.0=faithful) instead of "
    "minimizing it (0.0=factual).",
    DeprecationWarning,
    stacklevel=2,
)
```
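To verify (or deliberately silence) this warning in tests, the standard warnings machinery applies. A sketch using a stand-in class, since the real evaluator requires an LLM instance:

```python
import warnings

class DeprecatedEvaluatorSketch:
    """Stand-in mimicking HallucinationEvaluator's __init__ warning."""
    def __init__(self) -> None:
        warnings.warn(
            "HallucinationEvaluator is deprecated; use FaithfulnessEvaluator.",
            DeprecationWarning,
            stacklevel=2,
        )

# Capture the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    DeprecatedEvaluatorSketch()
```

The same `catch_warnings` pattern works against the real class; `warnings.simplefilter("ignore", DeprecationWarning)` suppresses it instead.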
I/O Contract
Input
| Field | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | The original query or question. |
| output | str | Yes | The model's response to be evaluated. |
| context | str | Yes | The reference context that the response should be grounded in. |
Output
Returns a list containing one Score object with the following fields:
| Field | Description |
|---|---|
| name | "hallucination" |
| score | 0.0 if factual, 1.0 if hallucinated. |
| label | The classification label ("factual" or "hallucinated"). |
| explanation | An explanation from the LLM judge. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | "minimize" |
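The shape of the returned Score can be sketched as a plain dataclass (illustrative stand-in; the real Score class lives in phoenix.evals and may differ in detail):

```python
from dataclasses import dataclass, field

@dataclass
class ScoreSketch:
    """Illustrative stand-in for the Score object returned by evaluate()."""
    name: str = "hallucination"
    score: float = 0.0              # 0.0 = factual, 1.0 = hallucinated
    label: str = "factual"
    explanation: str = ""
    metadata: dict = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "minimize"

# evaluate() returns a one-element list of such objects.
result = [ScoreSketch(score=0.0, label="factual",
                      explanation="Supported by context",
                      metadata={"model": "gpt-4o-mini"})]
```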
Usage Examples
Detecting a Factual Response
```python
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
hallucination_eval = HallucinationEvaluator(llm=llm)  # emits DeprecationWarning

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France.",
}

scores = hallucination_eval.evaluate(eval_input)
print(scores)
# [Score(name='hallucination', score=0.0, label='factual',
#        explanation='Information is supported by context',
#        metadata={'model': 'gpt-4o-mini'},
#        kind='llm', direction='minimize')]
```
Migration to FaithfulnessEvaluator
```python
# Before (deprecated):
from phoenix.evals.metrics import HallucinationEvaluator
evaluator = HallucinationEvaluator(llm=llm)

# After (recommended):
from phoenix.evals.metrics import FaithfulnessEvaluator
evaluator = FaithfulnessEvaluator(llm=llm)
```
Related Pages
- Principle:Arize_ai_Phoenix_Evaluator_Design
- Heuristic:Arize_ai_Phoenix_Warning_Deprecated_HallucinationEvaluator
- Arize_ai_Phoenix_FaithfulnessEvaluator -- The recommended replacement evaluator.
- Arize_ai_Phoenix_CorrectnessEvaluator -- LLM-based correctness evaluation.
- Arize_ai_Phoenix_DocumentRelevanceEvaluator -- LLM-based document relevance evaluation.
- Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.