Implementation:Arize ai Phoenix CorrectnessEvaluator

Overview

CorrectnessEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that assesses the factual accuracy and completeness of model outputs. It extends ClassificationEvaluator and uses a judge LLM to determine whether a response to a given input is correct or incorrect.

Description

The CorrectnessEvaluator delegates evaluation to an LLM judge by sending the input query and model output through a prompt template loaded from a pre-generated classification evaluator configuration (CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG). The LLM classifies the output into one of two categories and returns a score with an explanation.

The evaluator is configured with:

NAME -- Loaded from the generated config, identifying the evaluator as "correctness".
PROMPT -- A PromptTemplate constructed from the config's message templates.
CHOICES -- The classification labels (correct / incorrect) from the config.
DIRECTION -- The optimization direction from the config (maximize, since higher scores indicate correctness).

The constructor takes a single parameter:

Parameter	Type	Description
`llm`	`LLM`	The LLM instance to use as the judge for evaluation. Must support tool calling or structured output.
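The config-driven wiring described above can be pictured with a small stand-in. This is a minimal sketch, not the library's source: `EvaluatorConfig`, `SKETCH_CONFIG`, `SketchEvaluator`, and the prompt text are hypothetical names and values invented for illustration, and the label-to-score mapping is assumed from the documented output contract (1.0 for correct, 0.0 for incorrect). The real configuration object may have a different shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorConfig:
    """Hypothetical stand-in for a pre-generated classification config."""
    name: str
    messages: tuple  # (role, template) pairs; the real config's shape may differ
    choices: dict    # label -> score
    direction: str


# Assumed values for illustration only; the real prompt lives in the generated config.
SKETCH_CONFIG = EvaluatorConfig(
    name="correctness",
    messages=(("user", "Question: {input}\nAnswer: {output}\nIs the answer correct?"),),
    choices={"correct": 1.0, "incorrect": 0.0},
    direction="maximize",
)


class SketchEvaluator:
    """Class attributes are read from the config once, at class-definition time."""
    NAME = SKETCH_CONFIG.name
    PROMPT = SKETCH_CONFIG.messages
    CHOICES = SKETCH_CONFIG.choices
    DIRECTION = SKETCH_CONFIG.direction

    def render(self, eval_input: dict) -> list:
        # Fill each message template with the evaluation fields.
        return [(role, template.format(**eval_input)) for role, template in self.PROMPT]
```

Keeping the name, prompt, labels, and direction in one config object means the evaluator class itself carries no hard-coded prompt text, which is the pattern the attribute list above describes.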

Usage

from phoenix.evals.metrics import CorrectnessEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CorrectnessEvaluator(llm=llm)

Code Reference

Property	Value
Source File	packages/phoenix-evals/src/phoenix/evals/metrics/correctness.py
Module	`phoenix.evals.metrics.correctness`
Class	`CorrectnessEvaluator(ClassificationEvaluator)`
Lines	~65
Kind	`"llm"`
Direction	`"maximize"`
Domain	LLM Evaluation, Metrics

Class Attributes

Attribute	Description
`NAME`	The evaluator name, loaded from `CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.name`.
`PROMPT`	A `PromptTemplate` built from the config's messages.
`CHOICES`	Classification labels from the config (e.g., correct, incorrect).
`DIRECTION`	Optimization direction from the config.

Input Schema

Defined by the inner class CorrectnessInputSchema(BaseModel):

Field	Type	Description
`input`	`str`	The input query or question.
`output`	`str`	The response to evaluate for correctness.

I/O Contract

Input

Field	Type	Required	Description
`input`	`str`	Yes	The original query or question posed to the model.
`output`	`str`	Yes	The model's response to be evaluated.
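Because both fields are required strings, it can help to check a record before spending an LLM call on it. The helper below is a hedged sketch using only the standard library; `check_eval_input` is a hypothetical name and is not part of phoenix.evals, whose own validation goes through the schema class described above.

```python
def check_eval_input(eval_input: dict) -> dict:
    """Verify the two required string fields before calling an evaluator."""
    required = ("input", "output")

    # Report every missing field at once rather than failing on the first.
    missing = [key for key in required if key not in eval_input]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    # Both fields are documented as str.
    wrong_type = [key for key in required if not isinstance(eval_input[key], str)]
    if wrong_type:
        raise TypeError(f"field(s) must be str: {', '.join(wrong_type)}")

    # Return only the fields the contract names, dropping any extras.
    return {key: eval_input[key] for key in required}
```

A caller could pass the returned dictionary straight to `evaluate`, so malformed records fail fast with a clear message.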

Output

Returns a list containing one Score object with the following fields:

Field	Description
`name`	`"correctness"`
`score`	`1.0` if correct, `0.0` if incorrect.
`label`	The classification label (e.g., `"correct"` or `"incorrect"`).
`explanation`	An explanation from the LLM judge describing its reasoning.
`metadata`	Dictionary containing the model name used for evaluation.
`kind`	`"llm"`
`direction`	`"maximize"`
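Since each call yields a score of 1.0 or 0.0, results over a dataset can be averaged into a fraction-correct figure. The sketch below assumes a stand-in `SketchScore` dataclass that mirrors the documented fields; both it and `fraction_correct` are hypothetical names for illustration, not library classes or functions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchScore:
    """Hypothetical stand-in mirroring the documented Score fields."""
    name: str
    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "maximize"


def fraction_correct(scores: list) -> float:
    """Average the 1.0 / 0.0 scores, skipping entries that carry no score."""
    values = [s.score for s in scores if s.score is not None]
    if not values:
        raise ValueError("no scored results to aggregate")
    return sum(values) / len(values)
```

Skipping entries without a score is a design choice made here for illustration; a stricter pipeline might treat them as failures instead.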

Usage Examples

Basic Correctness Evaluation

from phoenix.evals.metrics.correctness import CorrectnessEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
correctness_eval = CorrectnessEvaluator(llm=llm)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
print(scores)
# [Score(name='correctness', score=1.0, label='correct',
#     explanation='The response accurately states that Paris is the capital of France.',
#     metadata={'model': 'gpt-4o-mini'},
#     kind='llm', direction='maximize')]

Detecting an Incorrect Response

# Reuses correctness_eval from the basic example above.
eval_input = {
    "input": "What is the capital of France?",
    "output": "London is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect'

Related Pages

Arize_ai_Phoenix_FaithfulnessEvaluator -- LLM-based evaluator for faithfulness (context-grounded correctness).
Arize_ai_Phoenix_HallucinationEvaluator -- Deprecated LLM-based hallucination evaluator.
Arize_ai_Phoenix_Evals_Public_API -- The top-level phoenix.evals public API surface.
Arize_ai_Phoenix_PrecisionRecallFScore -- Code-based precision/recall evaluator for comparison with LLM-based correctness.
