

Implementation:Arize ai Phoenix CorrectnessEvaluator

From Leeroopedia
Revision as of 12:03, 16 February 2026 by Admin (auto-imported from implementations/Arize_ai_Phoenix_CorrectnessEvaluator.md)

Overview

CorrectnessEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that assesses the factual accuracy and completeness of model outputs. It extends ClassificationEvaluator and uses a judge LLM to determine whether a response to a given input is correct or incorrect.

Description

The CorrectnessEvaluator delegates evaluation to an LLM judge by sending the input query and model output through a prompt template loaded from a pre-generated classification evaluator configuration (CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG). The LLM classifies the output into one of two categories and returns a score with an explanation.

The evaluator is configured with:

  • NAME -- Loaded from the generated config, identifying the evaluator as "correctness".
  • PROMPT -- A PromptTemplate constructed from the config's message templates.
  • CHOICES -- The classification labels (correct / incorrect) from the config.
  • DIRECTION -- The optimization direction from the config (maximize, since higher scores indicate correctness).

Parameter  Type  Description
llm        LLM   The LLM instance to use as the judge for evaluation. Must support tool calling or structured output.
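
The configuration described above can be pictured as a small record that ties each label to a score. The sketch below is a stand-in written only with the standard library; SketchConfig, SKETCH_CONFIG, and score_for_label are hypothetical names, and the real CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG may be shaped differently. The label-to-score values follow the Output table on this page.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical stand-in for the generated config; not the library's own type.
@dataclass(frozen=True)
class SketchConfig:
    name: str
    choices: Dict[str, float]  # label -> score
    direction: str


# Values taken from this page: two labels, higher score is better.
SKETCH_CONFIG = SketchConfig(
    name="correctness",
    choices={"correct": 1.0, "incorrect": 0.0},
    direction="maximize",
)


def score_for_label(label: str, config: SketchConfig = SKETCH_CONFIG) -> Optional[float]:
    """Return the score attached to a label, or None if the label is unknown."""
    return config.choices.get(label)
```

Returning None for an unknown label, rather than raising, is a design choice of this sketch; it keeps a malformed judge response from aborting a longer evaluation run.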

Usage

from phoenix.evals.metrics import CorrectnessEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CorrectnessEvaluator(llm=llm)

Code Reference

Property     Value
Source File  packages/phoenix-evals/src/phoenix/evals/metrics/correctness.py
Module       phoenix.evals.metrics.correctness
Class        CorrectnessEvaluator(ClassificationEvaluator)
Lines        ~65
Kind         "llm"
Direction    "maximize"
Domain       LLM Evaluation, Metrics

Class Attributes

Attribute  Description
NAME       The evaluator name, loaded from CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.name.
PROMPT     A PromptTemplate built from the config's messages.
CHOICES    Classification labels from the config (e.g., correct, incorrect).
DIRECTION  Optimization direction from the config.

Input Schema

Defined by the inner class CorrectnessInputSchema(BaseModel):

Field   Type  Description
input   str   The input query or question.
output  str   The response to evaluate for correctness.

I/O Contract

Input

Field   Type  Required  Description
input   str   Yes       The original query or question posed to the model.
output  str   Yes       The model's response to be evaluated.
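
Because both fields are required strings, it can be useful to check a payload before spending a judge call on it. The following is a minimal sketch using only the standard library; find_input_problems and REQUIRED_FIELDS are hypothetical helpers for illustration and are not part of arize-phoenix-evals, which performs its own validation through CorrectnessInputSchema.

```python
from typing import Any, List, Mapping

# Both fields are required strings, per the input table above.
REQUIRED_FIELDS = ("input", "output")


def find_input_problems(eval_input: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the payload looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in eval_input:
            problems.append(f"missing required field: {field}")
        elif not isinstance(eval_input[field], str):
            type_name = type(eval_input[field]).__name__
            problems.append(f"field {field} must be str, got {type_name}")
    return problems
```

A caller could skip or log any record for which the returned list is non-empty and pass the rest to the evaluator.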

Output

Returns a list containing one Score object with the following fields:

Field        Description
name         "correctness"
score        1.0 if correct, 0.0 if incorrect.
label        The classification label (e.g., "correct" or "incorrect").
explanation  An explanation from the LLM judge describing its reasoning.
metadata     Dictionary containing the model name used for evaluation.
kind         "llm"
direction    "maximize"
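
Since each call returns a one-element list and the score is 1.0 or 0.0, the mean score over a batch is the fraction of responses judged correct. The sketch below illustrates that aggregation with a stand-in record; SketchScore and mean_correctness are hypothetical names, and SketchScore carries only a subset of the fields listed above rather than reproducing the library's Score class.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


# Hypothetical stand-in holding a subset of the Score fields described above.
@dataclass
class SketchScore:
    name: str
    score: Optional[float]
    label: Optional[str]


def mean_correctness(results: Sequence[List[SketchScore]]) -> Optional[float]:
    """Average the first score of each result list, skipping empty or unscored entries."""
    values = [r[0].score for r in results if r and r[0].score is not None]
    if not values:
        return None  # nothing to average
    return sum(values) / len(values)


batch = [
    [SketchScore("correctness", 1.0, "correct")],
    [SketchScore("correctness", 0.0, "incorrect")],
    [SketchScore("correctness", 1.0, "correct")],
    [SketchScore("correctness", 1.0, "correct")],
]
print(mean_correctness(batch))  # 0.75
```

Skipping unscored entries instead of counting them as 0.0 is an assumption of this sketch; a stricter report might treat a missing score as a failure.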

Usage Examples

Basic Correctness Evaluation

from phoenix.evals.metrics.correctness import CorrectnessEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
correctness_eval = CorrectnessEvaluator(llm=llm)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
print(scores)
# Example output (the explanation text varies between runs and judge models):
# [Score(name='correctness', score=1.0, label='correct',
#     explanation='The response accurately states that Paris is the capital of France.',
#     metadata={'model': 'gpt-4o-mini'},
#     kind='llm', direction='maximize')]

Detecting an Incorrect Response

# Reuses correctness_eval from the previous example.
eval_input = {
    "input": "What is the capital of France?",
    "output": "London is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect'
