Overview
CorrectnessEvaluator is an LLM-based classification evaluator in the arize-phoenix-evals package that assesses the factual accuracy and completeness of model outputs. It extends ClassificationEvaluator and uses a judge LLM to determine whether a response to a given input is correct or incorrect.
Description
The CorrectnessEvaluator delegates evaluation to an LLM judge. It sends the input query and the model output through a prompt template loaded from a pre-generated classification evaluator configuration (CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG). The judge classifies the output as correct or incorrect, and the evaluator returns the corresponding score (1.0 or 0.0) together with the judge's explanation.
The evaluator is configured with:
- NAME -- Loaded from the generated config, identifying the evaluator as "correctness".
- PROMPT -- A PromptTemplate constructed from the config's message templates.
- CHOICES -- The classification labels (correct / incorrect) from the config.
- DIRECTION -- The optimization direction from the config (maximize, since higher scores indicate correctness).
| Parameter | Type | Description |
|---|---|---|
| llm | LLM | The LLM instance to use as the judge for evaluation. Must support tool calling or structured output. |
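To illustrate how a config-driven prompt of this kind could be filled in, here is a minimal sketch in plain Python. It does not use the library: `SketchConfig`, `SKETCH_CONFIG`, `render_prompt`, and the template wording are all hypothetical stand-ins, and the real CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG and PromptTemplate may be shaped differently. Only the name, the two labels, their scores, and the direction come from this page.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the generated config; the real
# CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG may be structured differently.
@dataclass(frozen=True)
class SketchConfig:
    name: str
    template: str   # placeholder wording, NOT the library's actual prompt
    choices: dict   # classification label -> numeric score
    direction: str


SKETCH_CONFIG = SketchConfig(
    name="correctness",
    template=(
        "Question: {input}\n"
        "Answer: {output}\n"
        "Is the answer correct or incorrect?"
    ),
    choices={"correct": 1.0, "incorrect": 0.0},
    direction="maximize",
)


def render_prompt(config: SketchConfig, eval_input: dict) -> str:
    """Fill the template's {input} and {output} placeholders."""
    return config.template.format(
        input=eval_input["input"],
        output=eval_input["output"],
    )


prompt = render_prompt(
    SKETCH_CONFIG,
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
    },
)
print(prompt)
```

In this sketch the rendered text is what would be sent to the judge; the judge's answer would then be restricted to the labels in `choices`.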
Usage
from phoenix.evals.metrics import CorrectnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CorrectnessEvaluator(llm=llm)
Code Reference
Class Attributes
| Attribute | Description |
|---|---|
| NAME | The evaluator name, loaded from CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.name. |
| PROMPT | A PromptTemplate built from the config's messages. |
| CHOICES | Classification labels from the config (e.g., correct, incorrect). |
| DIRECTION | Optimization direction from the config. |
Input Schema
Defined by the inner class CorrectnessInputSchema(BaseModel):
| Field | Type | Description |
|---|---|---|
| input | str | The input query or question. |
| output | str | The response to evaluate for correctness. |
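The schema above amounts to two required string fields. As a rough illustration, the following standard-library helper checks a payload before it is handed to the evaluator. `check_eval_input` and `REQUIRED_FIELDS` are hypothetical names introduced here; the library itself validates through the pydantic-based CorrectnessInputSchema, whose exact error behavior is not shown on this page.

```python
REQUIRED_FIELDS = ("input", "output")


def check_eval_input(eval_input: dict) -> dict:
    """Return eval_input unchanged if both required fields are strings.

    Hypothetical pre-check, not the library's own validation.
    """
    for field_name in REQUIRED_FIELDS:
        if field_name not in eval_input:
            raise ValueError(f"missing required field: {field_name!r}")
        if not isinstance(eval_input[field_name], str):
            raise TypeError(f"field {field_name!r} must be a str")
    return eval_input


checked = check_eval_input(
    {"input": "What is the capital of France?", "output": "Paris."}
)
```

A pre-check like this can surface malformed records before any LLM call is made, which may be useful when inputs come from an external dataset.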
I/O Contract
Input
| Field | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | The original query or question posed to the model. |
| output | str | Yes | The model's response to be evaluated. |
Output
Returns a list containing one Score object with the following fields:
| Field | Description |
|---|---|
| name | "correctness" |
| score | 1.0 if correct, 0.0 if incorrect. |
| label | The classification label (e.g., "correct" or "incorrect"). |
| explanation | An explanation from the LLM judge describing its reasoning. |
| metadata | Dictionary containing the model name used for evaluation. |
| kind | "llm" |
| direction | "maximize" |
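The relationship between the judge's label and the numeric score can be sketched as a simple lookup. The code below is an assumption-laden illustration: `SketchScore` is a stand-in dataclass that mirrors the fields listed above, not the library's Score class, and `to_score` is a hypothetical helper.

```python
from dataclasses import dataclass, field


# Stand-in mirroring the documented fields; NOT the library's Score class.
@dataclass
class SketchScore:
    name: str
    score: float
    label: str
    explanation: str
    metadata: dict = field(default_factory=dict)
    kind: str = "llm"
    direction: str = "maximize"


# Label-to-score mapping as described in the Output table.
CHOICES = {"correct": 1.0, "incorrect": 0.0}


def to_score(label: str, explanation: str, model: str) -> SketchScore:
    """Convert a judge label into a score record (hypothetical helper)."""
    if label not in CHOICES:
        raise ValueError(f"unexpected label: {label!r}")
    return SketchScore(
        name="correctness",
        score=CHOICES[label],
        label=label,
        explanation=explanation,
        metadata={"model": model},
    )


example = to_score("incorrect", "The capital of France is Paris.", "gpt-4o-mini")
print(example.score, example.label)
```

Because direction is "maximize", a higher score is better, so downstream code can treat 1.0 as a pass and 0.0 as a failure.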
Usage Examples
Basic Correctness Evaluation
from phoenix.evals.metrics.correctness import CorrectnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
correctness_eval = CorrectnessEvaluator(llm=llm)
eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
print(scores)
# Example output (the explanation text will vary between runs):
# [Score(name='correctness', score=1.0, label='correct',
#        explanation='The response accurately states that Paris is the capital of France.',
#        metadata={'model': 'gpt-4o-mini'},
#        kind='llm', direction='maximize')]
Detecting an Incorrect Response
eval_input = {
    "input": "What is the capital of France?",
    "output": "London is the capital of France.",
}
scores = correctness_eval.evaluate(eval_input)
# Expected: score=0.0, label='incorrect'
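When several examples are evaluated, the per-example scores can be averaged into a single pass rate. The sketch below assumes only what this page states, namely that evaluate returns a list containing one object with a score attribute. `FakeCorrectnessEvaluator` is a hypothetical offline stand-in that uses a hard-coded string check instead of an LLM, so the snippet runs without API access; `fraction_correct` is likewise an illustrative helper.

```python
from types import SimpleNamespace


class FakeCorrectnessEvaluator:
    """Hypothetical offline stand-in; it does not call an LLM."""

    def evaluate(self, eval_input: dict) -> list:
        # Hard-coded check used only to make the sketch runnable.
        is_correct = "Paris" in eval_input["output"]
        return [
            SimpleNamespace(
                name="correctness",
                score=1.0 if is_correct else 0.0,
                label="correct" if is_correct else "incorrect",
            )
        ]


def fraction_correct(evaluator, examples: list) -> float:
    """Average the first score returned for each example."""
    scores = [evaluator.evaluate(example)[0].score for example in examples]
    return sum(scores) / len(scores) if scores else 0.0


examples = [
    {"input": "What is the capital of France?",
     "output": "Paris is the capital of France."},
    {"input": "What is the capital of France?",
     "output": "London is the capital of France."},
]

rate = fraction_correct(FakeCorrectnessEvaluator(), examples)
print(rate)  # 0.5
```

In principle the same helper could be given the real correctness_eval instance in place of the fake one, provided its evaluate method behaves as described in the I/O Contract.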