Principle:Arize ai Phoenix Experiment Evaluator Definition
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Evaluation Metrics, Experiment Design |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Experiment evaluator definition is the practice of encapsulating an assessment criterion as a callable that scores or labels the output of an experiment task. The framework binds the evaluator's parameters by name and normalizes its return value into a structured evaluation record.
Description
In AI evaluation workflows, an evaluator is a function that assesses the quality of a task's output. Evaluators are the measurement instruments of the experiment framework: they take the output of a task (and optionally the input, expected output, and metadata from the dataset example) and produce a structured evaluation result consisting of scores, labels, and explanations.
The evaluator definition pattern mirrors the task definition pattern in its use of dynamic parameter binding by name. The framework inspects the evaluator function's parameter names and automatically binds them to the appropriate values. The available binding names for evaluators are:
- input: The input field of the dataset example.
- output: The output produced by the experiment task.
- expected: The expected or reference output from the dataset example.
- reference: An alias for expected.
- metadata: Metadata associated with the dataset example.
- example: The complete Example object with all associated fields.
For single-argument evaluators, the argument is automatically bound to the output of the task (whereas a single-argument task receives the input). This reflects the most common use case: evaluating what the task produced.
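The binding rules above can be sketched with the standard `inspect` module. This is a minimal illustration, not Phoenix's implementation: `bind_evaluator_args` and the `available` dictionary are hypothetical names, and the sketch assumes that a lone parameter whose name is not one of the binding names receives the task output.

```python
import inspect
from typing import Any, Callable, Dict


def bind_evaluator_args(fn: Callable[..., Any], available: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pick keyword arguments for fn by parameter name."""
    params = list(inspect.signature(fn).parameters)
    # Assumption: a lone parameter with an unrecognized name gets the task output.
    if len(params) == 1 and params[0] not in available:
        return {params[0]: available["output"]}
    unknown = [p for p in params if p not in available]
    if unknown:
        raise ValueError(f"Unsupported evaluator parameter(s): {unknown}")
    return {p: available[p] for p in params}


def exact_match(output, expected):
    return output == expected


def non_empty(text):  # single argument, bound to the task output
    return bool(text)


# Values a framework might have on hand for one example (illustrative data).
available = {
    "input": {"question": "2 + 2?"},
    "output": "4",
    "expected": "4",
    "reference": "4",  # alias for expected
    "metadata": {"topic": "arithmetic"},
    "example": None,  # placeholder for the complete Example object
}

print(bind_evaluator_args(exact_match, available))  # {'output': '4', 'expected': '4'}
print(bind_evaluator_args(non_empty, available))    # {'text': '4'}
```

Binding by name lets an evaluator declare only the fields it needs, so the same evaluator can be reused across datasets that carry different metadata.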
A key feature of the evaluator definition pattern is automatic result normalization. Evaluator functions can return results in several convenient formats, and the framework automatically converts them into structured EvaluationResult records:
- bool: Converted to a score (0.0 or 1.0) and a label ("True" or "False").
- int or float: Converted to a numeric score.
- str: Converted to a categorical label.
- tuple (score, label) or (score, label, explanation): Converted to the corresponding fields.
- EvaluationResult dict: Passed through directly with optional score, label, explanation, name, and metadata fields.
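As an illustration of these return formats, the evaluators below each return a different shape. The function names and scoring logic are invented for the example; only the return types correspond to the list above.

```python
from typing import Any, Dict, Tuple


def is_non_empty(output: str) -> bool:
    # bool: would become score 0.0/1.0 and label "True"/"False"
    return bool(output.strip())


def length_ratio(output: str, expected: str) -> float:
    # float: would become a numeric score
    longest = max(len(output), len(expected), 1)
    return min(len(output), len(expected)) / longest


def verdict(output: str, expected: str) -> str:
    # str: would become a categorical label
    return "pass" if output == expected else "fail"


def graded(output: str, expected: str) -> Tuple[float, str, str]:
    # tuple: (score, label, explanation)
    if output == expected:
        return 1.0, "exact", "Output equals the expected answer."
    return 0.0, "mismatch", "Output differs from the expected answer."


def detailed(output: str, expected: str) -> Dict[str, Any]:
    # dict: already in the structured form, so no conversion is needed
    return {"score": float(output == expected), "label": verdict(output, expected),
            "metadata": {"output_length": len(output)}}


print(is_non_empty("Paris"))           # True
print(length_ratio("Paris", "Paris"))  # 1.0
print(verdict("Paris", "London"))      # fail
```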
Evaluators are classified by kind, which broadly indicates how the evaluation is performed:
- CODE: The evaluator uses deterministic logic (string matching, numerical comparison, rule-based checks).
- LLM: The evaluator uses a language model to assess quality (relevance, correctness, coherence).
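The difference between the two kinds can be sketched as follows. The first evaluator is purely deterministic; the second delegates the judgment to a model. `make_llm_judge` and `ask_model` are hypothetical names, and the model call is replaced by a stub so the example runs without any external service.

```python
from typing import Callable, Tuple


def contains_expected(output: str, expected: str) -> bool:
    """CODE-style evaluator: a deterministic substring check."""
    return expected.lower() in output.lower()


def make_llm_judge(ask_model: Callable[[str], str]):
    """LLM-style evaluator factory; ask_model stands in for a real model client."""

    def relevance(input: str, output: str) -> Tuple[float, str, str]:
        prompt = (
            "Answer with 'relevant' or 'irrelevant'.\n"
            f"Question: {input}\nAnswer: {output}"
        )
        reply = ask_model(prompt).strip().lower()
        is_relevant = reply.startswith("relevant")
        label = "relevant" if is_relevant else "irrelevant"
        return float(is_relevant), label, f"Judge replied: {reply}"

    return relevance


# Stub model so the sketch is runnable; a real judge would call a language model.
judge = make_llm_judge(lambda prompt: "Relevant")

print(contains_expected("The capital is Paris.", "paris"))   # True
print(judge(input="Capital of France?", output="Paris"))     # (1.0, 'relevant', 'Judge replied: relevant')
```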
Usage
Evaluator definition should be applied in the following scenarios:
- Correctness checks: When verifying that task output matches an expected answer exactly or approximately.
- Quality scoring: When assigning a numeric quality score to task outputs based on criteria such as relevance, completeness, or coherence.
- Classification: When categorizing task outputs into discrete labels (e.g., "pass"/"fail", sentiment categories).
- LLM-as-judge: When using a language model to evaluate the quality of another model's output, providing both a score and a natural language explanation.
- Multi-metric evaluation: When applying multiple evaluators to the same task output to capture different quality dimensions simultaneously.
- Custom scoring: When the default result normalization is insufficient and a custom scorer function is needed to convert arbitrary evaluator outputs into structured EvaluationResult records.
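The multi-metric scenario can be sketched as a loop that applies several evaluators to one task output. `run_evaluators` is a hypothetical helper written for this illustration and is not part of any library; a real framework would perform this step itself.

```python
from typing import Any, Callable, Dict, List


def exact_match(output: str, expected: str) -> bool:
    return output == expected


def mentions_expected(output: str, expected: str) -> bool:
    return expected.lower() in output.lower()


def brevity(output: str, expected: str) -> float:
    # 1.0 when the output is no longer than 50 characters, decreasing beyond that.
    return min(1.0, 50 / max(len(output), 1))


def run_evaluators(
    evaluators: List[Callable[[str, str], Any]], output: str, expected: str
) -> Dict[str, Any]:
    """Hypothetical helper: one raw result per evaluator, keyed by function name."""
    return {fn.__name__: fn(output, expected) for fn in evaluators}


results = run_evaluators(
    [exact_match, mentions_expected, brevity],
    output="The capital is Paris.",
    expected="Paris",
)
print(results)  # {'exact_match': False, 'mentions_expected': True, 'brevity': 1.0}
```

Each evaluator captures one quality dimension, so a task output can fail an exact-match check while still passing a looser containment check.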
Theoretical Basis
The evaluator definition pattern applies the strategy pattern to evaluation: each evaluator encapsulates one scoring strategy, and the framework applies all of them uniformly to experiment results, so strategies can be composed freely.
The evaluation result type system is defined as:
ExperimentEvaluation = {
    score: Optional[float],        # Numeric assessment
    label: Optional[str],          # Categorical assessment
    explanation: Optional[str],    # Human-readable rationale
    name: Optional[str],           # Evaluator identity
    metadata: Optional[Dict]       # Additional structured data
}
EvaluationResult = Union[
    ExperimentEvaluation,          # Single evaluation
    List[ExperimentEvaluation]     # Multiple evaluations from one evaluator
]
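The list branch of the union means that a single evaluator can report several named evaluations at once. A minimal sketch, using plain dictionaries with the field names from the definition above (the `style_checks` evaluator itself is invented for illustration):

```python
from typing import Any, Dict, List


def style_checks(output: str) -> List[Dict[str, Any]]:
    """One evaluator producing two evaluations, distinguished by name."""
    ends_with_period = output.endswith(".")
    return [
        {"name": "word_count", "score": float(len(output.split()))},
        {
            "name": "ends_with_period",
            "score": float(ends_with_period),
            "label": str(ends_with_period),
        },
    ]


for evaluation in style_checks("Paris is the capital."):
    print(evaluation)
# {'name': 'word_count', 'score': 4.0}
# {'name': 'ends_with_period', 'score': 1.0, 'label': 'True'}
```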
The default scoring function implements the following conversion rules:
def default_scorer(result):
    if isinstance(result, EvaluationScore):  # phoenix-evals Score object
        return convert_score_to_evaluation(result)
    if isinstance(result, EvaluationResult):  # dict passthrough
        return result
    if isinstance(result, bool):  # checked before int, since bool is a subclass of int
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) >= 2:
        evaluation = {"score": result[0], "label": result[1]}
        if len(result) > 2:  # optional third element is the explanation
            evaluation["explanation"] = result[2]
        return evaluation
    raise ValueError("Unsupported evaluation result type")
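To make the conversion rules concrete, here is a self-contained sketch that can be run as-is. It is a simplified stand-in rather than the library's scorer: the name `normalize` is hypothetical, the phoenix-evals Score branch is omitted, and the dict passthrough is approximated with a plain `isinstance(result, dict)` check.

```python
from typing import Any, Dict


def normalize(result: Any) -> Dict[str, Any]:
    """Simplified, hypothetical version of the conversion rules above."""
    if isinstance(result, dict):
        return result
    # bool must be tested before int/float: isinstance(True, int) is True.
    if isinstance(result, bool):
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) >= 2:
        evaluation = {"score": result[0], "label": result[1]}
        if len(result) > 2:
            evaluation["explanation"] = result[2]
        return evaluation
    raise ValueError("Unsupported evaluation result type")


print(normalize(True))                    # {'score': 1.0, 'label': 'True'}
print(normalize(3))                       # {'score': 3.0}
print(normalize("pass"))                  # {'label': 'pass'}
print(normalize((0.5, "partial", "ok")))  # {'score': 0.5, 'label': 'partial', 'explanation': 'ok'}
```

The order of the checks matters: if the numeric branch came first, a boolean result would lose its "True"/"False" label.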
The Evaluator protocol defines the interface that all evaluators must satisfy:
Protocol Evaluator:
    name: str                                       # Evaluator identity
    kind: str                                       # "CODE" or "LLM"
    evaluate(**kwargs) -> EvaluationResult          # Sync evaluation
    async_evaluate(**kwargs) -> EvaluationResult    # Async evaluation
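In Python, an interface of this shape can be expressed with `typing.Protocol`. The sketch below is an assumption about how such a protocol might be written, not the library's definition; `EvaluatorLike` and `ContainsKeyword` are hypothetical names.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EvaluatorLike(Protocol):
    """Hypothetical structural interface mirroring the protocol above."""

    name: str
    kind: str

    def evaluate(self, **kwargs: Any) -> Any: ...

    async def async_evaluate(self, **kwargs: Any) -> Any: ...


class ContainsKeyword:
    """Satisfies the protocol structurally, without inheriting from it."""

    kind = "CODE"

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword
        self.name = f"contains_{keyword}"

    def evaluate(self, **kwargs: Any) -> Any:
        found = self.keyword.lower() in str(kwargs.get("output", "")).lower()
        return {"score": float(found), "label": str(found), "name": self.name}

    async def async_evaluate(self, **kwargs: Any) -> Any:
        return self.evaluate(**kwargs)


evaluator = ContainsKeyword("paris")
print(isinstance(evaluator, EvaluatorLike))    # True
print(evaluator.evaluate(output="Paris, France"))
print(asyncio.run(evaluator.async_evaluate(output="London")))
```

Note that a `runtime_checkable` `isinstance` check only verifies that the members exist; it does not validate their signatures.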
The BaseEvaluator abstract class provides a convenient base for implementing custom evaluators, with the following guarantees:
- Subclasses must implement at least one of evaluate or async_evaluate.
- The async version defaults to calling the sync version if not overridden.
- Method signatures are validated at class creation time to ensure compatibility.
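The first two guarantees can be sketched with `__init_subclass__`, which runs when a subclass is created. This is an illustrative stand-in, not the actual BaseEvaluator: the class name `SketchBaseEvaluator` is hypothetical, and signature validation is not shown.

```python
import asyncio
from typing import Any


class SketchBaseEvaluator:
    """Hypothetical base class illustrating the guarantees listed above."""

    kind = "CODE"

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Checked at class creation: at least one method must be overridden.
        overrides_sync = cls.evaluate is not SketchBaseEvaluator.evaluate
        overrides_async = cls.async_evaluate is not SketchBaseEvaluator.async_evaluate
        if not (overrides_sync or overrides_async):
            raise TypeError(f"{cls.__name__} must implement evaluate or async_evaluate")

    @property
    def name(self) -> str:
        return type(self).__name__

    def evaluate(self, **kwargs: Any) -> Any:
        raise NotImplementedError

    async def async_evaluate(self, **kwargs: Any) -> Any:
        # Default: fall back to the synchronous implementation.
        return self.evaluate(**kwargs)


class ExactMatch(SketchBaseEvaluator):
    def evaluate(self, *, output: Any, expected: Any, **_: Any) -> Any:
        matched = output == expected
        return {"score": float(matched), "label": str(matched)}


result = asyncio.run(ExactMatch().async_evaluate(output="4", expected="4"))
print(result)  # {'score': 1.0, 'label': 'True'}
```

Because the check runs at class creation, an incomplete subclass fails when it is defined rather than partway through an experiment run.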
The evaluator binding mechanism mirrors the task binding mechanism but defaults single-argument evaluators to the output parameter rather than input. This reflects the fundamental asymmetry between tasks (which consume inputs) and evaluators (which assess outputs).