Principle: Arize AI Phoenix Evaluator Design
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Evaluator Architecture, Classification |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluator design is the discipline of defining reusable, composable evaluation criteria that assess LLM outputs against expected quality standards, producing structured scores suitable for aggregation and analysis.
Description
LLM applications require systematic quality measurement across dimensions such as relevance, hallucination, toxicity, factual accuracy, and domain-specific criteria. Manually inspecting outputs does not scale, and ad-hoc scripts lack the consistency needed for reliable comparisons across model versions or prompt iterations.
Evaluator design addresses these challenges by establishing a layered abstraction:
- `Evaluator` base class defines the contract: every evaluator must accept an input record (a dictionary of field values) and return a list of `Score` objects. It also provides input validation, field remapping via `bind()`, OpenTelemetry tracing, and introspection through `describe()`.
- `LLMEvaluator` extends the base by adding an LLM instance, a prompt template with variable placeholders, and an optional tool/JSON schema for structured output. Input fields are inferred automatically from prompt template variables, and a Pydantic input schema is generated dynamically when none is supplied.
- `ClassificationEvaluator` further specializes by adding a set of classification choices (labels, optional numeric scores, optional descriptions) and requesting that the LLM select among them. It supports label-only, label-with-score, and label-with-score-and-description formats.
- `create_evaluator` decorator enables code-based evaluators by turning a plain Python function into a fully featured `Evaluator` instance, automatically converting the function's return value (number, boolean, string, dict, tuple, or `Score`) into a proper `Score` object.
- `create_classifier` factory provides a concise function call to construct a `ClassificationEvaluator` without directly instantiating the class.
- `bind_evaluator` helper creates a shallow copy of an evaluator with a fixed input mapping, so the same evaluator logic can operate on data with different column naming conventions.
This layered design allows teams to start with built-in evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`, `ToxicityEvaluator`, `SummarizationEvaluator`, `SQLEvaluator`), customize them by writing their own prompt templates and classification choices, or create entirely code-based evaluators without an LLM dependency.
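The base contract described above can be sketched in plain Python. This is an illustrative re-implementation, not the actual Phoenix classes: the class names mirror the text, and `ContainsKeyword` is a hypothetical code-based evaluator added for demonstration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Simplified Score: the real one carries more fields (see Theoretical Basis).
@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None
    label: Optional[str] = None

class Evaluator:
    """Base contract: accept an input record, return a list of Scores."""

    def __init__(self, name: str, required_fields: List[str]):
        self.name = name
        self.required_fields = required_fields

    def evaluate(self, record: Dict[str, Any]) -> List[Score]:
        # Input validation lives in the base class, as described above.
        missing = [f for f in self.required_fields if f not in record]
        if missing:
            raise ValueError(f"missing input fields: {missing}")
        return self._evaluate(record)

    def _evaluate(self, record: Dict[str, Any]) -> List[Score]:
        raise NotImplementedError

class ContainsKeyword(Evaluator):
    """A code-based evaluator with no LLM dependency (hypothetical example)."""

    def __init__(self, keyword: str):
        super().__init__(name="contains_keyword", required_fields=["output"])
        self.keyword = keyword

    def _evaluate(self, record: Dict[str, Any]) -> List[Score]:
        hit = self.keyword.lower() in record["output"].lower()
        return [Score(name=self.name, score=float(hit), label=str(hit))]

evaluator = ContainsKeyword("refund")
print(evaluator.evaluate({"output": "A refund was issued."})[0].score)  # 1.0
```

Keeping validation in the base class is what makes the subclasses composable: `LLMEvaluator` and `ClassificationEvaluator` only need to supply `_evaluate`-style logic.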
Usage
Use evaluator design principles when you need to:
- Define custom evaluation criteria for a specific domain or task.
- Combine LLM-based judgements with deterministic code-based checks in the same pipeline.
- Ensure consistent, repeatable scoring across experiment iterations.
- Reuse the same evaluation logic on datasets with varying column schemas (via input mapping).
- Produce structured `Score` objects that can be aggregated, visualized, and tracked over time.
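The input-mapping use case above can be sketched as follows. This is a simplified stand-in for `bind_evaluator`, not the real Phoenix helper; `ExactMatch` and the column names are hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None

class ExactMatch:
    """Evaluator logic written against canonical field names."""
    name = "exact_match"
    input_mapping: Dict[str, str] = {}

    def evaluate(self, record: Dict[str, Any]) -> List[Score]:
        # Remap the evaluator's fields ("output", "expected") to whatever
        # columns this dataset actually uses.
        def get(field: str) -> Any:
            return record[self.input_mapping.get(field, field)]
        return [Score(self.name, float(get("output") == get("expected")))]

def bind_evaluator(evaluator, input_mapping: Dict[str, str]):
    bound = copy.copy(evaluator)   # shallow copy, as described above
    bound.input_mapping = input_mapping
    return bound

base = ExactMatch()
bound = bind_evaluator(base, {"output": "model_answer", "expected": "gold"})
print(bound.evaluate({"model_answer": "Paris", "gold": "Paris"})[0].score)  # 1.0
```

Because the copy is shallow, the original evaluator keeps its default (identity) mapping and can still run on datasets that use the canonical column names.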
Theoretical Basis
Score Representation
Every evaluation produces one or more instances of `Score`, a frozen dataclass with the following fields:
Score
+-- name: str # Identifier (typically the evaluator name)
+-- score: Optional[float|int] # Numeric value (None for label-only)
+-- label: Optional[str] # Categorical label
+-- explanation: Optional[str] # LLM-generated reasoning
+-- metadata: Dict[str, Any] # Arbitrary metadata (model, trace_id, etc.)
+-- kind: "human"|"llm"|"code" # Source of the evaluation
+-- direction: "maximize"|"minimize" # Optimization direction
The direction field indicates whether higher scores are better ("maximize") or worse ("minimize"), enabling automated optimization loops and dashboard visualizations to interpret results correctly.
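The shape above, and how `direction` lets tooling rank heterogeneous metrics on a single "higher is better" scale, can be sketched like this (illustrative only; the real dataclass has additional behavior):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: str = "code"             # "human" | "llm" | "code"
    direction: str = "maximize"    # "maximize" | "minimize"

def as_maximize(s: Score) -> float:
    """Flip minimize-direction scores so rankings read uniformly."""
    return s.score if s.direction == "maximize" else -s.score

relevance = Score(name="relevance", score=0.9, kind="llm")
toxicity = Score(name="toxicity", score=0.2, kind="llm", direction="minimize")

# Without direction, a dashboard could not tell that toxicity 0.2 is good
# while relevance 0.2 would be bad.
best = max([relevance, toxicity], key=as_maximize)
print(best.name)  # relevance
```

Freezing the dataclass means a `Score` cannot be mutated after the evaluation runs, which keeps aggregated results trustworthy.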
Classification via Constrained Generation
`ClassificationEvaluator` leverages the LLM's structured output or tool-calling capabilities to constrain the response to one of the declared labels. This produces deterministic, parseable labels rather than freeform text, reducing post-processing errors. The workflow is:
1. Render prompt template with input variables
2. Generate classification schema from choices
3. Call LLM.generate_classification(prompt, labels, include_explanation)
4. Validate returned label against declared choices
5. Map label to numeric score (if score mapping provided)
6. Wrap result in Score(name, score, label, explanation, metadata, kind, direction)
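The six steps above can be walked through with a stubbed LLM call. This is a sketch, not Phoenix code: `stub_llm_classify` stands in for the model's structured-output call, and the hallucination template and choices are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None

TEMPLATE = "Is the answer grounded in the context?\ncontext: {context}\nanswer: {answer}"
CHOICES: Dict[str, float] = {"factual": 1.0, "hallucinated": 0.0}

def stub_llm_classify(prompt: str, labels: List[str]) -> Tuple[str, str]:
    # Stand-in for the constrained-generation call; a real LLM would pick
    # among `labels` via a tool/JSON schema. Here it always answers "factual".
    return "factual", "The answer restates the context."

def evaluate_hallucination(record: Dict[str, str]) -> Score:
    prompt = TEMPLATE.format(**record)                        # 1. render template
    labels = list(CHOICES)                                    # 2. schema from choices
    label, explanation = stub_llm_classify(prompt, labels)    # 3. call LLM
    if label not in CHOICES:                                  # 4. validate label
        raise ValueError(f"unexpected label: {label}")
    score = CHOICES[label]                                    # 5. map label to score
    return Score(name="hallucination", score=score,           # 6. wrap in Score
                 label=label, explanation=explanation)

result = evaluate_hallucination({"context": "Paris is in France.",
                                 "answer": "Paris is in France."})
print(result.label, result.score)  # factual 1.0
```

The validation in step 4 is what makes the pipeline robust: even if the model drifts outside the declared label set, the failure is explicit rather than silently polluting aggregates.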
Function-to-Evaluator Conversion
The `create_evaluator` decorator inspects the decorated function's signature to automatically build a Pydantic input schema. Return values are converted as follows:
| Return Type | Conversion Rule |
|---|---|
| `Score` | Used directly (name, kind, direction overridden) |
| `int` / `float` | Becomes `Score.score` |
| `bool` | Becomes `Score.score` (as float) and `Score.label` (as string) |
| short `str` (up to 3 words) | Becomes `Score.label` |
| long `str` (4+ words) | Becomes `Score.explanation` |
| `dict` | Keys `"score"`, `"label"`, `"explanation"` are extracted |
| `tuple` | Elements are dispatched by type (number -> score, short string -> label, long string -> explanation) |
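The conversion rules in the table can be sketched as a single dispatch function. This is a simplified illustration, not the actual decorator internals (tuple dispatch and metadata handling are omitted); "short" here follows the table's 3-word threshold.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None

def to_score(name: str, value: Any) -> Score:
    if isinstance(value, Score):
        return value                                  # used directly
    if isinstance(value, bool):                       # must test before int:
        return Score(name, score=float(value),        # bool is an int subtype
                     label=str(value))
    if isinstance(value, (int, float)):
        return Score(name, score=float(value))
    if isinstance(value, str):
        if len(value.split()) <= 3:                   # short str -> label
            return Score(name, label=value)
        return Score(name, explanation=value)         # long str -> explanation
    if isinstance(value, dict):                       # extract known keys
        return Score(name, score=value.get("score"),
                     label=value.get("label"),
                     explanation=value.get("explanation"))
    raise TypeError(f"unsupported return type: {type(value).__name__}")

print(to_score("check", True).label)        # True
print(to_score("check", 0.75).score)        # 0.75
print(to_score("check", "relevant").label)  # relevant
```

The `bool`-before-`int` ordering matters because Python's `bool` is a subclass of `int`: swapping the checks would silently turn `True` into a bare `1.0` with no label.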