
Principle:Arize AI Phoenix Evaluator Design

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Evaluator Architecture, Classification
Last Updated: 2026-02-14 00:00 GMT

Overview

Evaluator design is the discipline of defining reusable, composable evaluation criteria that assess LLM outputs against expected quality standards, producing structured scores suitable for aggregation and analysis.

Description

LLM applications require systematic quality measurement across dimensions such as relevance, hallucination, toxicity, factual accuracy, and domain-specific criteria. Manually inspecting outputs does not scale, and ad-hoc scripts lack the consistency needed for reliable comparisons across model versions or prompt iterations.

Evaluator design addresses these challenges by establishing a layered abstraction:

  • Evaluator base class defines the contract: every evaluator must accept an input record (a dictionary of field values) and return a list of Score objects. It also provides input validation, field remapping via bind(), OpenTelemetry tracing, and introspection through describe().
  • LLMEvaluator extends the base by adding an LLM instance, a prompt template with variable placeholders, and an optional tool/JSON schema for structured output. Input fields are inferred automatically from prompt template variables, and a Pydantic input schema is generated dynamically when none is supplied.
  • ClassificationEvaluator further specializes by adding a set of classification choices (labels, optional numeric scores, optional descriptions) and requesting the LLM to select among them. It supports label-only, label-with-score, and label-with-score-and-description formats.
  • create_evaluator decorator enables code-based evaluators by turning a plain Python function into a fully featured Evaluator instance, automatically converting the function's return value (number, boolean, string, dict, tuple, or Score) into a proper Score object.
  • create_classifier factory provides a concise function call to construct a ClassificationEvaluator without directly instantiating the class.
  • bind_evaluator helper creates a shallow copy of an evaluator with a fixed input mapping, so the same evaluator logic can operate on data with different column naming conventions.

This layered design allows teams to start with built-in evaluators (HallucinationEvaluator, QAEvaluator, RelevanceEvaluator, ToxicityEvaluator, SummarizationEvaluator, SQLEvaluator), customize them by writing their own prompt templates and classification choices, or create entirely code-based evaluators without an LLM dependency.
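The code-based path in this layered design can be sketched as follows. This is a minimal, hypothetical mimic of the create_evaluator pattern, not Phoenix's actual implementation: the Score fields mirror those described under Theoretical Basis, and exact_match is a made-up example evaluator.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Minimal stand-in for the Score dataclass described below.
@dataclass(frozen=True)
class Score:
    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: str = "code"           # "human" | "llm" | "code"
    direction: str = "maximize"  # "maximize" | "minimize"

def create_evaluator(name: str, direction: str = "maximize"):
    """Hypothetical decorator: wrap a plain function so that calling it
    with an input record returns a list of Score objects."""
    def wrap(fn: Callable[..., Any]):
        def evaluate(record: Dict[str, Any]) -> List[Score]:
            result = fn(**record)
            # bool must be checked before int/float (bool is an int subclass)
            if isinstance(result, bool):
                return [Score(name=name, score=float(result),
                              label=str(result), direction=direction)]
            if isinstance(result, (int, float)):
                return [Score(name=name, score=float(result),
                              direction=direction)]
            return [Score(name=name, label=str(result), direction=direction)]
        return evaluate
    return wrap

@create_evaluator(name="exact_match")
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

scores = exact_match({"output": "42", "expected": "42"})
```

Note how the decorated function keeps a plain, typed signature; the wrapper handles record unpacking and Score construction, which is what lets code-based evaluators sit alongside LLM-based ones in the same pipeline.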

Usage

Use evaluator design principles when you need to:

  • Define custom evaluation criteria for a specific domain or task.
  • Combine LLM-based judgements with deterministic code-based checks in the same pipeline.
  • Ensure consistent, repeatable scoring across experiment iterations.
  • Reuse the same evaluation logic on datasets with varying column schemas (via input mapping).
  • Produce structured Score objects that can be aggregated, visualized, and tracked over time.
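As one concrete instance of the input-mapping bullet above, here is a minimal sketch of the bind-style remapping idea, assuming nothing about Phoenix's real bind_evaluator beyond what this page describes; the relevance check is a toy substring heuristic for illustration only.

```python
from typing import Any, Callable, Dict

def bind_evaluator(evaluator: Callable[[Dict[str, Any]], Any],
                   input_mapping: Dict[str, str]) -> Callable[[Dict[str, Any]], Any]:
    """Hypothetical helper: return a wrapper that remaps dataset column
    names onto the evaluator's expected field names before calling it."""
    def bound(record: Dict[str, Any]):
        remapped = {f: record[col] for f, col in input_mapping.items()}
        return evaluator(remapped)
    return bound

def toy_relevance(record: Dict[str, Any]) -> bool:
    # Expects the canonical fields "input" and "output".
    return record["output"] != "" and record["input"] in record["output"]

# This dataset names its columns "question"/"answer" instead.
bound = bind_evaluator(toy_relevance, {"input": "question", "output": "answer"})
verdict = bound({"question": "capital of France",
                 "answer": "The capital of France is Paris."})
```

The evaluator logic itself never changes; only the thin mapping layer is swapped per dataset, which is what makes the same evaluator reusable across schemas.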

Theoretical Basis

Score Representation

Every evaluation produces one or more Score instances, a frozen dataclass with the following fields:

Score
  +-- name:        str                   # Identifier (typically the evaluator name)
  +-- score:       Optional[float|int]   # Numeric value (None for label-only)
  +-- label:       Optional[str]         # Categorical label
  +-- explanation: Optional[str]         # LLM-generated reasoning
  +-- metadata:    Dict[str, Any]        # Arbitrary metadata (model, trace_id, etc.)
  +-- kind:        "human"|"llm"|"code"  # Source of the evaluation
  +-- direction:   "maximize"|"minimize" # Optimization direction

The direction field indicates whether higher scores are better ("maximize") or worse ("minimize"), enabling automated optimization loops and dashboard visualizations to interpret results correctly.
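A short sketch of why direction matters downstream: a consumer can flip "minimize" scores so that larger is always better before ranking or plotting. The Score subset and the helper below are illustrative, not Phoenix API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Score:  # minimal subset of the fields listed above
    name: str
    score: Optional[float]
    direction: str = "maximize"

def oriented_value(s: Score) -> float:
    """Flip 'minimize' scores so that larger is uniformly better,
    letting dashboards and optimizers compare evaluators directly."""
    assert s.score is not None
    return s.score if s.direction == "maximize" else -s.score

best = max(
    [Score("toxicity", 0.2, "minimize"), Score("relevance", 0.9, "maximize")],
    key=oriented_value,
)
```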

Classification via Constrained Generation

ClassificationEvaluator leverages the LLM's structured output or tool-calling capabilities to constrain the response to one of the declared labels. This produces deterministic, parseable labels rather than freeform text, reducing post-processing errors. The workflow is:

1. Render prompt template with input variables
2. Generate classification schema from choices
3. Call LLM.generate_classification(prompt, labels, include_explanation)
4. Validate returned label against declared choices
5. Map label to numeric score (if score mapping provided)
6. Wrap result in Score(name, score, label, explanation, metadata, kind, direction)
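The six steps above can be sketched end to end; the LLM call is stubbed out, and the choice set, template, and evaluator name are all illustrative rather than Phoenix's built-ins.

```python
from typing import Dict, List, Tuple

# Hypothetical classification choices: label -> numeric score mapping.
CHOICES = {"factual": 1.0, "hallucinated": 0.0}
TEMPLATE = ("Is the answer grounded in the context?\n"
            "context: {context}\nanswer: {answer}")

def stub_llm_classify(prompt: str, labels: List[str]) -> Tuple[str, str]:
    # Stand-in for the LLM's constrained-generation call;
    # a real call would return one of `labels` plus an explanation.
    return "factual", "The answer restates the context."

def classify(record: Dict[str, str]) -> Dict[str, object]:
    prompt = TEMPLATE.format(**record)                       # 1. render template
    labels = list(CHOICES)                                   # 2. schema from choices
    label, explanation = stub_llm_classify(prompt, labels)   # 3. call LLM
    if label not in CHOICES:                                 # 4. validate label
        raise ValueError(f"unexpected label: {label!r}")
    score = CHOICES[label]                                   # 5. map label -> score
    return {"name": "hallucination", "score": score,         # 6. wrap as Score-like
            "label": label, "explanation": explanation}

result = classify({"context": "Paris is in France.",
                   "answer": "Paris is in France."})
```

Validating in step 4 before mapping in step 5 is what makes the pipeline robust to an LLM occasionally emitting a label outside the declared choices.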

Function-to-Evaluator Conversion

The create_evaluator decorator inspects the decorated function's signature to automatically build a Pydantic input schema. Return values are converted as follows:

Return Type              Conversion Rule
Score                    Used directly (name, kind, direction overridden)
int / float              Becomes Score.score
bool                     Becomes Score.score (as float) and Score.label (as string)
short str (up to 3 words)  Becomes Score.label
long str (4+ words)      Becomes Score.explanation
dict                     Keys "score", "label", "explanation" are extracted
tuple                    Elements are dispatched by type (number -> score, short string -> label, long string -> explanation)
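The conversion table above can be sketched as a single dispatch function; this is a hypothetical reading of the rules, not Phoenix's actual converter, and it returns a plain dict of Score fields for brevity.

```python
from typing import Any, Dict

def to_score_fields(value: Any) -> Dict[str, Any]:
    """Map a raw evaluator return value onto Score's
    score/label/explanation fields per the table above."""
    if isinstance(value, bool):          # before int: bool is an int subclass
        return {"score": float(value), "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": float(value)}
    if isinstance(value, str):
        # up to 3 words -> label; 4+ words -> explanation
        key = "label" if len(value.split()) <= 3 else "explanation"
        return {key: value}
    if isinstance(value, dict):
        return {k: value[k]
                for k in ("score", "label", "explanation") if k in value}
    if isinstance(value, tuple):
        fields: Dict[str, Any] = {}
        for item in value:               # dispatch each element by type
            fields.update(to_score_fields(item))
        return fields
    raise TypeError(f"unsupported return type: {type(value).__name__}")

fields = to_score_fields((0.5, "partial", "Answer covers half the expected points."))
```

Checking bool before int/float matters in Python, since True and False would otherwise be swallowed by the numeric branch.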

Related Pages

Implemented By
