Implementation:Arize ai Phoenix Create Evaluator Decorator

From Leeroopedia
Knowledge Sources
Domains: AI Observability, Evaluation Metrics, Experiment Design
Last Updated: 2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the Phoenix Client library, for defining experiment evaluators. It uses a decorator pattern to transform user-defined scoring functions into fully configured Evaluator instances.

Description

The create_evaluator decorator configures a synchronous or asynchronous function to be used as an experiment evaluator. It wraps the decorated function with parameter binding logic, result normalization (via a scorer function), and metadata (name and kind) to produce an Evaluator instance that conforms to the Evaluator protocol.

The decorator handles the following responsibilities:

  • Signature validation: Checks that each parameter of the function either uses a valid binding name (input, output, expected, reference, metadata, example) or has a default value; a **kwargs catch-all is also accepted.
  • Parameter binding: At evaluation time, automatically binds function parameters to the corresponding values from the experiment run and dataset example. Single-argument evaluators bind to output by default.
  • Result normalization: Converts the function's return value into a structured EvaluationResult using either the default scorer or a custom scorer function provided by the user.
  • Async support: Handles both synchronous and asynchronous evaluator functions, creating the appropriate wrapper class.
  • Phoenix Evals integration: Transparently wraps evaluators from the phoenix-evals package (EvalsEvaluator protocol).
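
The signature validation and parameter binding responsibilities listed above can be pictured with a small standard-library sketch. It is not the Phoenix implementation: `VALID_NAMES`, `validate_signature`, and `bind_parameters` are hypothetical names, and the real library may differ in details such as error types and edge cases.

```python
import inspect

# Binding names taken from the list above; everything else here is hypothetical.
VALID_NAMES = {"input", "output", "expected", "reference", "metadata", "example"}


def validate_signature(fn):
    """Reject multi-parameter functions whose required parameters cannot be bound."""
    params = inspect.signature(fn).parameters
    if len(params) == 1:
        return  # a single parameter is bound to `output` whatever its name
    for name, p in params.items():
        if p.kind is inspect.Parameter.VAR_KEYWORD:
            continue  # a **kwargs catch-all is accepted
        if name not in VALID_NAMES and p.default is inspect.Parameter.empty:
            raise ValueError(f"cannot bind parameter {name!r}")


def bind_parameters(fn, available):
    """Pick the values `fn` asks for out of the `available` mapping."""
    params = inspect.signature(fn).parameters
    if len(params) == 1:
        (only,) = params
        return {only: available["output"]}
    return {name: available[name] for name in params if name in available}


def exact_match(output, expected):
    return output == expected["answer"]


def non_empty(text):
    return bool(text)


available = {"input": {"question": "2+2?"}, "output": "4", "expected": {"answer": "4"}}
validate_signature(exact_match)
print(exact_match(**bind_parameters(exact_match, available)))  # True
print(non_empty(**bind_parameters(non_empty, available)))      # True
```

Here `non_empty` receives the task output even though its parameter is called `text`, which mirrors the single-argument rule described above.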

Usage

Use this decorator when you need to define custom evaluation criteria for Phoenix experiments. The decorator is the primary mechanism for creating evaluators that can be passed to run_experiment or evaluate_experiment.

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-client/src/phoenix/client/resources/experiments/evaluators.py (lines 143-240)

Signature

def create_evaluator(
    kind: Union[str, AnnotatorKind] = AnnotatorKind.CODE,
    name: Optional[str] = None,
    scorer: Optional[Callable[[Any], EvaluationResult]] = None,
) -> Callable[[ExperimentEvaluator], Evaluator]

Import

from phoenix.client.experiments import create_evaluator

I/O Contract

Inputs (Decorator Parameters)

Name | Type | Required | Description
kind | Union[str, AnnotatorKind] | No | Broadly indicates how the evaluator scores a run. Valid values: "CODE" (deterministic logic), "LLM" (language model based). Default: AnnotatorKind.CODE.
name | Optional[str] | No | The name of the evaluator. If not provided, defaults to the decorated function's name (via __qualname__).
scorer | Optional[Callable[[Any], EvaluationResult]] | No | Custom function to convert the evaluator's return value into an EvaluationResult. If not provided, uses the default scorer.
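
As a rough illustration of how a decorator that takes options can attach a name and kind to a function, the following standard-library sketch uses a decorator factory. `make_evaluator` and `SimpleEvaluator` are hypothetical stand-ins, not Phoenix classes, and the sketch omits the scorer and parameter binding entirely.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SimpleEvaluator:
    """Hypothetical stand-in for an Evaluator: a function plus its metadata."""

    fn: Callable[..., Any]
    name: str
    kind: str

    def evaluate(self, **kwargs: Any) -> Any:
        return self.fn(**kwargs)


def make_evaluator(kind: str = "CODE", name: Optional[str] = None):
    """Decorator factory: the outer call captures options, the inner call wraps fn."""

    def decorator(fn: Callable[..., Any]) -> SimpleEvaluator:
        # Fall back to the function's qualified name when no name is given.
        return SimpleEvaluator(fn=fn, name=name or fn.__qualname__, kind=kind)

    return decorator


@make_evaluator(name="non-empty")
def non_empty(output):
    return bool(output)


@make_evaluator(kind="LLM")
def judged(output):
    return 1.0


print(non_empty.name, non_empty.kind)  # non-empty CODE
print(judged.kind)                     # LLM
```

After decoration, `non_empty` is no longer a plain function but an object carrying the metadata, which is the same shape of transformation the table above describes.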

Inputs (Evaluator Function Parameters)

Name | Type | Required | Description
output | Any | Auto-bound (default for single-arg) | The output produced by the experiment task. For single-argument evaluators, the parameter is bound to this value regardless of its name.
input | Mapping[str, Any] | Auto-bound | The input field of the dataset example.
expected | Mapping[str, Any] | Auto-bound | The expected or reference output from the dataset example.
reference | Mapping[str, Any] | Auto-bound | Alias for expected.
metadata | Mapping[str, Any] | Auto-bound | Metadata associated with the dataset example.
example | ExampleProxy | Auto-bound | The complete dataset Example object.

Outputs

Name | Type | Description
Evaluator | Evaluator | An Evaluator instance with evaluate(**kwargs) and async_evaluate(**kwargs) methods that return EvaluationResult.

EvaluationResult Type

Field | Type | Description
score | Optional[float] | Numeric evaluation score.
label | Optional[str] | Categorical evaluation label.
explanation | Optional[str] | Human-readable explanation of the evaluation.
name | Optional[str] | Evaluator name (for multi-output evaluators).
metadata | Optional[Mapping[str, Any]] | Additional structured evaluation metadata.
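
Because every field is optional, one way to model such a result in Python is a TypedDict declared with total=False. The definition below is an illustrative assumption that mirrors the fields listed above; `EvaluationResultSketch` is a hypothetical name and this is not copied from the library's source.

```python
from typing import Any, Mapping, Optional, TypedDict


class EvaluationResultSketch(TypedDict, total=False):
    """Illustrative mirror of the fields listed above; any key may be omitted."""

    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str]
    name: Optional[str]
    metadata: Optional[Mapping[str, Any]]


result: EvaluationResultSketch = {
    "score": 0.87,
    "label": "similar",
    "explanation": "Outputs differ by two characters.",
}
print(sorted(result))  # ['explanation', 'label', 'score']
```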

Default Scorer Conversions

Return Type | Resulting EvaluationResult
bool | {score: 0.0 or 1.0, label: "False" or "True"}
int or float | {score: float(value)}
str | {label: value}
tuple (score, label) | {score: float(score), label: str(label)}
tuple (score, label, explanation) | {score: float(score), label: str(label), explanation: str(explanation)}
EvaluationResult dict | Passed through directly
EvaluationScore (phoenix-evals) | Converted to ExperimentEvaluation with score, label, explanation, name, metadata
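
The conversions in this table can be approximated with a short standard-library function. This is a sketch that follows the table row by row, not the library's code: `default_scorer_sketch` is a hypothetical name, the phoenix-evals EvaluationScore row is left out, and raising TypeError for other types is an assumption of the sketch.

```python
from typing import Any, Dict


def default_scorer_sketch(result: Any) -> Dict[str, Any]:
    """Approximate the conversions in the table above."""
    if isinstance(result, dict):
        return result  # already result-shaped: pass through
    if isinstance(result, bool):  # must precede the numeric check: bool is an int
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) in (2, 3):
        converted = {"score": float(result[0]), "label": str(result[1])}
        if len(result) == 3:
            converted["explanation"] = str(result[2])
        return converted
    raise TypeError(f"unsupported evaluator return type: {type(result).__name__}")


print(default_scorer_sketch(True))   # {'score': 1.0, 'label': 'True'}
print(default_scorer_sketch(0.45))   # {'score': 0.45}
print(default_scorer_sketch((0.9, "good", "close match")))
# {'score': 0.9, 'label': 'good', 'explanation': 'close match'}
```

The bool check comes before the numeric check because `True` and `False` are also instances of int in Python; reversing the order would drop the label.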

Usage Examples

Boolean Evaluator (Exact Match)

from phoenix.client.experiments import create_evaluator

@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    """Returns True if output matches expected answer exactly."""
    return output == expected.get("answer")

# Result: {score: 1.0, label: "True"} or {score: 0.0, label: "False"}

Numeric Scorer

from phoenix.client.experiments import create_evaluator

@create_evaluator(kind="CODE", name="length-score")
def length_score(output):
    """Scores output by its length (longer is better, capped at 100 characters)."""
    if not isinstance(output, str):
        return 0.0
    return min(len(output) / 100.0, 1.0)

# Result: {score: 0.45} (for a 45-character output)

LLM-Based Evaluator

import openai
from phoenix.client.experiments import create_evaluator

llm_client = openai.OpenAI()

@create_evaluator(kind="LLM", name="relevance")
def relevance(output, input):
    """Uses an LLM to judge whether the output is relevant to the input."""
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Rate the relevance of this answer to the question on a scale of 0-1.\n"
                    f"Question: {input['question']}\n"
                    f"Answer: {output}\n"
                    f"Return only a number."
                ),
            },
        ],
    )
    return float(response.choices[0].message.content.strip())

# Result: {score: 0.85}
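
Calling float() directly on a model reply fails when the reply contains extra words or is empty. A more defensive parse might look like the sketch below; `parse_score` is a hypothetical helper, not part of Phoenix or the OpenAI client, and falling back to 0.0 for unparsable replies is a design choice you may want to change.

```python
import re
from typing import Optional


def parse_score(text: Optional[str], default: float = 0.0) -> float:
    """Extract the first number from an LLM reply and clamp it to [0, 1]."""
    if not text:
        return default
    match = re.search(r"-?(?:\d+(?:\.\d*)?|\.\d+)", text)
    if match is None:
        return default
    return min(max(float(match.group()), 0.0), 1.0)


print(parse_score("0.85"))          # 0.85
print(parse_score("Score: 0.7/1"))  # 0.7
print(parse_score("not sure"))      # 0.0
print(parse_score(None))            # 0.0
```

In the evaluator above, the final line could then return parse_score(response.choices[0].message.content) instead of calling float() on the raw text.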

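Tuple Evaluator (Score + Label + Explanation)

from phoenix.client.experiments import create_evaluator
from textdistance import levenshtein

@create_evaluator(kind="CODE", name="levenshtein-similarity")
def levenshtein_eval(output, expected):
    """Computes normalized Levenshtein similarity and returns score, label, and explanation."""
    expected_text = expected.get("answer", "")
    similarity = levenshtein.normalized_similarity(str(output), expected_text)
    # The 0.8 cutoff is an illustrative choice, not a Phoenix default.
    label = "similar" if similarity >= 0.8 else "different"
    # Per the default scorer conversions, a two-element tuple is read as
    # (score, label), so the explanation goes in the third position.
    return (
        similarity,
        label,
        f"Levenshtein similarity between output and expected: {similarity:.2f}",
    )

# Result: {score: 0.87, label: "similar", explanation: "Levenshtein similarity between output and expected: 0.87"}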

Custom Scorer

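import math
from collections import Counter

from phoenix.client.experiments import create_evaluator

def compute_similarity(a, b):
    """Illustrative helper (not part of Phoenix): bag-of-words cosine similarity."""
    va, vb = Counter(str(a).lower().split()), Counter(str(b).lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def custom_scorer(result):
    """Convert a custom result format into EvaluationResult."""
    return {
        "score": result["numeric_score"],
        "label": "pass" if result["numeric_score"] > 0.5 else "fail",
        "explanation": result["reasoning"],
    }

@create_evaluator(kind="CODE", name="custom-eval", scorer=custom_scorer)
def custom_evaluator(output, expected):
    """Returns a custom dict that gets processed by the scorer."""
    similarity = compute_similarity(output, expected.get("answer", ""))
    return {
        "numeric_score": similarity,
        "reasoning": f"Cosine similarity: {similarity:.3f}",
    }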

Async Evaluator

import openai
from phoenix.client.experiments import create_evaluator

async_llm = openai.AsyncOpenAI()

@create_evaluator(kind="LLM", name="async-quality")
async def quality_evaluator(output, input):
    """Async evaluator for concurrent evaluation execution."""
    response = await async_llm.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"Rate quality 0-1: Q={input['question']} A={output}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())
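
The benefit of an async evaluator is that many evaluations can wait on network calls at the same time. How Phoenix schedules them is not covered here; the sketch below only shows the general asyncio pattern, with `fake_quality` as a hypothetical stand-in for the LLM call.

```python
import asyncio


async def fake_quality(output: str) -> float:
    """Hypothetical stand-in for an LLM call: waits briefly, then scores by length."""
    await asyncio.sleep(0.01)
    return min(len(output) / 10.0, 1.0)


async def score_all(outputs):
    # gather() schedules every coroutine at once and preserves input order.
    return await asyncio.gather(*(fake_quality(o) for o in outputs))


scores = asyncio.run(score_all(["short", "a longer answer"]))
print(scores)  # [0.5, 1.0]
```

Both calls sleep concurrently, so the total wait is close to one sleep rather than two.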

Using Evaluators with Experiments

from phoenix.client import Client
from phoenix.client.experiments import run_experiment, create_evaluator

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return f"Answer: {input['question']}"

@create_evaluator(kind="CODE", name="has-answer")
def has_answer(output):
    return isinstance(output, str) and len(output) > 0

@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    return output == expected.get("answer")

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[has_answer, exact_match],
    experiment_name="evaluated-experiment",
)

Related Pages

Implements Principle
