Overview
A concrete tool, provided by the Phoenix Client library, for defining experiment evaluators. It uses a decorator pattern to transform user-defined scoring functions into fully configured Evaluator instances.
Description
The create_evaluator decorator configures a synchronous or asynchronous function to be used as an experiment evaluator. It wraps the decorated function with parameter binding logic, result normalization (via a scorer function), and metadata (name and kind) to produce an Evaluator instance that conforms to the Evaluator protocol.
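The decorator-factory pattern described above can be sketched in plain Python. This is an illustrative sketch only, not the Phoenix implementation: SketchEvaluator and sketch_create_evaluator are hypothetical names, and the sketch omits parameter binding and result normalization.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SketchEvaluator:
    """Hypothetical stand-in for an Evaluator: a function plus its metadata."""

    fn: Callable[..., Any]
    name: str
    kind: str

    def evaluate(self, **kwargs: Any) -> Any:
        # Forward keyword arguments straight to the wrapped function.
        return self.fn(**kwargs)


def sketch_create_evaluator(kind: str = "CODE", name: Optional[str] = None):
    """Decorator factory: returns a decorator that wraps fn in a SketchEvaluator."""

    def decorator(fn: Callable[..., Any]) -> SketchEvaluator:
        # Fall back to the function's qualified name when no name is given.
        return SketchEvaluator(fn=fn, name=name or fn.__qualname__, kind=kind)

    return decorator


@sketch_create_evaluator(name="non-empty")
def non_empty(output):
    return bool(output)
```

After decoration, non_empty is no longer a bare function but an object that carries its name and kind alongside the callable.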
The decorator handles the following responsibilities:
- Signature validation: Checks that each parameter of the function either uses a valid binding name (input, output, expected, reference, metadata, example), has a default value, or is a **kwargs catch-all.
- Parameter binding: At evaluation time, automatically binds function parameters to the corresponding values from the experiment run and dataset example. Single-argument evaluators bind to output by default.
- Result normalization: Converts the function's return value into a structured EvaluationResult using either the default scorer or a custom scorer function provided by the user.
- Async support: Handles both synchronous and asynchronous evaluator functions, creating the appropriate wrapper class.
- Phoenix Evals integration: Transparently wraps evaluators from the phoenix-evals package (EvalsEvaluator protocol).
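As a rough illustration of the signature validation step listed above, the following sketch uses the standard inspect module. It is an assumption about how such a check could be written, not the library's code; check_signature and BINDING_NAMES are hypothetical names.

```python
import inspect

# Binding names taken from the list above.
BINDING_NAMES = {"input", "output", "expected", "reference", "metadata", "example"}


def check_signature(fn):
    """Raise ValueError if a required parameter cannot be bound by name."""
    params = list(inspect.signature(fn).parameters.values())
    if len(params) == 1:
        return  # a lone parameter is bound to output, whatever its name
    for param in params:
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            continue  # **kwargs is allowed
        if param.default is not inspect.Parameter.empty:
            continue  # parameters with defaults need no binding
        if param.name not in BINDING_NAMES:
            raise ValueError(f"cannot bind parameter {param.name!r}")


def good(output, expected, threshold=0.5, **extra):
    return output == expected


def bad(output, prediction):
    return output == prediction


check_signature(good)  # passes silently
try:
    check_signature(bad)
except ValueError as err:
    message = str(err)  # "prediction" is neither a binding name nor defaulted
```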
Usage
Use this decorator when you need to define custom evaluation criteria for Phoenix experiments. The decorator is the primary mechanism for creating evaluators that can be passed to run_experiment or evaluate_experiment.
Code Reference
Source Location
- Repository: Phoenix
- File: packages/phoenix-client/src/phoenix/client/resources/experiments/evaluators.py (lines 143-240)
Signature
def create_evaluator(
    kind: Union[str, AnnotatorKind] = AnnotatorKind.CODE,
    name: Optional[str] = None,
    scorer: Optional[Callable[[Any], EvaluationResult]] = None,
) -> Callable[[ExperimentEvaluator], Evaluator]
Import
from phoenix.client.experiments import create_evaluator
I/O Contract
Inputs (Decorator Parameters)
| Name | Type | Required | Description |
|---|---|---|---|
| kind | Union[str, AnnotatorKind] | No | Broadly indicates how the evaluator scores a run. Valid values: "CODE" (deterministic logic), "LLM" (language model based). Default: AnnotatorKind.CODE. |
| name | Optional[str] | No | The name of the evaluator. If not provided, defaults to the decorated function's name (via __qualname__). |
| scorer | Optional[Callable[[Any], EvaluationResult]] | No | Custom function to convert the evaluator's return value into an EvaluationResult. If not provided, uses the default scorer. |
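To make the role of the scorer parameter concrete, here is a minimal sketch of how a scorer could be applied to an evaluator function's raw return value. All names (run_with_scorer, fallback_scorer, letter_grade, percent_correct) are hypothetical, and the fallback here is a deliberately simplified stand-in for the default scorer.

```python
from typing import Any, Callable, Dict, Optional


def fallback_scorer(value: Any) -> Dict[str, Any]:
    """Simplified stand-in for a default scorer: treat the value as a number."""
    return {"score": float(value)}


def run_with_scorer(
    fn: Callable[..., Any],
    scorer: Optional[Callable[[Any], Dict[str, Any]]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Call fn, then normalize its raw result with scorer (or the fallback)."""
    raw = fn(**kwargs)
    return (scorer or fallback_scorer)(raw)


def letter_grade(value: int) -> Dict[str, Any]:
    """Custom scorer: map a 0-100 integer onto a score and a pass/fail label."""
    return {"score": value / 100, "label": "pass" if value >= 60 else "fail"}


def percent_correct(output: str) -> int:
    return 75  # fixed value, for illustration only


with_custom = run_with_scorer(percent_correct, scorer=letter_grade, output="x")
with_default = run_with_scorer(percent_correct, output="x")
```

The evaluator function stays free to return whatever shape is convenient; only the scorer needs to know how to turn that shape into a structured result.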
Inputs (Evaluator Function Parameters)
| Name | Type | Required | Description |
|---|---|---|---|
| output | Any | Auto-bound (default for single-arg) | The output produced by the experiment task. For single-argument evaluators, the parameter is bound to this value regardless of its name. |
| input | Mapping[str, Any] | Auto-bound | The input field of the dataset example. |
| expected | Mapping[str, Any] | Auto-bound | The expected or reference output from the dataset example. |
| reference | Mapping[str, Any] | Auto-bound | Alias for expected. |
| metadata | Mapping[str, Any] | Auto-bound | Metadata associated with the dataset example. |
| example | ExampleProxy | Auto-bound | The complete dataset Example object. |
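The name-based binding summarized in this table could look roughly like the sketch below. It is a simplified assumption, not the library's binding logic: bind_and_call is a hypothetical helper, and a plain dict stands in for the dataset example.

```python
import inspect
from typing import Any, Callable, Mapping


def bind_and_call(fn: Callable[..., Any], available: Mapping[str, Any]) -> Any:
    """Call fn with only the values its signature asks for."""
    params = inspect.signature(fn).parameters
    if len(params) == 1:
        # A lone parameter receives the task output, whatever it is called.
        return fn(available["output"])
    # Otherwise pass each recognized name; defaulted extras keep their defaults.
    kwargs = {name: available[name] for name in params if name in available}
    return fn(**kwargs)


# Plain dict standing in for a dataset example (hypothetical data).
example = {"input": {"question": "2+2?"}, "expected": {"answer": "4"}, "metadata": {}}
available = {
    "output": "4",
    "input": example["input"],
    "expected": example["expected"],
    "reference": example["expected"],  # alias for expected
    "metadata": example["metadata"],
    "example": example,
}


def exact_match(output, expected):
    return output == expected.get("answer")


def is_digit(text):  # single argument with an arbitrary name
    return text.isdigit()


matched = bind_and_call(exact_match, available)
digit = bind_and_call(is_digit, available)
```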
Outputs
| Name | Type | Description |
|---|---|---|
| Evaluator | Evaluator | An Evaluator instance with evaluate(**kwargs) and async_evaluate(**kwargs) methods that return EvaluationResult. |
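One way to expose both evaluate and async_evaluate for sync and async functions is sketched below. The class and function names are hypothetical, and running a coroutine from evaluate via asyncio.run is a choice made for this sketch; the library may handle that case differently.

```python
import asyncio
import inspect
from typing import Any, Callable


class SketchSyncEvaluator:
    """Wraps a regular function; the async method simply delegates."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self._fn = fn

    def evaluate(self, **kwargs: Any) -> Any:
        return self._fn(**kwargs)

    async def async_evaluate(self, **kwargs: Any) -> Any:
        return self.evaluate(**kwargs)


class SketchAsyncEvaluator:
    """Wraps a coroutine function."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self._fn = fn

    def evaluate(self, **kwargs: Any) -> Any:
        # Sketch-only choice: drive the coroutine on a fresh event loop.
        # This fails if called from inside an already running loop.
        return asyncio.run(self._fn(**kwargs))

    async def async_evaluate(self, **kwargs: Any) -> Any:
        return await self._fn(**kwargs)


def wrap(fn: Callable[..., Any]):
    """Pick the wrapper class based on whether fn is a coroutine function."""
    cls = SketchAsyncEvaluator if inspect.iscoroutinefunction(fn) else SketchSyncEvaluator
    return cls(fn)


def sync_length(output):
    return len(output)


async def async_double_length(output):
    return len(output) * 2


sync_ev = wrap(sync_length)
async_ev = wrap(async_double_length)
```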
EvaluationResult Type
| Field | Type | Description |
|---|---|---|
| score | Optional[float] | Numeric evaluation score. |
| label | Optional[str] | Categorical evaluation label. |
| explanation | Optional[str] | Human-readable explanation of the evaluation. |
| name | Optional[str] | Evaluator name (for multi-output evaluators). |
| metadata | Optional[Mapping[str, Any]] | Additional structured evaluation metadata. |
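For readers who prefer to see the fields as a type, the table above can be modeled as a TypedDict in which every key is optional. This is an assumption made for illustration (EvaluationResultSketch is a hypothetical name); the library's actual definition may differ.

```python
from typing import Any, Mapping, Optional, TypedDict


class EvaluationResultSketch(TypedDict, total=False):
    """Illustrative model of the fields above; total=False makes every key optional."""

    score: Optional[float]
    label: Optional[str]
    explanation: Optional[str]
    name: Optional[str]
    metadata: Optional[Mapping[str, Any]]


# Only the fields an evaluator has something to say about need to be present.
result: EvaluationResultSketch = {
    "score": 0.87,
    "label": "pass",
    "explanation": "Output closely matches the expected answer.",
}
```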
Default Scorer Conversions
| Return Type | Resulting EvaluationResult |
|---|---|
| bool | {score: 0.0 or 1.0, label: "False" or "True"} |
| int or float | {score: float(value)} |
| str | {label: value} |
| tuple (score, explanation) | {score: float(score), explanation: str(explanation)} |
| tuple (score, label, explanation) | {score: float(score), label: str(label), explanation: str(explanation)} |
| EvaluationResult dict | Passed through directly |
| EvaluationScore (phoenix-evals) | Converted to ExperimentEvaluation with score, label, explanation, name, metadata |
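The conversions can be approximated with a short function. This is a simplified sketch, not the library's default scorer: sketch_default_scorer is a hypothetical name, a two-element tuple is read as (score, explanation) to match the Tuple Evaluator example under Usage Examples, and the phoenix-evals EvaluationScore case is left out.

```python
from typing import Any, Dict


def sketch_default_scorer(value: Any) -> Dict[str, Any]:
    """Normalize common return types into a result dict (simplified sketch)."""
    if isinstance(value, dict):
        return value  # already a result dict: pass through
    if isinstance(value, bool):
        # Must be checked before int, because bool is a subclass of int.
        return {"score": float(value), "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": float(value)}
    if isinstance(value, str):
        return {"label": value}
    if isinstance(value, tuple) and len(value) == 3:
        score, label, explanation = value
        return {
            "score": float(score),
            "label": str(label),
            "explanation": str(explanation),
        }
    if isinstance(value, tuple) and len(value) == 2:
        # Assumption of this sketch: a pair is (score, explanation).
        score, explanation = value
        return {"score": float(score), "explanation": str(explanation)}
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")
```

The ordering of the checks matters: testing for bool after int would turn True into {score: 1.0} and silently drop the label.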
Usage Examples
Boolean Evaluator (Exact Match)
from phoenix.client.experiments import create_evaluator


@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    """Returns True if output matches expected answer exactly."""
    return output == expected.get("answer")

# Result: {score: 1.0, label: "True"} or {score: 0.0, label: "False"}
Numeric Scorer
from phoenix.client.experiments import create_evaluator


@create_evaluator(kind="CODE", name="length-score")
def length_score(output):
    """Scores output by its length (longer is better, up to 100)."""
    if not isinstance(output, str):
        return 0.0
    return min(len(output) / 100.0, 1.0)

# Result: {score: 0.45} (for a 45-character output)
LLM-Based Evaluator
import openai

from phoenix.client.experiments import create_evaluator

llm_client = openai.OpenAI()


@create_evaluator(kind="LLM", name="relevance")
def relevance(output, input):
    """Uses an LLM to judge whether the output is relevant to the input."""
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate the relevance of this answer to the question on a scale of 0-1.\n"
                    f"Question: {input['question']}\n"
                    f"Answer: {output}\n"
                    "Return only a number."
                ),
            },
        ],
    )
    # Assumes the model replies with a bare number; float() raises otherwise.
    return float(response.choices[0].message.content.strip())

# Result: {score: 0.85}
Tuple Evaluator (Score + Explanation)
from phoenix.client.experiments import create_evaluator


@create_evaluator(kind="CODE", name="levenshtein-distance")
def levenshtein_eval(output, expected):
    """Computes normalized Levenshtein similarity and returns it with an explanation."""
    from textdistance import levenshtein

    expected_text = expected.get("answer", "")
    # normalized_similarity lies in [0, 1]; 1.0 means the strings are identical.
    similarity = levenshtein.normalized_similarity(str(output), expected_text)
    return (
        similarity,
        f"Levenshtein similarity between output and expected: {similarity:.2f}",
    )

# Result: {score: 0.87, explanation: "Levenshtein similarity between output and expected: 0.87"}
Custom Scorer
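import math
from collections import Counter

from phoenix.client.experiments import create_evaluator


def compute_similarity(a, b):
    """Placeholder helper, not part of Phoenix: bag-of-words cosine similarity."""
    va, vb = Counter(str(a).lower().split()), Counter(str(b).lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def custom_scorer(result):
    """Convert a custom result format into EvaluationResult."""
    return {
        "score": result["numeric_score"],
        "label": "pass" if result["numeric_score"] > 0.5 else "fail",
        "explanation": result["reasoning"],
    }


@create_evaluator(kind="CODE", name="custom-eval", scorer=custom_scorer)
def custom_evaluator(output, expected):
    """Returns a custom dict that gets processed by the scorer."""
    similarity = compute_similarity(output, expected.get("answer", ""))
    return {
        "numeric_score": similarity,
        "reasoning": f"Cosine similarity: {similarity:.3f}",
    }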
Async Evaluator
import openai

from phoenix.client.experiments import create_evaluator

async_llm = openai.AsyncOpenAI()


@create_evaluator(kind="LLM", name="async-quality")
async def quality_evaluator(output, input):
    """Async evaluator for concurrent evaluation execution."""
    response = await async_llm.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"Rate quality 0-1: Q={input['question']} A={output}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())
Using Evaluators with Experiments
from phoenix.client import Client
from phoenix.client.experiments import run_experiment, create_evaluator

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")


def my_task(input):
    return f"Answer: {input['question']}"


@create_evaluator(kind="CODE", name="has-answer")
def has_answer(output):
    return isinstance(output, str) and len(output) > 0


@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    return output == expected.get("answer")


experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[has_answer, exact_match],
    experiment_name="evaluated-experiment",
)
Related Pages
Implements Principle