Implementation:Arize AI Phoenix ClassificationEvaluator Create

From Leeroopedia
Domains LLM Evaluation, Evaluator Architecture, Classification
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tools provided by arize-phoenix-evals for defining and instantiating LLM-based and code-based evaluators.

Description

This implementation covers the primary APIs for creating evaluation criteria in Phoenix:

  • ClassificationEvaluator -- an LLM-based evaluator that constrains the model to select from a declared set of classification choices, optionally mapping each label to a numeric score and requesting an explanation.
  • create_evaluator() -- a decorator that turns any Python function into a fully featured Evaluator instance with automatic input schema generation and return-value-to-Score conversion.
  • create_classifier() -- a factory function that constructs a ClassificationEvaluator in a single call.
  • bind_evaluator() -- a helper that binds an evaluator with a fixed input mapping so that data columns with different names can be routed to the evaluator's expected fields.

Together, these APIs enable teams to build LLM-judged classification evaluators and deterministic code-based evaluators, and to adapt either kind to arbitrary data schemas without modifying the evaluator's core logic.
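
The decorator behavior described above can be illustrated with a plain-Python sketch. This is an illustration of the pattern only, not Phoenix's actual implementation; `SimpleScore` and `make_evaluator` are hypothetical names standing in for Phoenix's `Score` and `create_evaluator`:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SimpleScore:
    """Hypothetical stand-in for Phoenix's Score object."""
    name: str
    score: float


def make_evaluator(name: str) -> Callable:
    """Sketch of a create_evaluator-style decorator: it leaves the
    function directly callable while also exposing an evaluate()
    entry point that accepts a record dict and converts the return
    value into a list of Score-like objects."""
    def decorator(fn: Callable[..., Any]):
        def evaluate(record: Dict[str, Any]) -> List[SimpleScore]:
            # Route record fields to the function's parameters by name.
            result = fn(**record)
            return [SimpleScore(name=name, score=float(result))]
        fn.evaluate = evaluate  # attach evaluator-style entry point
        return fn
    return decorator


@make_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())


print(word_count(text="Hello world"))                # direct call -> 2
print(word_count.evaluate({"text": "Hello world"}))  # [SimpleScore(name='word_count', score=2.0)]
```

The real decorator additionally generates an input schema and supports richer return types; see the Code Reference below for the source lines.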

Usage

Use these APIs when you need to:

  • Build a classification evaluator that asks an LLM to choose among predefined labels (e.g., sentiment, relevance, toxicity).
  • Create a code-based evaluator from a plain function for deterministic checks (e.g., word count, regex validation, precision/recall).
  • Map data columns to evaluator input fields when your DataFrame schema does not match the evaluator's expected variable names.

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-evals/src/phoenix/evals/evaluators.py
    • Evaluator base: lines 278-480
    • LLMEvaluator: lines 484-566
    • ClassificationEvaluator: lines 570-794
    • create_evaluator: lines 797-1097
    • create_classifier: lines 1101-1186
    • bind_evaluator: lines 1191-1282

Signature: ClassificationEvaluator

class ClassificationEvaluator(LLMEvaluator):
    def __init__(
        self,
        *,
        name: str,
        llm: LLM,
        prompt_template: Union[PromptLike, PromptTemplate, Template],
        choices: Union[
            List[str],
            Dict[str, Union[float, int]],
            Dict[str, Tuple[Union[float, int], str]],
        ],
        include_explanation: bool = True,
        input_schema: Optional[type[BaseModel]] = None,
        direction: DirectionType = "maximize",
        **kwargs: Any,
    ) -> None

Signature: create_evaluator

def create_evaluator(
    name: str,
    source: Optional[KindType] = None,
    direction: DirectionType = "maximize",
    kind: Optional[KindType] = None,
) -> Callable[[Callable[..., Any]], Evaluator]

Signature: create_classifier

def create_classifier(
    name: str,
    prompt_template: str,
    llm: LLM,
    choices: Union[
        List[str],
        Dict[str, Union[float, int]],
        Dict[str, Tuple[Union[float, int], str]],
    ],
    direction: DirectionType = "maximize",
) -> ClassificationEvaluator

Signature: bind_evaluator

def bind_evaluator(
    evaluator: Evaluator,
    input_mapping: InputMappingType,
) -> Evaluator

Import

from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    create_classifier,
    bind_evaluator,
)

I/O Contract

ClassificationEvaluator Inputs

Name Type Required Description
name str Yes Identifier for the evaluator; also used as the Score.name.
llm LLM Yes An initialized LLM instance with tool-calling or structured output support.
prompt_template Union[PromptLike, PromptTemplate, Template] Yes Prompt with placeholder variables (e.g., {text}, {question}) that are filled from the input record.
choices Union[List[str], Dict[str, Union[float, int]], Dict[str, Tuple[Union[float, int], str]]] Yes Classification labels. May be a list of strings, a dict mapping labels to numeric scores, or a dict mapping labels to (score, description) tuples.
include_explanation bool No (default True) Whether to request the LLM to provide reasoning with its classification.
input_schema Optional[type[BaseModel]] No Pydantic model for explicit input validation. If omitted, a model is dynamically generated from prompt template variables.
direction DirectionType No (default "maximize") Score optimization direction: "maximize" or "minimize".
**kwargs Any No Invocation parameters forwarded to the LLM client during generation.
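
When input_schema is omitted, the evaluator derives its required inputs from the prompt template's placeholder variables. A minimal standard-library sketch of extracting those placeholder names (illustrative only; Phoenix's actual template parsing may differ):

```python
from string import Formatter


def template_variables(template: str) -> list:
    """Collect the named placeholders (e.g. {question}, {answer})
    appearing in a str.format-style prompt template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


template = "Question: {question}\nAnswer: {answer}"
print(template_variables(template))  # ['question', 'answer']
```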

create_evaluator Inputs

Name Type Required Description
name str Yes Identifier for the evaluator and the produced Scores.
kind Optional[KindType] No (default "code") Kind of evaluator: "human", "llm", or "code".
direction DirectionType No (default "maximize") Score optimization direction.

bind_evaluator Inputs

Name Type Required Description
evaluator Evaluator Yes The evaluator instance to bind with a mapping.
input_mapping InputMappingType Yes A dictionary mapping evaluator field names to data field names (strings) or callable transformations.
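
The input_mapping values can be plain strings (column renames) or callables (transformations), as the table notes. A plain-Python sketch of how such a mapping might be applied to a record (illustrative only, not Phoenix's internal logic; `apply_input_mapping` is a hypothetical helper):

```python
from typing import Any, Callable, Dict, Union

MappingValue = Union[str, Callable[[Dict[str, Any]], Any]]


def apply_input_mapping(
    record: Dict[str, Any],
    input_mapping: Dict[str, MappingValue],
) -> Dict[str, Any]:
    """For each evaluator field, either rename a record key (str spec)
    or compute the value from the whole record (callable spec)."""
    mapped = {}
    for field, spec in input_mapping.items():
        mapped[field] = spec(record) if callable(spec) else record[spec]
    return mapped


record = {"answer": "Paris is the capital of France.", "meta": {"lang": "en"}}
print(apply_input_mapping(
    record,
    {
        "response": "answer",                     # simple rename
        "language": lambda r: r["meta"]["lang"],  # computed field
    },
))
# {'response': 'Paris is the capital of France.', 'language': 'en'}
```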

Outputs

API Return Type Description
ClassificationEvaluator.__init__ ClassificationEvaluator An evaluator instance with evaluate() and async_evaluate() methods returning List[Score].
create_evaluator()(fn) Evaluator A decorated function wrapped as an Evaluator with evaluate(), async_evaluate(), and direct __call__.
create_classifier() ClassificationEvaluator Same as constructing ClassificationEvaluator directly.
bind_evaluator() Evaluator A shallow copy of the evaluator with the input mapping bound.
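
The return-value-to-Score conversion mentioned in the Outputs table can be sketched in plain Python. The mapping below is one plausible rule set for illustration; the exact conversion rules live in evaluators.py at the source lines listed above:

```python
from typing import Any, Dict


def to_score_fields(value: Any) -> Dict[str, Any]:
    """Illustrative conversion of a decorated function's return value
    into Score fields: numbers become scores, strings become labels,
    booleans become both."""
    if isinstance(value, bool):  # check bool before int/float (bool is an int subclass)
        return {"score": 1.0 if value else 0.0, "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": float(value), "label": None}
    if isinstance(value, str):
        return {"score": None, "label": value}
    raise TypeError(f"Unsupported return type: {type(value).__name__}")


print(to_score_fields(2))           # {'score': 2.0, 'label': None}
print(to_score_fields("positive"))  # {'score': None, 'label': 'positive'}
print(to_score_fields(True))        # {'score': 1.0, 'label': 'True'}
```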

Usage Examples

LLM Classification with Label-to-Score Mapping

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=(
        "Given the following question and answer, rate the relevance.\n"
        "Question: {question}\n"
        "Answer: {answer}"
    ),
    choices={
        "highly_relevant": 1.0,
        "somewhat_relevant": 0.5,
        "not_relevant": 0.0,
    },
    include_explanation=True,
)

result = evaluator.evaluate({
    "question": "What is the capital of France?",
    "answer": "Paris is the capital city of France.",
})
print(result[0].label)        # "highly_relevant"
print(result[0].score)        # 1.0
print(result[0].explanation)  # LLM reasoning

Code-Based Evaluator with create_evaluator

from phoenix.evals import create_evaluator

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

# As an Evaluator
result = word_count.evaluate({"text": "Hello world"})
print(result[0].score)  # 2

# Direct function call still works
print(word_count(text="Hello world"))  # 2

Quick Classifier via Factory

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4o")

sentiment = create_classifier(
    name="sentiment",
    prompt_template="Classify the sentiment of: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"],
)

result = sentiment.evaluate({"text": "I love this product!"})
print(result[0].label)  # "positive"

Binding Input Mappings

from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="response_length")
def response_length(response: str) -> int:
    return len(response)

# DataFrame has column "answer" but evaluator expects "response"
bound = bind_evaluator(
    evaluator=response_length,
    input_mapping={"response": "answer"},
)

result = bound.evaluate({"answer": "Paris is the capital of France."})
print(result[0].score)  # 31

Classification with Descriptions (Advanced)

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="factual_accuracy",
    llm=llm,
    prompt_template="Evaluate the factual accuracy of: {claim}",
    choices={
        "accurate": (1.0, "Factually correct information"),
        "partially_accurate": (0.5, "Some correct, some incorrect information"),
        "inaccurate": (0.0, "Factually incorrect information"),
    },
)

result = evaluator.evaluate({"claim": "The Earth orbits the Sun."})
print(result[0].label)  # "accurate"
print(result[0].score)  # 1.0

Related Pages

Implements Principle

Requires Environment
