Implementation:Arize AI Phoenix ClassificationEvaluator Create

From Leeroopedia
Domains LLM Evaluation, Evaluator Architecture, Classification
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tools provided by arize-phoenix-evals for defining and instantiating LLM-based and code-based evaluators.

Description

This implementation covers the primary APIs for creating evaluation criteria in Phoenix:

  • ClassificationEvaluator -- an LLM-based evaluator that constrains the model to select from a declared set of classification choices, optionally mapping each label to a numeric score and requesting an explanation.
  • create_evaluator() -- a decorator that turns any Python function into a fully featured Evaluator instance with automatic input schema generation and return-value-to-Score conversion.
  • create_classifier() -- a factory function that constructs a ClassificationEvaluator in a single call.
  • bind_evaluator() -- a helper that binds an evaluator with a fixed input mapping so that data columns with different names can be routed to the evaluator's expected fields.

Together, these APIs enable teams to build LLM-judged classification evaluators and deterministic code-based evaluators, and to adapt either kind to arbitrary data schemas without modifying the evaluator's core logic.
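
The decorator behavior described above can be illustrated with a plain-Python sketch. This is an illustration of the pattern only, not Phoenix's actual implementation; `SimpleScore` and `make_evaluator` are hypothetical names standing in for Phoenix's `Score` and `create_evaluator`:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SimpleScore:
    """Hypothetical stand-in for Phoenix's Score object."""
    name: str
    score: float


def make_evaluator(name: str) -> Callable:
    """Sketch of a create_evaluator-style decorator: it leaves the
    function directly callable while also exposing an evaluate()
    entry point that accepts a record dict and converts the return
    value into a list of Score-like objects."""
    def decorator(fn: Callable[..., Any]):
        def evaluate(record: Dict[str, Any]) -> List[SimpleScore]:
            # Route record fields to the function's parameters by name.
            result = fn(**record)
            return [SimpleScore(name=name, score=float(result))]
        fn.evaluate = evaluate  # attach evaluator-style entry point
        return fn
    return decorator


@make_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())


print(word_count(text="Hello world"))                # direct call -> 2
print(word_count.evaluate({"text": "Hello world"}))  # [SimpleScore(name='word_count', score=2.0)]
```

The real decorator additionally generates an input schema and supports richer return types; see the Code Reference below for the source lines.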

Usage

Use these APIs when you need to:

  • Build a classification evaluator that asks an LLM to choose among predefined labels (e.g., sentiment, relevance, toxicity).
  • Create a code-based evaluator from a plain function for deterministic checks (e.g., word count, regex validation, precision/recall).
  • Map data columns to evaluator input fields when your DataFrame schema does not match the evaluator's expected variable names.

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-evals/src/phoenix/evals/evaluators.py
    • Evaluator base: lines 278-480
    • LLMEvaluator: lines 484-566
    • ClassificationEvaluator: lines 570-794
    • create_evaluator: lines 797-1097
    • create_classifier: lines 1101-1186
    • bind_evaluator: lines 1191-1282

Signature: ClassificationEvaluator

class ClassificationEvaluator(LLMEvaluator):
    def __init__(
        self,
        *,
        name: str,
        llm: LLM,
        prompt_template: Union[PromptLike, PromptTemplate, Template],
        choices: Union[
            List[str],
            Dict[str, Union[float, int]],
            Dict[str, Tuple[Union[float, int], str]],
        ],
        include_explanation: bool = True,
        input_schema: Optional[type[BaseModel]] = None,
        direction: DirectionType = "maximize",
        **kwargs: Any,
    ) -> None

Signature: create_evaluator

def create_evaluator(
    name: str,
    source: Optional[KindType] = None,
    direction: DirectionType = "maximize",
    kind: Optional[KindType] = None,
) -> Callable[[Callable[..., Any]], Evaluator]

Signature: create_classifier

def create_classifier(
    name: str,
    prompt_template: str,
    llm: LLM,
    choices: Union[
        List[str],
        Dict[str, Union[float, int]],
        Dict[str, Tuple[Union[float, int], str]],
    ],
    direction: DirectionType = "maximize",
) -> ClassificationEvaluator

Signature: bind_evaluator

def bind_evaluator(
    evaluator: Evaluator,
    input_mapping: InputMappingType,
) -> Evaluator

Import

from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    create_classifier,
    bind_evaluator,
)

I/O Contract

ClassificationEvaluator Inputs

Name Type Required Description
name str Yes Identifier for the evaluator; also used as the Score.name.
llm LLM Yes An initialized LLM instance with tool-calling or structured output support.
prompt_template Union[PromptLike, PromptTemplate, Template] Yes Prompt with placeholder variables (e.g., {text}, {question}) that are filled from the input record.
choices Union[List[str], Dict[str, Union[float, int]], Dict[str, Tuple[Union[float, int], str]]] Yes Classification labels. May be a list of strings, a dict mapping labels to numeric scores, or a dict mapping labels to (score, description) tuples.
include_explanation bool No (default True) Whether to request the LLM to provide reasoning with its classification.
input_schema Optional[type[BaseModel]] No Pydantic model for explicit input validation. If omitted, a model is dynamically generated from prompt template variables.
direction DirectionType No (default "maximize") Score optimization direction: "maximize" or "minimize".
**kwargs Any No Invocation parameters forwarded to the LLM client during generation.
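
When input_schema is omitted, the evaluator derives its required inputs from the prompt template's placeholder variables. A minimal standard-library sketch of extracting those placeholder names (illustrative only; Phoenix's actual template parsing may differ):

```python
from string import Formatter


def template_variables(template: str) -> list:
    """Collect the named placeholders (e.g. {question}, {answer})
    appearing in a str.format-style prompt template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


template = "Question: {question}\nAnswer: {answer}"
print(template_variables(template))  # ['question', 'answer']
```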

create_evaluator Inputs

Name Type Required Description
name str Yes Identifier for the evaluator and the produced Scores.
kind Optional[KindType] No (default "code") Kind of evaluator: "human", "llm", or "code".
direction DirectionType No (default "maximize") Score optimization direction.

bind_evaluator Inputs

Name Type Required Description
evaluator Evaluator Yes The evaluator instance to bind with a mapping.
input_mapping InputMappingType Yes A dictionary mapping evaluator field names to data field names (strings) or callable transformations.
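
The input_mapping values can be plain strings (column renames) or callables (transformations), as the table notes. A plain-Python sketch of how such a mapping might be applied to a record (illustrative only, not Phoenix's internal logic; `apply_input_mapping` is a hypothetical helper):

```python
from typing import Any, Callable, Dict, Union

MappingValue = Union[str, Callable[[Dict[str, Any]], Any]]


def apply_input_mapping(
    record: Dict[str, Any],
    input_mapping: Dict[str, MappingValue],
) -> Dict[str, Any]:
    """For each evaluator field, either rename a record key (str spec)
    or compute the value from the whole record (callable spec)."""
    mapped = {}
    for field, spec in input_mapping.items():
        mapped[field] = spec(record) if callable(spec) else record[spec]
    return mapped


record = {"answer": "Paris is the capital of France.", "meta": {"lang": "en"}}
print(apply_input_mapping(
    record,
    {
        "response": "answer",                     # simple rename
        "language": lambda r: r["meta"]["lang"],  # computed field
    },
))
# {'response': 'Paris is the capital of France.', 'language': 'en'}
```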

Outputs

API Return Type Description
ClassificationEvaluator.__init__ ClassificationEvaluator An evaluator instance with evaluate() and async_evaluate() methods returning List[Score].
create_evaluator()(fn) Evaluator A decorated function wrapped as an Evaluator with evaluate(), async_evaluate(), and direct __call__.
create_classifier() ClassificationEvaluator Same as constructing ClassificationEvaluator directly.
bind_evaluator() Evaluator A shallow copy of the evaluator with the input mapping bound.
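
The return-value-to-Score conversion mentioned in the Outputs table can be sketched in plain Python. The mapping below is one plausible rule set for illustration; the exact conversion rules live in evaluators.py at the source lines listed above:

```python
from typing import Any, Dict


def to_score_fields(value: Any) -> Dict[str, Any]:
    """Illustrative conversion of a decorated function's return value
    into Score fields: numbers become scores, strings become labels,
    booleans become both."""
    if isinstance(value, bool):  # check bool before int/float (bool is an int subclass)
        return {"score": 1.0 if value else 0.0, "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": float(value), "label": None}
    if isinstance(value, str):
        return {"score": None, "label": value}
    raise TypeError(f"Unsupported return type: {type(value).__name__}")


print(to_score_fields(2))           # {'score': 2.0, 'label': None}
print(to_score_fields("positive"))  # {'score': None, 'label': 'positive'}
print(to_score_fields(True))        # {'score': 1.0, 'label': 'True'}
```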

Usage Examples

LLM Classification with Label-to-Score Mapping

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=(
        "Given the following question and answer, rate the relevance.\n"
        "Question: {question}\n"
        "Answer: {answer}"
    ),
    choices={
        "highly_relevant": 1.0,
        "somewhat_relevant": 0.5,
        "not_relevant": 0.0,
    },
    include_explanation=True,
)

result = evaluator.evaluate({
    "question": "What is the capital of France?",
    "answer": "Paris is the capital city of France.",
})
print(result[0].label)        # "highly_relevant"
print(result[0].score)        # 1.0
print(result[0].explanation)  # LLM reasoning

Code-Based Evaluator with create_evaluator

from phoenix.evals import create_evaluator

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

# As an Evaluator
result = word_count.evaluate({"text": "Hello world"})
print(result[0].score)  # 2

# Direct function call still works
print(word_count(text="Hello world"))  # 2

Quick Classifier via Factory

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4o")

sentiment = create_classifier(
    name="sentiment",
    prompt_template="Classify the sentiment of: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"],
)

result = sentiment.evaluate({"text": "I love this product!"})
print(result[0].label)  # "positive"

Binding Input Mappings

from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="response_length")
def response_length(response: str) -> int:
    return len(response)

# DataFrame has column "answer" but evaluator expects "response"
bound = bind_evaluator(
    evaluator=response_length,
    input_mapping={"response": "answer"},
)

result = bound.evaluate({"answer": "Paris is the capital of France."})
print(result[0].score)  # 31

Classification with Descriptions (Advanced)

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="factual_accuracy",
    llm=llm,
    prompt_template="Evaluate the factual accuracy of: {claim}",
    choices={
        "accurate": (1.0, "Factually correct information"),
        "partially_accurate": (0.5, "Some correct, some incorrect information"),
        "inaccurate": (0.0, "Factually incorrect information"),
    },
)

result = evaluator.evaluate({"claim": "The Earth orbits the Sun."})
print(result[0].label)  # "accurate"
print(result[0].score)  # 1.0

Related Pages

Implements Principle

Requires Environment
