Implementation:Arize ai Phoenix Experiment Types

Knowledge Sources	Arize_ai_Phoenix
Domains	AI_Observability, Client_SDK, Experiments
Last Updated	2026-02-14 05:30 GMT

Overview

Type definitions and protocols for experiments, evaluators, test cases, and evaluation results within the Phoenix client experiment framework.

Description

The Experiment Types module defines the complete type system for running and evaluating experiments in the Phoenix client. It bridges auto-generated API types with user-facing protocols and data classes.

Core Type Aliases:

TaskOutput -- JSON-serializable output from experiment tasks (Optional[Union[dict, list, str, int, float, bool]]).
ExampleInput, ExampleOutput, ExampleMetadata -- Mapping types for dataset example fields.
Score, Label, Explanation -- Evaluation result primitives.
Experiment, ExperimentRun -- Re-exported from the auto-generated v1 module.
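As a rough illustration of how aliases like these read in user code, the sketch below declares stand-in aliases and annotates a task with them. The alias bodies other than TaskOutput are assumptions (the page describes them only as mapping types and result primitives), and answer_task is a hypothetical task function.

```python
from typing import Any, Mapping, Optional, Union

# Stand-in aliases; only TaskOutput's shape is taken from the description above.
TaskOutput = Optional[Union[dict, list, str, int, float, bool]]
ExampleInput = Mapping[str, Any]     # assumption: exact value type not stated
ExampleOutput = Mapping[str, Any]    # assumption
ExampleMetadata = Mapping[str, Any]  # assumption
Score = Optional[float]              # assumption
Label = Optional[str]                # assumption
Explanation = Optional[str]          # assumption


def answer_task(input: ExampleInput) -> TaskOutput:
    """Hypothetical task: multiply two numbers and return a JSON-serializable dict."""
    return {"answer": str(input["a"] * input["b"])}


output = answer_task({"a": 6, "b": 7})
```

The aliases add no runtime behavior. Their value is that a static type checker can flag a task that returns something outside the declared shape.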

Data Classes and TypedDicts:

ExperimentEvaluation -- Extends the auto-generated ExperimentEvaluationResult TypedDict with optional name and metadata fields. Represents a single evaluation result with optional label, score, and explanation.
TestCase -- A frozen dataclass pairing a DatasetExample with a repetition_number, representing a single task invocation unit.
ExperimentEvaluationRun -- A frozen dataclass capturing the full lifecycle of an evaluation run, including timing (start_time, end_time), annotator metadata, optional trace ID, and either a result or error. IDs are auto-generated with a DRY_RUN_ prefix by default.
RanExperiment -- A TypedDict representing a completed experiment, containing the experiment ID, dataset metadata, task runs, evaluation runs, and optional project name. Used as input to evaluate_experiment for adding evaluations to a previously run experiment.
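The frozen-dataclass pattern with a generated default ID can be sketched as follows. EvaluationRunSketch is a simplified stand-in, not the real ExperimentEvaluationRun, and the body of _dry_run_id is an assumption: the page states only that default IDs carry a DRY_RUN_ prefix.

```python
import secrets
from dataclasses import FrozenInstanceError, dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def _dry_run_id() -> str:
    # Assumption: any random suffix works; only the prefix comes from the page.
    return "DRY_RUN_" + secrets.token_hex(8)


@dataclass(frozen=True)
class EvaluationRunSketch:
    """Simplified stand-in for ExperimentEvaluationRun."""

    experiment_run_id: str
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[Mapping[str, Any]] = None
    id: str = field(default_factory=_dry_run_id)


now = datetime.now(timezone.utc)
run = EvaluationRunSketch(
    experiment_run_id="run-1",
    start_time=now,
    end_time=now,
    name="exactness",
    annotator_kind="CODE",
    result={"score": 1.0},
)

# frozen=True turns attribute assignment into an error.
try:
    run.name = "other"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Using default_factory rather than a plain default means each instance receives its own freshly generated ID.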

Protocols:

Evaluator -- A runtime_checkable Protocol requiring name and kind properties plus evaluate() and async_evaluate() methods. Both methods accept keyword arguments output, expected, metadata, and input along with **kwargs.
EvalsEvaluator -- A Protocol for backward compatibility with the phoenix-evals package, requiring evaluate() and async_evaluate() methods and the attributes input_schema, direction, source, and name.
EvaluationScore -- A Protocol for individual score results from the evals package.
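To illustrate what runtime_checkable buys, the sketch below defines a protocol with the same member names as Evaluator and checks an unrelated class against it. EvaluatorSketch and LengthEvaluator are hypothetical names, and the result type is simplified to a plain mapping.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class EvaluatorSketch(Protocol):
    """Stand-in with the member names listed for the Evaluator protocol."""

    @property
    def name(self) -> str: ...

    @property
    def kind(self) -> str: ...

    def evaluate(self, *, output: Any = None, **kwargs: Any) -> Mapping[str, Any]: ...

    async def async_evaluate(self, *, output: Any = None, **kwargs: Any) -> Mapping[str, Any]: ...


class LengthEvaluator:
    """Hypothetical evaluator that does not inherit from the protocol."""

    name = "length"
    kind = "CODE"

    def evaluate(self, *, output: Any = None, **kwargs: Any) -> Mapping[str, Any]:
        return {"score": float(len(str(output)))}

    async def async_evaluate(self, *, output: Any = None, **kwargs: Any) -> Mapping[str, Any]:
        return self.evaluate(output=output, **kwargs)


conforms = isinstance(LengthEvaluator(), EvaluatorSketch)
rejects_plain_object = not isinstance(object(), EvaluatorSketch)
```

Note that a runtime isinstance check against a protocol only verifies that the members exist. It does not inspect their signatures or return types.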

Abstract Base Class:

BaseEvaluator -- An ABC implementing Evaluator with default behavior. Subclasses must implement at least one of evaluate() or async_evaluate(). The __init_subclass__ hook validates evaluator method signatures at class definition time, ensuring they accept **kwargs and use valid parameter names.
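A minimal sketch of definition-time signature validation with __init_subclass__ is shown below. It is not the Phoenix implementation: CheckedEvaluator, ALLOWED_PARAMS, and the error messages are hypothetical, and the allowed names are assumed to be the four keyword arguments of the Evaluator protocol.

```python
import inspect
from abc import ABC
from typing import Any

# Assumption: valid parameter names are the protocol's keyword arguments.
ALLOWED_PARAMS = frozenset({"output", "expected", "metadata", "input"})


class CheckedEvaluator(ABC):
    """Hypothetical base class that validates subclass method signatures."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        for method_name in ("evaluate", "async_evaluate"):
            method = cls.__dict__.get(method_name)
            if method is None:
                continue  # this subclass does not override the method
            params = inspect.signature(method).parameters
            if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
                raise TypeError(f"{cls.__name__}.{method_name} must accept **kwargs")
            named = {
                name
                for name, p in params.items()
                if name != "self" and p.kind is not inspect.Parameter.VAR_KEYWORD
            }
            unknown = named - ALLOWED_PARAMS
            if unknown:
                raise TypeError(
                    f"{cls.__name__}.{method_name} has unsupported parameters: {sorted(unknown)}"
                )


class Good(CheckedEvaluator):
    def evaluate(self, *, output=None, expected=None, **kwargs):
        return {"score": float(output == expected)}


# The class statement itself fails because evaluate() lacks **kwargs.
try:
    class Bad(CheckedEvaluator):
        def evaluate(self, *, output=None):
            return {"score": 0.0}

    definition_error = None
except TypeError as exc:
    definition_error = str(exc)
```

Failing at class definition time surfaces a malformed evaluator when the module is imported, rather than partway through an experiment.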

Proxy:

ExampleProxy -- An immutable Mapping[str, Any] proxy that wraps a v1.DatasetExample TypedDict to provide backward-compatible attribute access (e.g., example.input) while preserving dictionary-style access. It converts updated_at from string to datetime.
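The proxy idea can be sketched with a small read-only Mapping subclass. ExampleProxySketch is a hypothetical stand-in, and the timestamp parsing is an assumption. In particular, this sketch converts updated_at only on attribute access, which may not match what the real class does for dictionary-style access.

```python
from datetime import datetime, timezone
from typing import Any, Iterator, Mapping


class ExampleProxySketch(Mapping[str, Any]):
    """Hypothetical read-only wrapper offering both key and attribute access."""

    def __init__(self, wrapped: Mapping[str, Any]) -> None:
        self._wrapped = dict(wrapped)  # private copy; no mutating methods exposed

    def __getitem__(self, key: str) -> Any:
        return self._wrapped[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._wrapped)

    def __len__(self) -> int:
        return len(self._wrapped)

    @property
    def updated_at(self) -> datetime:
        # Assumption: the stored value is an ISO 8601 string, possibly ending in "Z".
        return datetime.fromisoformat(self._wrapped["updated_at"].replace("Z", "+00:00"))

    def __getattr__(self, name: str) -> Any:
        # Called only when normal lookup fails; skip private names to avoid recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._wrapped[name]
        except KeyError:
            raise AttributeError(name) from None


proxy = ExampleProxySketch(
    {
        "id": "ex1",
        "input": {"question": "What is 6*7?"},
        "output": {"answer": "42"},
        "metadata": {},
        "updated_at": "2025-01-01T00:00:00Z",
    }
)
```

Because Mapping supplies keys(), items(), get(), and containment checks from the three abstract methods, the wrapper behaves like a read-only dict in existing code.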

Usage

Use these types when implementing custom experiment tasks and evaluators. Subclass BaseEvaluator for structured evaluators, or implement the Evaluator protocol directly. The type aliases document the expected data shapes and let static type checkers verify experiment creation, execution, and evaluation code.

Code Reference

Source Location

Repository: Arize_ai_Phoenix
File: packages/phoenix-client/src/phoenix/client/resources/experiments/types.py

Signature

class ExperimentEvaluation(v1.ExperimentEvaluationResult, total=False):
    name: Optional[str]
    metadata: Mapping[str, Any]

@dataclass(frozen=True)
class TestCase:
    example: v1.DatasetExample
    repetition_number: RepetitionNumber

@dataclass(frozen=True)
class ExperimentEvaluationRun:
    experiment_run_id: ExperimentRunId
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[EvaluationResult] = None
    id: str = field(default_factory=_dry_run_id)
    trace_id: Optional[TraceId] = None
    metadata: Mapping[str, JSONSerializable] = field(default_factory=dict)

@runtime_checkable
class Evaluator(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def kind(self) -> str: ...
    def evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...
    async def async_evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...

class BaseEvaluator(ABC, Evaluator):
    _kind: AnnotatorKind
    _name: EvaluatorName

class RanExperiment(TypedDict):
    experiment_id: ExperimentId
    dataset_id: DatasetId
    dataset_version_id: DatasetVersionId
    task_runs: list[ExperimentRun]
    evaluation_runs: list[ExperimentEvaluationRun]
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]

class ExampleProxy(Mapping[str, Any]):
    def __init__(self, wrapped: v1.DatasetExample) -> None: ...

Import

from phoenix.client.resources.experiments.types import (
    ExperimentEvaluation,
    TestCase,
    ExperimentEvaluationRun,
    BaseEvaluator,
    Evaluator,
    ExampleProxy,
    RanExperiment,
)

I/O Contract

Evaluator Protocol

Method	Parameters	Returns	Description
`evaluate()`	`output: Optional[TaskOutput]`, `expected: Optional[ExampleOutput]`, `metadata: ExampleMetadata`, `input: ExampleInput`, `**kwargs`	`EvaluationResult`	Synchronous evaluation
`async_evaluate()`	(same as evaluate)	`EvaluationResult`	Asynchronous evaluation

EvaluationResult

Field	Type	Required	Description
label	`str`	No	Categorical label for the evaluation
score	`float`	No	Numeric score
explanation	`str`	No	Free-text explanation
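Since every field is optional, a result of this shape can be modeled as a TypedDict declared with total=False, and extended the way ExperimentEvaluation extends it. The class names below are stand-ins, and the Optional[...] field types are an assumption based on the table above.

```python
from typing import Any, Mapping, Optional, TypedDict


class EvaluationResultSketch(TypedDict, total=False):
    """Stand-in for the result shape; every key may be omitted."""

    label: Optional[str]
    score: Optional[float]
    explanation: Optional[str]


class ExperimentEvaluationSketch(EvaluationResultSketch, total=False):
    """Stand-in for the extension that adds name and metadata."""

    name: Optional[str]
    metadata: Mapping[str, Any]


# At runtime a TypedDict value is a plain dict, so partial results are valid.
partial: ExperimentEvaluationSketch = {"score": 0.5}
full: ExperimentEvaluationSketch = {
    "label": "correct",
    "score": 1.0,
    "explanation": "Output matched the expected answer.",
    "name": "exactness",
    "metadata": {"source": "unit-test"},
}
```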

ExperimentEvaluationRun

Field	Type	Required	Description
experiment_run_id	`str`	Yes	ID of the experiment run being evaluated
start_time	`datetime`	Yes	When evaluation started
end_time	`datetime`	Yes	When evaluation finished
name	`str`	Yes	Name of the evaluator
annotator_kind	`str`	Yes	`"CODE"` or `"LLM"`
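result	`Optional[EvaluationResult]`	No	Evaluation result (required if error is None)
error	`Optional[str]`	No	Error message (required if result is None)
id	`str`	No	Run ID; defaults to a generated value with a `DRY_RUN_` prefix
trace_id	`Optional[TraceId]`	No	ID of the associated trace, if any
metadata	`Mapping[str, JSONSerializable]`	No	Additional metadata; defaults to an empty mapping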
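The result-or-error contract can be illustrated with a small helper that times an evaluator call and records exactly one of the two outcomes. This is a sketch under assumptions: run_evaluation is a hypothetical helper, not part of the Phoenix client, and it returns a plain dict rather than an ExperimentEvaluationRun.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Mapping


def run_evaluation(evaluate: Callable[..., Mapping[str, Any]], **inputs: Any) -> Dict[str, Any]:
    """Call an evaluator and capture timing plus either a result or an error."""
    start_time = datetime.now(timezone.utc)
    result = None
    error = None
    try:
        result = evaluate(**inputs)
    except Exception as exc:  # record the failure instead of propagating it
        error = repr(exc)
    end_time = datetime.now(timezone.utc)
    return {"start_time": start_time, "end_time": end_time, "result": result, "error": error}


def exact_match(*, output=None, expected=None, **kwargs):
    return {"score": float(output == expected)}


def always_fails(**kwargs):
    raise ValueError("evaluator failed")


ok_run = run_evaluation(exact_match, output="42", expected="42")
failed_run = run_evaluation(always_fails, output="42")
```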

Usage Examples

from phoenix.client.resources.experiments.types import BaseEvaluator, EvaluationResult

class ExactnessEvaluator(BaseEvaluator):
    """Evaluator that checks if output exactly matches expected."""

    _name = "exactness"

    def evaluate(self, *, output=None, expected=None, **kwargs) -> EvaluationResult:
        # A missing or empty expected mapping counts as a mismatch.
        is_match = bool(expected) and output == expected.get("answer")
        return {"label": "correct" if is_match else "incorrect", "score": float(is_match)}

evaluator = ExactnessEvaluator()
result = evaluator.evaluate(
    output="42",
    expected={"answer": "42"},
)
# {"label": "correct", "score": 1.0}

# Using ExampleProxy for backward compatibility
from phoenix.client.resources.experiments.types import ExampleProxy

example_dict = {
    "id": "ex1",
    "input": {"question": "What is 6*7?"},
    "output": {"answer": "42"},
    "metadata": {},
    "updated_at": "2025-01-01T00:00:00Z",
}
proxy = ExampleProxy(example_dict)
print(proxy.input)      # {"question": "What is 6*7?"}
print(proxy["output"])   # {"answer": "42"}
print(proxy.updated_at)  # datetime(2025, 1, 1, 0, 0, tzinfo=...)

Related Pages

Principle:Arize_ai_Phoenix_Experiment_Execution
Arize_ai_Phoenix_Client_Executors -- Executors that run experiment tasks and evaluations
Arize_ai_Phoenix_Generated_V1_Types -- Auto-generated types used by experiment types
