Implementation:Arize ai Phoenix Experiment Types
| Knowledge Sources | |
|---|---|
| Domains | AI_Observability, Client_SDK, Experiments |
| Last Updated | 2026-02-14 05:30 GMT |
Overview
Type definitions and protocols for experiments, evaluators, test cases, and evaluation results within the Phoenix client experiment framework.
Description
The Experiment Types module defines the complete type system for running and evaluating experiments in the Phoenix client. It bridges auto-generated API types with user-facing protocols and data classes.
Core Type Aliases:
- TaskOutput -- JSON-serializable output from experiment tasks (Optional[Union[dict, list, str, int, float, bool]]).
- ExampleInput, ExampleOutput, ExampleMetadata -- Mapping types for dataset example fields.
- Score, Label, Explanation -- Evaluation result primitives.
- Experiment and ExperimentRun are re-exported from the auto-generated v1 module.
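As a rough illustration of how aliases like these are typically used to annotate a task function, here is a minimal sketch. Only the TaskOutput union comes from the description above; the Mapping[str, Any] definition and the shout_task function are assumptions made for the example, not definitions copied from the module.

```python
from typing import Any, Mapping, Optional, Union

# TaskOutput follows the union given in the description above.
TaskOutput = Optional[Union[dict, list, str, int, float, bool]]

# Assumption: the example-field aliases are sketched here as Mapping[str, Any].
ExampleInput = Mapping[str, Any]


def shout_task(input: ExampleInput) -> TaskOutput:
    """Hypothetical task: upper-case the question, or return None if absent."""
    question = input.get("question")
    return question.upper() if isinstance(question, str) else None


result = shout_task({"question": "what is 6*7?"})
```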
Data Classes:
- ExperimentEvaluation -- Extends the auto-generated ExperimentEvaluationResult TypedDict with optional name and metadata fields. Represents a single evaluation result with optional label, score, and explanation.
- TestCase -- A frozen dataclass pairing a DatasetExample with a repetition_number, representing a single task invocation unit.
- ExperimentEvaluationRun -- A frozen dataclass capturing the full lifecycle of an evaluation run, including timing (start_time, end_time), annotator metadata, optional trace ID, and either a result or error. IDs are auto-generated with a DRY_RUN_ prefix by default.
- RanExperiment -- A TypedDict representing a completed experiment, containing the experiment ID, dataset metadata, task runs, evaluation runs, and optional project name. Used as input to evaluate_experiment for adding evaluations to a previously run experiment.
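The frozen-dataclass pattern with an auto-generated default ID can be sketched with the standard library alone. The class name EvaluationRunSketch and the body of _dry_run_id below are assumptions for illustration; the source only states that default IDs carry a DRY_RUN_ prefix, not how the suffix is produced.

```python
import secrets
from dataclasses import FrozenInstanceError, dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def _dry_run_id() -> str:
    # Hypothetical helper: only the DRY_RUN_ prefix is taken from the source.
    return "DRY_RUN_" + secrets.token_hex(4)


@dataclass(frozen=True)
class EvaluationRunSketch:
    """Stand-in with a subset of the ExperimentEvaluationRun fields."""

    experiment_run_id: str
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[Mapping[str, Any]] = None
    id: str = field(default_factory=_dry_run_id)


now = datetime.now(timezone.utc)
run = EvaluationRunSketch(
    experiment_run_id="run-1",
    start_time=now,
    end_time=now,
    name="exactness",
    annotator_kind="CODE",
    result={"score": 1.0},
)

try:
    run.name = "renamed"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the ID uses default_factory rather than a plain default, each instance gets its own generated value unless the caller passes an explicit id.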
Protocols:
- Evaluator -- A runtime_checkable Protocol requiring name and kind properties plus evaluate() and async_evaluate() methods. Both methods accept keyword arguments output, expected, metadata, and input along with **kwargs.
- EvalsEvaluator -- A Protocol for backward compatibility with the phoenix-evals package, requiring evaluate(), async_evaluate(), and attributes input_schema, direction, source, name.
- EvaluationScore -- A Protocol for individual score results from the evals package.
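To show what runtime_checkable means in practice, the following sketch defines a stand-in protocol with the same four members and checks two classes against it. EvaluatorSketch, LengthEvaluator, and NotAnEvaluator are hypothetical names for this example. Note that isinstance against a runtime-checkable protocol only verifies that the members exist; it does not validate their signatures.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EvaluatorSketch(Protocol):
    """Stand-in mirroring the members listed for the Evaluator protocol."""

    @property
    def name(self) -> str: ...

    @property
    def kind(self) -> str: ...

    def evaluate(self, *, output: Any = None, **kwargs: Any) -> dict: ...

    async def async_evaluate(self, *, output: Any = None, **kwargs: Any) -> dict: ...


class LengthEvaluator:
    """Satisfies the protocol structurally, without inheriting from it."""

    name = "length"
    kind = "CODE"

    def evaluate(self, *, output: Any = None, **kwargs: Any) -> dict:
        return {"score": float(len(output or ""))}

    async def async_evaluate(self, *, output: Any = None, **kwargs: Any) -> dict:
        return self.evaluate(output=output, **kwargs)


class NotAnEvaluator:
    """Has a name but none of the other required members."""

    name = "incomplete"


conforms = isinstance(LengthEvaluator(), EvaluatorSketch)
rejects = isinstance(NotAnEvaluator(), EvaluatorSketch)
```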
Abstract Base Class:
- BaseEvaluator -- An ABC implementing Evaluator with default behavior. Subclasses must implement at least one of evaluate() or async_evaluate(). The __init_subclass__ hook validates evaluator method signatures at class definition time, ensuring they accept **kwargs and use valid parameter names.
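Validation at class definition time can be sketched with __init_subclass__ and inspect.signature. This is an illustration of the general technique, not the module's actual hook: ValidatingBase and the _ALLOWED_PARAMS set are assumptions, with the allowed names taken from the keyword arguments listed for the Evaluator protocol.

```python
import inspect
from abc import ABC
from typing import Any

# Assumption: the valid names are the protocol's documented keyword arguments.
_ALLOWED_PARAMS = {"output", "expected", "metadata", "input"}


class ValidatingBase(ABC):
    """Hypothetical base class that checks subclass method signatures."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        for method_name in ("evaluate", "async_evaluate"):
            method = cls.__dict__.get(method_name)
            if method is None:
                continue  # not overridden by this subclass
            params = inspect.signature(method).parameters
            if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
                raise TypeError(f"{cls.__name__}.{method_name} must accept **kwargs")
            for pname, p in params.items():
                if pname == "self" or p.kind is inspect.Parameter.VAR_KEYWORD:
                    continue
                if pname not in _ALLOWED_PARAMS:
                    raise TypeError(
                        f"{cls.__name__}.{method_name} has invalid parameter {pname!r}"
                    )


class GoodEvaluator(ValidatingBase):
    def evaluate(self, *, output: Any = None, expected: Any = None, **kwargs: Any) -> dict:
        return {"score": float(output == expected)}


try:
    # The class statement itself raises, before any instance is created.
    class BadEvaluator(ValidatingBase):
        def evaluate(self, *, output: Any = None, expected: Any = None) -> dict:
            return {}

    definition_error = None
except TypeError as exc:
    definition_error = str(exc)
```

Failing at definition time surfaces a malformed evaluator when its module is imported rather than partway through an experiment run.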
Proxy:
- ExampleProxy -- An immutable Mapping[str, Any] proxy that wraps a v1.DatasetExample TypedDict to provide backward-compatible attribute access (e.g., example.input) while preserving dictionary-style access. It converts updated_at from string to datetime.
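The proxy idea can be approximated with a small read-only Mapping subclass. The sketch below is an assumption about how such a proxy might be built, not the module's code; ExampleProxySketch is a hypothetical name, and the handling of a trailing "Z" in the timestamp is a choice made here for compatibility with older Python versions.

```python
from datetime import datetime
from typing import Any, Iterator, Mapping


class ExampleProxySketch(Mapping[str, Any]):
    """Read-only mapping that also exposes its keys as attributes."""

    def __init__(self, wrapped: Mapping[str, Any]) -> None:
        data = dict(wrapped)  # copy so later changes to the source do not leak in
        raw = data.get("updated_at")
        if isinstance(raw, str):
            # datetime.fromisoformat accepts "Z" only on Python 3.11+, so normalize it.
            data["updated_at"] = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        self._data = data

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)  # avoids recursion before _data is set
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


proxy = ExampleProxySketch(
    {
        "id": "ex1",
        "input": {"question": "What is 6*7?"},
        "output": {"answer": "42"},
        "metadata": {},
        "updated_at": "2025-01-01T00:00:00Z",
    }
)
```

Since Mapping defines no __setitem__, item assignment such as proxy["id"] = "x" raises TypeError, which gives the dictionary-style interface its read-only behavior.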
Usage
Use these types when implementing custom experiment tasks and evaluators. Subclass BaseEvaluator for structured evaluators, or implement the Evaluator protocol directly. The type aliases ensure type safety across experiment creation, execution, and evaluation workflows.
Code Reference
Source Location
- Repository: Arize_ai_Phoenix
- File: packages/phoenix-client/src/phoenix/client/resources/experiments/types.py
Signature
class ExperimentEvaluation(v1.ExperimentEvaluationResult, total=False):
    name: Optional[str]
    metadata: Mapping[str, Any]

@dataclass(frozen=True)
class TestCase:
    example: v1.DatasetExample
    repetition_number: RepetitionNumber

@dataclass(frozen=True)
class ExperimentEvaluationRun:
    experiment_run_id: ExperimentRunId
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[EvaluationResult] = None
    id: str = field(default_factory=_dry_run_id)
    trace_id: Optional[TraceId] = None
    metadata: Mapping[str, JSONSerializable] = field(default_factory=dict)

@runtime_checkable
class Evaluator(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def kind(self) -> str: ...
    def evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...
    async def async_evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...

class BaseEvaluator(ABC, Evaluator):
    _kind: AnnotatorKind
    _name: EvaluatorName

class RanExperiment(TypedDict):
    experiment_id: ExperimentId
    dataset_id: DatasetId
    dataset_version_id: DatasetVersionId
    task_runs: list[ExperimentRun]
    evaluation_runs: list[ExperimentEvaluationRun]
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]

class ExampleProxy(Mapping[str, Any]):
    def __init__(self, wrapped: v1.DatasetExample) -> None: ...
Import
from phoenix.client.resources.experiments.types import (
    ExperimentEvaluation,
    TestCase,
    ExperimentEvaluationRun,
    BaseEvaluator,
    Evaluator,
    ExampleProxy,
    RanExperiment,
)
I/O Contract
Evaluator Protocol
| Method | Parameters | Returns | Description |
|---|---|---|---|
| evaluate() | output: Optional[TaskOutput], expected: Optional[ExampleOutput], metadata: ExampleMetadata, input: ExampleInput, **kwargs | EvaluationResult | Synchronous evaluation |
| async_evaluate() | (same as evaluate) | EvaluationResult | Asynchronous evaluation |
EvaluationResult
| Field | Type | Required | Description |
|---|---|---|---|
| label | str | No | Categorical label for the evaluation |
| score | float | No | Numeric score |
| explanation | str | No | Free-text explanation |
ExperimentEvaluationRun
| Field | Type | Required | Description |
|---|---|---|---|
| experiment_run_id | str | Yes | ID of the experiment run being evaluated |
| start_time | datetime | Yes | When evaluation started |
| end_time | datetime | Yes | When evaluation finished |
| name | str | Yes | Name of the evaluator |
| annotator_kind | str | Yes | "CODE" or "LLM" |
| result | Optional[EvaluationResult] | No | Evaluation result (required if error is None) |
| error | Optional[str] | No | Error message (required if result is None) |
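The "either a result or an error" contract, together with the timing fields, can be illustrated by a small wrapper that times a call and records whichever outcome occurs. This is a sketch of the contract only; record_evaluation and always_fails are hypothetical helpers, and nothing here reflects how the Phoenix executors actually populate ExperimentEvaluationRun.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Mapping, Optional


def record_evaluation(
    evaluate: Callable[..., Mapping[str, Any]], **kwargs: Any
) -> Dict[str, Any]:
    """Hypothetical helper: run an evaluator and capture result or error."""
    start_time = datetime.now(timezone.utc)
    result: Optional[Mapping[str, Any]] = None
    error: Optional[str] = None
    try:
        result = evaluate(**kwargs)
    except Exception as exc:  # record the failure instead of propagating it
        error = repr(exc)
    end_time = datetime.now(timezone.utc)
    return {
        "start_time": start_time,
        "end_time": end_time,
        "result": result,
        "error": error,
    }


def always_fails(**kwargs: Any) -> Mapping[str, Any]:
    raise ValueError("boom")


ok = record_evaluation(lambda **kw: {"score": 1.0}, output="42")
failed = record_evaluation(always_fails, output="42")
```

In both outcomes exactly one of result and error is set, matching the "required if the other is None" notes in the table above.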
Usage Examples
from phoenix.client.resources.experiments.types import BaseEvaluator, EvaluationResult

class ExactnessEvaluator(BaseEvaluator):
    """Evaluator that checks if output exactly matches expected."""
    _name = "exactness"

    def evaluate(self, *, output=None, expected=None, **kwargs) -> EvaluationResult:
        # A missing or empty expected mapping counts as a mismatch.
        is_match = (output == expected.get("answer")) if expected else False
        return {"label": "correct" if is_match else "incorrect", "score": float(is_match)}

evaluator = ExactnessEvaluator()
result = evaluator.evaluate(
    output="42",
    expected={"answer": "42"},
)
# {"label": "correct", "score": 1.0}

# Using ExampleProxy for backward compatibility
from phoenix.client.resources.experiments.types import ExampleProxy

example_dict = {
    "id": "ex1",
    "input": {"question": "What is 6*7?"},
    "output": {"answer": "42"},
    "metadata": {},
    "updated_at": "2025-01-01T00:00:00Z",
}
proxy = ExampleProxy(example_dict)
print(proxy.input)       # {"question": "What is 6*7?"}
print(proxy["output"])   # {"answer": "42"}
print(proxy.updated_at)  # datetime(2025, 1, 1, 0, 0, tzinfo=...)
Related Pages
- Principle:Arize_ai_Phoenix_Experiment_Execution
- Arize_ai_Phoenix_Client_Executors -- Executors that run experiment tasks and evaluations
- Arize_ai_Phoenix_Generated_V1_Types -- Auto-generated types used by experiment types