Implementation:Arize ai Phoenix Experiment Types

From Leeroopedia
Knowledge Sources
Domains AI_Observability, Client_SDK, Experiments
Last Updated 2026-02-14 05:30 GMT

Overview

Type definitions and protocols for experiments, evaluators, test cases, and evaluation results within the Phoenix client experiment framework.

Description

The Experiment Types module defines the complete type system for running and evaluating experiments in the Phoenix client. It bridges auto-generated API types with user-facing protocols and data classes.

Core Type Aliases:

  • TaskOutput -- JSON-serializable output from experiment tasks (Optional[Union[dict, list, str, int, float, bool]]).
  • ExampleInput, ExampleOutput, ExampleMetadata -- Mapping types for dataset example fields.
  • Score, Label, Explanation -- Evaluation result primitives.
  • Experiment, ExperimentRun -- Re-exported from the auto-generated v1 module.
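
As a rough illustration of how these aliases might be used to annotate a task function, here is a minimal sketch. The alias definitions below are stand-ins written from the descriptions above, not copied from the Phoenix source; in particular the Optional types chosen for Score, Label, and Explanation are assumptions, and question_length_task is a hypothetical task.

```python
import json
from typing import Any, Mapping, Optional, Union

# Stand-in aliases that follow the descriptions above. The exact definitions
# in the Phoenix client are an assumption here, especially Score, Label,
# and Explanation.
TaskOutput = Optional[Union[dict, list, str, int, float, bool]]
ExampleInput = Mapping[str, Any]
ExampleOutput = Mapping[str, Any]
ExampleMetadata = Mapping[str, Any]
Score = Optional[float]
Label = Optional[str]
Explanation = Optional[str]


def question_length_task(input: ExampleInput) -> TaskOutput:
    """Hypothetical task: report the length of the question text."""
    question = str(input.get("question", ""))
    return {"question": question, "length": len(question)}


output = question_length_task({"question": "What is 6*7?"})

# A TaskOutput is meant to be JSON-serializable, so a round trip
# through json should leave it unchanged.
round_tripped = json.loads(json.dumps(output))
```

Keeping task outputs to plain dicts, lists, and scalars is what makes the JSON round trip safe.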

Data Classes and TypedDicts:

  • ExperimentEvaluation -- Extends the auto-generated ExperimentEvaluationResult TypedDict with optional name and metadata fields. Represents a single evaluation result with optional label, score, and explanation.
  • TestCase -- A frozen dataclass pairing a DatasetExample with a repetition_number, representing a single task invocation unit.
  • ExperimentEvaluationRun -- A frozen dataclass capturing the full lifecycle of an evaluation run, including timing (start_time, end_time), annotator metadata, optional trace ID, and either a result or error. IDs are auto-generated with a DRY_RUN_ prefix by default.
  • RanExperiment -- A TypedDict representing a completed experiment, containing the experiment ID, dataset metadata, task runs, evaluation runs, and optional project name. Used as input to evaluate_experiment for adding evaluations to a previously run experiment.
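
The frozen-dataclass pattern with an auto-generated DRY_RUN_ identifier can be sketched with the standard library alone. EvaluationRunSketch is a hypothetical, trimmed stand-in for ExperimentEvaluationRun, and the body of _dry_run_id is an assumption: the page only states that default IDs carry a DRY_RUN_ prefix, not how the suffix is produced.

```python
import secrets
from dataclasses import FrozenInstanceError, dataclass, field
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def _dry_run_id() -> str:
    # Assumed implementation: only the DRY_RUN_ prefix comes from the page.
    return "DRY_RUN_" + secrets.token_hex(8)


@dataclass(frozen=True)
class EvaluationRunSketch:
    """Trimmed, hypothetical stand-in for ExperimentEvaluationRun."""

    experiment_run_id: str
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[Mapping[str, Any]] = None
    # default_factory runs once per instance, so every run gets its own ID.
    id: str = field(default_factory=_dry_run_id)


now = datetime.now(timezone.utc)
run = EvaluationRunSketch(
    experiment_run_id="run-1",
    start_time=now,
    end_time=now,
    name="exactness",
    annotator_kind="CODE",
    result={"score": 1.0},
)

# frozen=True turns attribute assignment into an error.
is_frozen = False
try:
    run.name = "renamed"  # type: ignore[misc]
except FrozenInstanceError:
    is_frozen = True
```

Freezing the record means a run captured during evaluation cannot be altered accidentally before it is reported.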

Protocols:

  • Evaluator -- A runtime_checkable Protocol requiring name and kind properties plus evaluate() and async_evaluate() methods. Both methods accept keyword arguments output, expected, metadata, and input along with **kwargs.
  • EvalsEvaluator -- A Protocol for backward compatibility with the phoenix-evals package, requiring evaluate(), async_evaluate(), and attributes input_schema, direction, source, name.
  • EvaluationScore -- A Protocol for individual score results from the evals package.
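
To show what runtime_checkable buys in practice, the sketch below defines a stand-in protocol with the same members as Evaluator and checks two plain classes against it. EvaluatorLike, ContainsExpected, and NotAnEvaluator are hypothetical names, and the None defaults for metadata and input are an assumption. Note that isinstance() on a runtime-checkable protocol only verifies that the members exist; it does not check their signatures.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class EvaluatorLike(Protocol):
    """Stand-in with the same members as the Evaluator protocol."""

    @property
    def name(self) -> str: ...

    @property
    def kind(self) -> str: ...

    def evaluate(
        self, *, output=None, expected=None, metadata=None, input=None, **kwargs: Any
    ) -> Mapping[str, Any]: ...

    async def async_evaluate(
        self, *, output=None, expected=None, metadata=None, input=None, **kwargs: Any
    ) -> Mapping[str, Any]: ...


class ContainsExpected:
    """Satisfies the protocol structurally, with no inheritance."""

    name = "contains_expected"
    kind = "CODE"

    def evaluate(self, *, output=None, expected=None, **kwargs: Any) -> Mapping[str, Any]:
        found = expected is not None and str(expected.get("answer")) in str(output)
        return {"label": "hit" if found else "miss", "score": float(found)}

    async def async_evaluate(self, **kwargs: Any) -> Mapping[str, Any]:
        return self.evaluate(**kwargs)


class NotAnEvaluator:
    name = "incomplete"  # lacks kind, evaluate, and async_evaluate


conforms = isinstance(ContainsExpected(), EvaluatorLike)
rejected = isinstance(NotAnEvaluator(), EvaluatorLike)
result = ContainsExpected().evaluate(
    output="The answer is 42", expected={"answer": "42"}
)
```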

Abstract Base Class:

  • BaseEvaluator -- An ABC implementing Evaluator with default behavior. Subclasses must implement at least one of evaluate() or async_evaluate(). The __init_subclass__ hook validates evaluator method signatures at class definition time, ensuring they accept **kwargs and use valid parameter names.
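
Signature validation at class definition time can be approximated with __init_subclass__ and inspect.signature. This is a sketch of the general technique, not the Phoenix implementation: CheckedEvaluator is a hypothetical base class, and ALLOWED_NAMES is an assumed whitelist built from the keyword arguments listed for the Evaluator protocol.

```python
import inspect
from abc import ABC

# Assumed whitelist, taken from the keyword arguments the protocol lists.
ALLOWED_NAMES = {"output", "expected", "metadata", "input"}


class CheckedEvaluator(ABC):
    """Hypothetical base class that validates subclass method signatures."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for method_name in ("evaluate", "async_evaluate"):
            method = cls.__dict__.get(method_name)
            if method is None:
                continue  # this subclass does not override the method
            params = list(inspect.signature(method).parameters.values())
            if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
                raise TypeError(f"{cls.__name__}.{method_name} must accept **kwargs")
            for p in params:
                if p.kind is inspect.Parameter.VAR_KEYWORD or p.name == "self":
                    continue
                if p.name not in ALLOWED_NAMES:
                    raise ValueError(
                        f"{cls.__name__}.{method_name} has unknown parameter {p.name!r}"
                    )


class Good(CheckedEvaluator):
    def evaluate(self, *, output=None, expected=None, **kwargs):
        return {"score": float(output == expected)}


# The class statement itself fails, before any instance is created.
rejection = ""
try:

    class MissingKwargs(CheckedEvaluator):
        def evaluate(self, *, output=None):
            return {"score": 0.0}

except TypeError as exc:
    rejection = str(exc)
```

Failing at class definition surfaces a malformed evaluator at import time instead of partway through an experiment run.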

Proxy:

  • ExampleProxy -- An immutable Mapping[str, Any] proxy that wraps a v1.DatasetExample TypedDict to provide backward-compatible attribute access (e.g., example.input) while preserving dictionary-style access. It converts updated_at from string to datetime.
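
A read-only mapping proxy with attribute access can be built on collections.abc.Mapping. The sketch below is an assumption about the general shape of such a proxy, not the Phoenix source: ExampleProxySketch is a hypothetical name, and the ISO 8601 parsing of updated_at (including the "Z" suffix handling) is a guess at the conversion the page describes.

```python
from collections.abc import Mapping
from datetime import datetime
from typing import Any, Iterator


class ExampleProxySketch(Mapping):
    """Hypothetical read-only proxy over a dataset example dict."""

    def __init__(self, wrapped: Mapping) -> None:
        data = dict(wrapped)  # copy so the caller's dict is left untouched
        updated_at = data.get("updated_at")
        if isinstance(updated_at, str):
            # Assumed format: ISO 8601, possibly with a trailing "Z".
            data["updated_at"] = datetime.fromisoformat(
                updated_at.replace("Z", "+00:00")
            )
        self._data = data

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __getattr__(self, name: str) -> Any:
        # Only called when normal attribute lookup fails. Reading through
        # __dict__ avoids infinite recursion if _data is not set yet.
        data = self.__dict__.get("_data", {})
        if name in data:
            return data[name]
        raise AttributeError(name)


proxy = ExampleProxySketch(
    {
        "id": "ex1",
        "input": {"question": "What is 6*7?"},
        "output": {"answer": "42"},
        "metadata": {},
        "updated_at": "2025-01-01T00:00:00Z",
    }
)

# Mapping defines no __setitem__, so item assignment is rejected.
is_read_only = False
try:
    proxy["id"] = "other"  # type: ignore[index]
except TypeError:
    is_read_only = True
```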

Usage

Use these types when implementing custom experiment tasks and evaluators. Subclass BaseEvaluator for structured evaluators, or implement the Evaluator protocol directly. Annotating tasks and evaluators with the type aliases lets a static type checker catch mismatches across experiment creation, execution, and evaluation code.

Code Reference

Source Location

Signature

class ExperimentEvaluation(v1.ExperimentEvaluationResult, total=False):
    name: Optional[str]
    metadata: Mapping[str, Any]

@dataclass(frozen=True)
class TestCase:
    example: v1.DatasetExample
    repetition_number: RepetitionNumber

@dataclass(frozen=True)
class ExperimentEvaluationRun:
    experiment_run_id: ExperimentRunId
    start_time: datetime
    end_time: datetime
    name: str
    annotator_kind: str
    error: Optional[str] = None
    result: Optional[EvaluationResult] = None
    id: str = field(default_factory=_dry_run_id)
    trace_id: Optional[TraceId] = None
    metadata: Mapping[str, JSONSerializable] = field(default_factory=dict)

@runtime_checkable
class Evaluator(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def kind(self) -> str: ...
    def evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...
    async def async_evaluate(
        self, *, output=None, expected=None, metadata=..., input=..., **kwargs
    ) -> EvaluationResult: ...

class BaseEvaluator(ABC, Evaluator):
    _kind: AnnotatorKind
    _name: EvaluatorName

class RanExperiment(TypedDict):
    experiment_id: ExperimentId
    dataset_id: DatasetId
    dataset_version_id: DatasetVersionId
    task_runs: list[ExperimentRun]
    evaluation_runs: list[ExperimentEvaluationRun]
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]

class ExampleProxy(Mapping[str, Any]):
    def __init__(self, wrapped: v1.DatasetExample) -> None: ...

Import

from phoenix.client.resources.experiments.types import (
    ExperimentEvaluation,
    TestCase,
    ExperimentEvaluationRun,
    BaseEvaluator,
    Evaluator,
    ExampleProxy,
    RanExperiment,
)

I/O Contract

Evaluator Protocol

Method | Parameters | Returns | Description
evaluate() | output: Optional[TaskOutput], expected: Optional[ExampleOutput], metadata: ExampleMetadata, input: ExampleInput, **kwargs | EvaluationResult | Synchronous evaluation
async_evaluate() | (same as evaluate()) | EvaluationResult | Asynchronous evaluation

EvaluationResult

Field | Type | Required | Description
label | str | No | Categorical label for the evaluation
score | float | No | Numeric score
explanation | str | No | Free-text explanation

ExperimentEvaluationRun

Field | Type | Required | Description
experiment_run_id | str | Yes | ID of the experiment run being evaluated
start_time | datetime | Yes | When evaluation started
end_time | datetime | Yes | When evaluation finished
name | str | Yes | Name of the evaluator
annotator_kind | str | Yes | "CODE" or "LLM"
result | Optional[EvaluationResult] | No | Evaluation result (required if error is None)
error | Optional[str] | No | Error message (required if result is None)
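
The result-or-error rule above can be illustrated with a small wrapper that times an evaluator call and records exactly one of the two fields. record_evaluation, exact_match, and always_fails are hypothetical helpers written for this sketch; how the Phoenix client actually populates these fields is not shown on this page.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Mapping, Optional


def record_evaluation(
    evaluate: Callable[..., Mapping[str, Any]], **kwargs: Any
) -> dict:
    """Call an evaluator and capture timing plus either a result or an error."""
    start_time = datetime.now(timezone.utc)
    result: Optional[Mapping[str, Any]] = None
    error: Optional[str] = None
    try:
        result = evaluate(**kwargs)
    except Exception as exc:  # keep the failure as text instead of re-raising
        error = repr(exc)
    end_time = datetime.now(timezone.utc)
    return {
        "start_time": start_time,
        "end_time": end_time,
        "result": result,
        "error": error,
    }


def exact_match(*, output=None, expected=None, **kwargs: Any) -> Mapping[str, Any]:
    return {"score": float(output == expected)}


def always_fails(**kwargs: Any) -> Mapping[str, Any]:
    raise ValueError("evaluator crashed")


ok = record_evaluation(exact_match, output="42", expected="42")
failed = record_evaluation(always_fails, output="42")
```

In this sketch a successful call leaves error as None and a failed call leaves result as None, matching the two "required if" notes in the table.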

Usage Examples

from phoenix.client.resources.experiments.types import BaseEvaluator, EvaluationResult

class ExactnessEvaluator(BaseEvaluator):
    """Evaluator that checks if output exactly matches expected."""

    _name = "exactness"

    def evaluate(self, *, output=None, expected=None, **kwargs) -> EvaluationResult:
        # A missing or empty expected mapping counts as a mismatch.
        is_match = bool(expected) and output == expected.get("answer")
        return {"label": "correct" if is_match else "incorrect", "score": float(is_match)}

evaluator = ExactnessEvaluator()
result = evaluator.evaluate(
    output="42",
    expected={"answer": "42"},
)
# {"label": "correct", "score": 1.0}

# Using ExampleProxy for backward compatibility
from phoenix.client.resources.experiments.types import ExampleProxy

example_dict = {
    "id": "ex1",
    "input": {"question": "What is 6*7?"},
    "output": {"answer": "42"},
    "metadata": {},
    "updated_at": "2025-01-01T00:00:00Z",
}
proxy = ExampleProxy(example_dict)
print(proxy.input)      # {"question": "What is 6*7?"}
print(proxy["output"])   # {"answer": "42"}
print(proxy.updated_at)  # datetime(2025, 1, 1, 0, 0, tzinfo=...)
