Principle:Arize ai Phoenix Experiment Task Definition

Knowledge Sources	Phoenix Phoenix Client Experiments
Domains	AI Observability, Experiment Design, Evaluation Infrastructure
Last Updated	2026-02-14 00:00 GMT

Overview

Experiment task definition is the practice of encapsulating a unit of work as a callable function whose parameters are dynamically bound to dataset example fields, enabling reproducible execution across a collection of evaluation inputs.

Description

In AI evaluation workflows, a task represents the system under test. It is a user-defined function that takes structured input from a dataset example and produces a JSON-serializable output. The task function is the bridge between the static evaluation dataset and the dynamic behavior of the system being evaluated.

The core innovation of the task definition pattern is dynamic parameter binding by name. Rather than requiring a fixed function signature, the framework inspects the parameter names of the task function and automatically binds them to corresponding fields from the dataset example. This allows task authors to request exactly the data they need without boilerplate extraction logic.

The available binding names and their corresponding values are:

input: The input field of the dataset example (a dictionary of key-value pairs representing the test input).
expected: The expected or reference output from the dataset example (the ground truth against which results may be compared).
reference: An alias for expected, providing a more intuitive name in certain contexts.
metadata: Metadata associated with the dataset example (additional context that may influence task behavior).
example: The complete Example object with all associated fields, for tasks that need access to the full example structure.

For single-argument functions, the argument is bound to the input field unless the parameter name is itself one of the binding names above, in which case it is bound to that field. This provides a convenient shorthand for the most common case.
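As a rough illustration of how task signatures select fields, the sketch below defines three task functions and calls them by hand with the values the framework would be expected to bind. The example_fields dictionary and all function names are hypothetical; a real run would receive these values from a dataset example rather than from a plain dictionary.

```python
from typing import Any, Dict

# Hypothetical example fields; the keys mirror the binding names listed above.
example_fields: Dict[str, Any] = {
    "input": {"question": "What is the capital of France?"},
    "expected": {"answer": "Paris"},
    "metadata": {"topic": "geography"},
}


def echo_question(input: Dict[str, Any]) -> str:
    """Single parameter: receives only the input field."""
    return input["question"]


def tag_with_topic(input: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, str]:
    """Two parameters: receives the input and metadata fields by name."""
    return {"question": input["question"], "topic": metadata["topic"]}


def include_reference(input: Dict[str, Any], reference: Dict[str, Any]) -> Dict[str, str]:
    """`reference` is an alias for `expected`, so it receives the expected output."""
    return {"question": input["question"], "reference_answer": reference["answer"]}


# Calling the tasks by hand, passing the values the framework would bind by name.
single = echo_question(example_fields["input"])
tagged = tag_with_topic(example_fields["input"], example_fields["metadata"])
with_reference = include_reference(example_fields["input"], example_fields["expected"])
```

Each function declares only the fields it uses, so none of them needs to unpack a full example object.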

Usage

Task definition should be applied in the following scenarios:

LLM evaluation: When testing how a language model responds to a set of prompts, the task function wraps the model call and returns the generated response.
Pipeline testing: When evaluating a multi-step processing pipeline, the task function encapsulates the entire pipeline from input to output.
A/B comparison: When comparing different system configurations, each configuration is expressed as a separate task function run against the same dataset.
Regression testing: When verifying that system behavior has not degraded, the task function captures the current system behavior for comparison against expected outputs.
Context-aware tasks: When the task needs access to metadata (such as retrieval context or user preferences) in addition to the primary input, multi-parameter binding allows clean access to all relevant fields.
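The A/B comparison scenario can be sketched in a few lines, assuming a toy dataset of plain dictionaries and a hypothetical run_over_dataset helper that stands in for the experiment runner. Neither is part of Phoenix; the point is only that each configuration is its own task function and both see the same examples.

```python
from typing import Any, Callable, Dict, List

# Hypothetical dataset: each example carries only an input field.
dataset: List[Dict[str, Any]] = [
    {"input": {"text": "Hello World"}},
    {"input": {"text": "Phoenix"}},
]


def task_lowercase(input: Dict[str, Any]) -> str:
    """Configuration A: normalize text to lower case."""
    return input["text"].lower()


def task_uppercase(input: Dict[str, Any]) -> str:
    """Configuration B: normalize text to upper case."""
    return input["text"].upper()


def run_over_dataset(task: Callable[[Dict[str, Any]], Any],
                     examples: List[Dict[str, Any]]) -> List[Any]:
    """Illustrative stand-in for an experiment runner: apply the task to every input."""
    return [task(example["input"]) for example in examples]


# One result list per configuration, keyed by the task function's name.
results = {
    task.__name__: run_over_dataset(task, dataset)
    for task in (task_lowercase, task_uppercase)
}
```

Because both tasks run against the same examples, their outputs can be compared position by position.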

Theoretical Basis

The task definition pattern is grounded in the concept of parameterized test execution from software testing theory, adapted for AI evaluation.

The type definition for an experiment task is:

from typing import Any, Awaitable, Callable, Union

ExperimentTask = Union[
    Callable[..., Any],             # Sync function returning JSON-serializable output
    Callable[..., Awaitable[Any]],  # Async function returning JSON-serializable output
]

The parameter binding algorithm follows these steps:

1. Inspect the function signature to extract parameter names.
2. If the function has exactly one parameter:
   a. If the parameter name matches a known binding name, bind to that field.
   b. Otherwise, bind the single parameter to the "input" field.
3. If the function has multiple parameters:
   a. For each parameter name, look up the corresponding value from the example.
   b. Known binding names: {input, expected, reference, metadata, example}.
   c. Parameters outside this set are permitted only if they have a default value or are **kwargs; the framework leaves them unbound.
4. Deep-copy bound values to prevent task functions from mutating shared state.
5. Execute the function with the bound arguments.
6. Capture the return value as the task output (must be JSON-serializable).

This binding mechanism follows the dependency injection pattern, where the framework resolves dependencies based on naming conventions rather than explicit wiring. The deep-copy step ensures isolation between task executions, preventing one example's processing from affecting another's.
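The steps above can be approximated with the standard inspect and copy modules. This is a minimal sketch under stated assumptions, not the Phoenix implementation: the example is modeled as a plain dictionary, and bind_task_arguments and run_task are hypothetical names chosen for illustration.

```python
import copy
import inspect
from typing import Any, Callable, Dict


def bind_task_arguments(task: Callable[..., Any],
                        example: Dict[str, Any]) -> inspect.BoundArguments:
    """Bind a task's parameters to example fields by name (illustrative only)."""
    # Values available for binding; "reference" is an alias for "expected".
    available = {
        "input": example.get("input"),
        "expected": example.get("expected"),
        "reference": example.get("expected"),
        "metadata": example.get("metadata"),
        "example": example,
    }
    signature = inspect.signature(task)
    params = signature.parameters

    if len(params) == 1:
        # Single parameter: use the named field if recognized, else fall back to input.
        name = next(iter(params))
        value = available[name] if name in available else available["input"]
        return signature.bind(copy.deepcopy(value))

    # Multiple parameters: bind recognized names; parameters with defaults
    # and **kwargs are simply left unbound.
    bound = {
        name: copy.deepcopy(available[name])
        for name in params
        if name in available
    }
    return signature.bind_partial(**bound)


def run_task(task: Callable[..., Any], example: Dict[str, Any]) -> Any:
    """Bind the arguments, then execute the task and return its output."""
    bound = bind_task_arguments(task, example)
    return task(*bound.args, **bound.kwargs)
```

Because every bound value is a deep copy, a task that mutates its argument leaves the original example untouched, which is the isolation property described above.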

The support for both synchronous and asynchronous task functions enables efficient execution patterns. Synchronous tasks are suitable for CPU-bound computations, while asynchronous tasks allow concurrent I/O-bound operations (such as API calls to language models) that can be parallelized across dataset examples.
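One way a runner might treat the two kinds of task uniformly is sketched below with asyncio. The run_all helper, its concurrency parameter, and the fake_model_call task are assumptions made for illustration; asyncio.sleep stands in for a real network call.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Dict, List, Union

Task = Union[Callable[..., Any], Callable[..., Awaitable[Any]]]


def word_count(input: Dict[str, Any]) -> int:
    """Synchronous task: a cheap local computation."""
    return len(input["text"].split())


async def fake_model_call(input: Dict[str, Any]) -> str:
    """Asynchronous task: the sleep simulates waiting on a remote model."""
    await asyncio.sleep(0.01)
    return input["text"].upper()


async def run_all(task: Task, inputs: List[Dict[str, Any]],
                  concurrency: int = 3) -> List[Any]:
    """Run a task over all inputs, awaiting it only if it is a coroutine function."""
    semaphore = asyncio.Semaphore(concurrency)  # cap the number of in-flight calls

    async def run_one(task_input: Dict[str, Any]) -> Any:
        async with semaphore:
            if inspect.iscoroutinefunction(task):
                return await task(task_input)
            return task(task_input)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in inputs))


inputs = [{"text": "hello world"}, {"text": "phoenix"}]
sync_outputs = asyncio.run(run_all(word_count, inputs))
async_outputs = asyncio.run(run_all(fake_model_call, inputs))
```

In this sketch only the asynchronous task actually overlaps its waiting time across examples; the synchronous task still runs one call at a time inside the event loop.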

The task output must be JSON-serializable to ensure it can be stored in the Phoenix database, transmitted over the wire, and consumed by downstream evaluators. This constraint enforces a clean separation between the task execution and result analysis phases of the experiment.
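A quick local check for this constraint can be written with the standard json module. The helper and task names below are hypothetical, and the check reflects what json.dumps accepts; the framework's own serializer may handle additional types.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def is_json_serializable(value: Any) -> bool:
    """Return True if the standard json module can encode the value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


def good_task(input: Dict[str, Any]) -> Dict[str, Any]:
    """Returns only strings and numbers, which encode cleanly."""
    return {"answer": input["question"].strip(), "length": len(input["question"])}


def bad_task(input: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a raw datetime, which json.dumps rejects."""
    return {"answered_at": datetime.now(timezone.utc)}


def fixed_task(input: Dict[str, Any]) -> Dict[str, Any]:
    """Converts the datetime to an ISO 8601 string before returning it."""
    return {"answered_at": datetime.now(timezone.utc).isoformat()}


good_ok = is_json_serializable(good_task({"question": " hi "}))
bad_ok = is_json_serializable(bad_task({"question": "hi"}))
fixed_ok = is_json_serializable(fixed_task({"question": "hi"}))
```

Converting non-serializable values to strings or numbers inside the task keeps the conversion next to the code that produced them.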

Related Pages

Implemented By

Implementation:Arize_ai_Phoenix_Experiment_Task_Interface
