Principle: Arize AI Phoenix Experiment Execution
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Experiment Execution, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Experiment execution is the orchestrated process of systematically applying a task function to every example in a dataset, optionally evaluating the results, and persisting the complete record of runs and evaluations for analysis and comparison.
Description
An experiment in the context of AI evaluation is the combination of a dataset, a task function, and zero or more evaluators, executed together to produce a comprehensive record of system behavior. Experiment execution is the core orchestration mechanism that ties together dataset management, task definition, and evaluator definition into a single reproducible workflow.
The experiment execution process follows these phases:
- Setup: The experiment is configured with a dataset, task, evaluators, and metadata. A unique experiment record is created on the Phoenix server.
- Task execution: The task function is executed against each example in the dataset. Each execution produces a "run" that records the task output, any errors, timing information, and associated trace data.
- Evaluation: After task execution, each evaluator is applied to the task runs. Evaluation results (scores, labels, explanations) are associated with their corresponding runs.
- Persistence: All runs and evaluation results are stored in the Phoenix database, enabling later retrieval, comparison, and analysis through both the API and the Phoenix UI.
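The four phases can be condensed into a minimal sketch. The helper names and record shapes here (`run_experiment`, the run and evaluation dictionaries) are illustrative stand-ins for the orchestration Phoenix performs, not its actual API:

```python
import uuid

def run_experiment(dataset, task, evaluators):
    """Minimal sketch of the setup -> task -> evaluation -> persistence flow."""
    # Setup: create a unique experiment record.
    experiment_id = str(uuid.uuid4())
    task_runs, evaluation_runs = [], []

    # Task execution: one run per example, recording output or error.
    for example in dataset:
        try:
            output, error = task(example["input"]), None
        except Exception as exc:
            output, error = None, str(exc)
        task_runs.append({"example_id": example["id"], "output": output, "error": error})

    # Evaluation: apply every evaluator to every successful run.
    for run in task_runs:
        if run["error"] is not None:
            continue
        for evaluator in evaluators:
            evaluation_runs.append({
                "run_id": run["example_id"],  # sketch reuses example_id as run id
                "evaluator_name": evaluator.__name__,
                **evaluator(run["output"]),
            })

    # Persistence: here we simply return the record; Phoenix stores it server-side.
    return {"experiment_id": experiment_id,
            "task_runs": task_runs,
            "evaluation_runs": evaluation_runs}
```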
Key design decisions in the experiment execution model include:
- Retry resilience: Failed task executions are automatically retried up to a configurable number of times, handling transient errors such as API rate limits or network timeouts.
- Rate limit awareness: The framework can adaptively throttle execution when encountering specified rate limit exceptions, preventing cascading failures when calling external APIs.
- Dry run mode: Experiments can be run in dry-run mode where results are not persisted, enabling rapid iteration during development. Dry run supports running on a single random example (boolean true) or a specified number of random examples (integer).
- Repetitions: Each example can be processed multiple times to measure output variance, which is particularly valuable for non-deterministic systems like language models.
- Client delegation: The module-level functions are thin wrappers that delegate to the Client (or AsyncClient) instance, which manages HTTP communication with the Phoenix server.
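The client-delegation pattern in the last bullet can be sketched as a module-level function forwarding to a shared client instance; `Client` and its default URL below are illustrative, not Phoenix's actual class layout:

```python
class Client:
    """Illustrative client that owns HTTP communication with the server."""
    def __init__(self, base_url="http://localhost:6006"):
        self.base_url = base_url

    def run_experiment(self, dataset, task, evaluators=()):
        # The real client would issue HTTP requests here; this sketch
        # just records what was requested.
        return {"base_url": self.base_url, "n_examples": len(dataset)}

_default_client = Client()

def run_experiment(dataset, task, evaluators=()):
    """Module-level thin wrapper that delegates to a shared Client."""
    return _default_client.run_experiment(dataset, task, evaluators)
```

The benefit of this layering is that the same orchestration logic serves both the convenience functions and callers that construct their own client with custom configuration.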
Usage
Experiment execution should be applied in the following scenarios:
- Model evaluation: When systematically testing a model or prompt against a curated dataset to measure quality metrics.
- Regression testing: When running an established benchmark after system changes to detect behavioral regressions.
- Prompt engineering: When comparing different prompt formulations by running each as a separate experiment against the same dataset.
- A/B testing: When evaluating two system variants side-by-side using identical evaluation data and criteria.
- Development iteration: When using dry-run mode to quickly test task and evaluator functions on a subset of examples before committing to a full experiment run.
- Variance measurement: When using repetitions to understand the distribution of outputs for non-deterministic systems.
- Resilient batch processing: When executing long-running tasks that may encounter transient failures, leveraging the retry and rate limit mechanisms.
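For example, the prompt-engineering scenario reduces to running each formulation as its own pass over the same dataset and comparing aggregate scores. The tasks and evaluator below are toy stand-ins for LLM calls:

```python
def run_variant(dataset, task, evaluator):
    """Run one task variant over the dataset and return the mean score."""
    scores = [evaluator(task(ex))["score"] for ex in dataset]
    return sum(scores) / len(scores)

# Toy stand-ins for two prompt formulations.
def prompt_a(example):
    return example["question"].upper()   # variant A: uppercases

def prompt_b(example):
    return example["question"]           # variant B: echoes

def lowercase_check(output):
    return {"score": 1.0 if output == output.lower() else 0.0}

dataset = [{"question": "hello"}, {"question": "world"}]
score_a = run_variant(dataset, prompt_a, lowercase_check)
score_b = run_variant(dataset, prompt_b, lowercase_check)
```

Because both variants see identical examples and an identical evaluator, any difference in mean score is attributable to the variant itself.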
Theoretical Basis
Experiment execution implements the systematic evaluation paradigm from machine learning operations, where model performance is measured through controlled, repeatable procedures.
The execution model can be formalized as:
Experiment(D, T, E) -> R
Where:
D = Dataset with examples {e_1, e_2, ..., e_n}
T = Task function (input -> output)
E = Set of evaluators {eval_1, eval_2, ..., eval_m}
R = RanExperiment containing:
- experiment_id: unique identifier
- dataset_id: reference to the source dataset
- dataset_version_id: pinned version for reproducibility
- task_runs: [{example_id, output, error, trace_id, ...}]
- evaluation_runs: [{run_id, evaluator_name, score, label, ...}]
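The record R can be modeled with small dataclasses whose fields mirror the listing above; the concrete types are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TaskRun:
    example_id: str
    output: Any
    error: Optional[str] = None
    trace_id: Optional[str] = None

@dataclass
class EvaluationRun:
    run_id: str
    evaluator_name: str
    score: Optional[float] = None
    label: Optional[str] = None

@dataclass
class RanExperiment:
    experiment_id: str
    dataset_id: str
    dataset_version_id: str  # pinned version for reproducibility
    task_runs: list = field(default_factory=list)
    evaluation_runs: list = field(default_factory=list)
```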
The execution flow for each example with repetitions and retries:
for each example e_i in D:
    for repetition r in 1..repetitions:
        for attempt in 1..retries:
            try:
                output = T(bind_parameters(e_i))
                record_run(experiment_id, e_i, r, output)
                break
            except RateLimitError:
                adaptive_backoff()
            except Exception:
                if attempt == retries:
                    record_error(experiment_id, e_i, r, error)

for each run in task_runs:
    example = lookup_example(run.example_id)
    for each evaluator eval_j in E:
        result = eval_j(bind_evaluator_params(run, example))
        record_evaluation(run.id, eval_j.name, result)
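Rendered as runnable Python, the task-execution flow looks roughly like this; `RateLimitError` and the fixed-sleep backoff are placeholders for the framework's adaptive behavior:

```python
import time

class RateLimitError(Exception):
    """Placeholder for a provider's rate limit exception."""

def execute_with_retries(examples, task, repetitions=1, retries=3, backoff_s=0.0):
    """Run `task` over each example `repetitions` times, retrying failures."""
    runs = []
    for example in examples:
        for repetition in range(1, repetitions + 1):
            for attempt in range(1, retries + 1):
                try:
                    output = task(example)
                    runs.append({"example": example, "repetition": repetition,
                                 "output": output, "error": None})
                    break
                except RateLimitError as exc:
                    time.sleep(backoff_s)  # stand-in for adaptive backoff
                    if attempt == retries:
                        runs.append({"example": example, "repetition": repetition,
                                     "output": None, "error": str(exc)})
                except Exception as exc:
                    if attempt == retries:
                        runs.append({"example": example, "repetition": repetition,
                                     "output": None, "error": str(exc)})
    return runs
```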
The dry run mechanism implements a sampling strategy:
if dry_run is true (boolean):
    sampled_examples = random_sample(D, k=1)
elif dry_run is an integer:
    sampled_examples = random_sample(D, k=dry_run)
else:
    sampled_examples = D
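The same sampling logic as a runnable sketch, accepting either a boolean or an integer for `dry_run`; the `seed` parameter is added here only to make the sketch deterministic:

```python
import random

def select_examples(dataset, dry_run=False, seed=None):
    """Pick the examples an experiment will actually run on."""
    rng = random.Random(seed)
    if dry_run is True:
        return rng.sample(dataset, k=1)  # one random example
    # Check bool first above: bool is a subclass of int in Python.
    if isinstance(dry_run, int) and not isinstance(dry_run, bool):
        return rng.sample(dataset, k=min(dry_run, len(dataset)))
    return list(dataset)  # full run
```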
The async execution variant introduces a concurrency parameter that controls the number of simultaneous task executions, enabling efficient utilization of I/O-bound resources (such as LLM API calls) while respecting rate limits and resource constraints.
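The concurrency cap can be sketched with an `asyncio.Semaphore`, where the `concurrency` parameter bounds the number of in-flight task executions:

```python
import asyncio

async def run_concurrently(examples, task, concurrency=3):
    """Apply an async task to every example, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(example):
        async with semaphore:
            return await task(example)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(e) for e in examples))
```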
The experiment record is immutable once created: task runs and evaluation runs are appended but never modified. This append-only model ensures that the complete execution history is preserved for audit and comparison purposes.
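A minimal illustration of the append-only model is a record that exposes append operations but no update or delete, and hands out copies so callers cannot mutate history:

```python
class AppendOnlyExperimentRecord:
    """Runs and evaluations can be added, never modified or removed."""
    def __init__(self, experiment_id):
        self.experiment_id = experiment_id
        self._task_runs = []
        self._evaluation_runs = []

    def append_task_run(self, run):
        self._task_runs.append(dict(run))

    def append_evaluation(self, evaluation):
        self._evaluation_runs.append(dict(evaluation))

    @property
    def task_runs(self):
        # Return copies so callers cannot mutate the stored history.
        return tuple(dict(r) for r in self._task_runs)
```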