Implementation: Arize AI Phoenix Run Experiment
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Experiment Execution, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the Phoenix Client library, for executing reproducible experiments: it orchestrates task execution across dataset examples with optional evaluation, retry logic, and result persistence.
Description
The run_experiment and async_run_experiment functions are the primary entry points for executing experiments in Phoenix. They orchestrate the complete experiment lifecycle: creating an experiment record on the server, executing the task function against each dataset example, optionally running evaluators on the results, and persisting everything to the Phoenix database.
Both functions are module-level convenience wrappers that create a Client (or AsyncClient) if one is not provided, then delegate to client.experiments.run_experiment(). This design allows experiments to be run with minimal setup while still supporting explicit client configuration for advanced use cases.
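The wrapper design described above can be sketched in plain Python. The classes below are stand-ins for illustration only, not Phoenix's actual internals:

```python
from typing import Any, Optional


class _StubExperiments:
    """Stand-in for client.experiments; records that delegation happened."""

    def run_experiment(self, **kwargs: Any) -> dict[str, Any]:
        return {"delegated": True, **kwargs}


class _StubClient:
    """Stand-in for phoenix.client.Client."""

    def __init__(self) -> None:
        self.experiments = _StubExperiments()


def run_experiment_wrapper(
    *, client: Optional[_StubClient] = None, **kwargs: Any
) -> dict[str, Any]:
    # Create a default client when none is supplied, then delegate.
    client = client or _StubClient()
    return client.experiments.run_experiment(**kwargs)


result = run_experiment_wrapper(experiment_name="demo")
```

The point of the pattern is that callers get a one-line entry point, while advanced users can still pass a preconfigured client.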
Key features include:
- Automatic retry: Failed task executions are retried up to a configurable number of times (default: 3).
- Rate limit handling: When rate_limit_errors is specified, the framework adaptively throttles task execution upon encountering those exception types.
- Dry run mode: When enabled, results are not persisted. Boolean True runs on 1 random example; an integer N runs on N random examples.
- Repetitions: Each example can be processed multiple times to measure output variance.
- Summary printing: By default, a summary of experiment and evaluation results is printed to stdout.
- Async concurrency: The async variant supports a concurrency parameter for parallel task execution.
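The retry behavior in the first bullet can be illustrated with a minimal loop. This is a sketch of the general technique, not Phoenix's actual implementation:

```python
def run_with_retries(task, retries=3):
    """Run task; retry up to `retries` additional times on any exception."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error


calls = {"n": 0}


def flaky():
    # Fails on the first two attempts, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky, retries=3)  # succeeds on the third attempt
```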
Usage
Use run_experiment when you need to systematically evaluate a task function against a dataset. Use the async variant when your task involves I/O-bound operations (such as LLM API calls) that benefit from concurrent execution.
Code Reference
Source Location
- Repository: Phoenix
- File: packages/phoenix-client/src/phoenix/client/experiments/__init__.py
- run_experiment: Lines 17-204
- async_run_experiment: Lines 207-400
Signature (Sync)
def run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment
Signature (Async)
async def async_run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    concurrency: int = 3,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["AsyncClient"] = None,
) -> RanExperiment
Import
from phoenix.client.experiments import run_experiment, async_run_experiment
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | The dataset on which to run the experiment. Obtained from client.datasets.get_dataset() or client.datasets.create_dataset(). |
| task | ExperimentTask | Yes | The task function to run on each example. Can be sync or async. Parameters are dynamically bound to example fields. |
| evaluators | Optional[ExperimentEvaluators] | No | Single evaluator, list of evaluators, or dict mapping names to evaluators. Applied to each task run after execution. Default: None. |
| experiment_name | Optional[str] | No | Human-readable name for the experiment. Default: None (auto-generated). |
| experiment_description | Optional[str] | No | Description of the experiment. Default: None. |
| experiment_metadata | Optional[Mapping[str, Any]] | No | Arbitrary metadata to associate with the experiment record. Default: None. |
| rate_limit_errors | Optional[RateLimitErrors] | No | Exception type or sequence of exception types to adaptively throttle on. Default: None. |
| dry_run | Union[bool, int] | No | If True, runs on 1 random example without persisting. If int, runs on that many random examples. Default: False. |
| print_summary | bool | No | Whether to print a summary of results to stdout. Default: True. |
| timeout | Optional[int] | No | Timeout for task execution in seconds. Default: 60. |
| repetitions | int | No | Number of times to run the task on each example. Default: 1. |
| retries | int | No | Number of retry attempts for failed task executions. Default: 3. |
| client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables. Default: None. |
| concurrency | int | No | (Async only) Number of concurrent task executions. Default: 3. |
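The "dynamically bound" behavior noted for task in the table above means the framework inspects which parameter names a task declares and passes only the matching example fields. The helper below is a sketch of that technique using inspect.signature, not Phoenix's actual binding code; the field names are illustrative:

```python
import inspect
from typing import Any, Callable


def call_with_bound_fields(task: Callable[..., Any], example: dict[str, Any]) -> Any:
    """Pass only the example fields that the task's signature declares."""
    params = inspect.signature(task).parameters
    kwargs = {name: example[name] for name in params if name in example}
    return task(**kwargs)


example = {
    "input": {"question": "2 + 2?"},
    "expected": {"answer": "4"},
    "metadata": {},
}


def task_input_only(input):
    # Declares only `input`, so only that field is bound.
    return f"Q: {input['question']}"


def task_with_expected(input, expected):
    # Declares `input` and `expected`; both fields are bound.
    return (input["question"], expected["answer"])
```

This is why the tasks in the examples below can declare as many or as few parameters as they need.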
Outputs
| Name | Type | Description |
|---|---|---|
| RanExperiment | RanExperiment (TypedDict) | A completed experiment record containing experiment_id, dataset_id, dataset_version_id, task_runs, evaluation_runs, experiment_metadata, and project_name. |
RanExperiment Structure
| Field | Type | Description |
|---|---|---|
| experiment_id | str | Unique identifier for the experiment. |
| dataset_id | str | ID of the dataset used. |
| dataset_version_id | str | Pinned version ID of the dataset for reproducibility. |
| task_runs | list[ExperimentRun] | List of task execution results, one per (example, repetition) pair. |
| evaluation_runs | list[ExperimentEvaluationRun] | List of evaluation results from all evaluators applied to all runs. |
| experiment_metadata | Mapping[str, Any] | Metadata associated with the experiment. |
| project_name | Optional[str] | Name of the Phoenix project for trace organization. |
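The structure above can be expressed as a TypedDict sketch. Field names follow the table; the run entry types are simplified here to plain dicts rather than Phoenix's ExperimentRun and ExperimentEvaluationRun types:

```python
from typing import Any, Mapping, Optional, TypedDict


class RanExperimentSketch(TypedDict):
    experiment_id: str
    dataset_id: str
    dataset_version_id: str
    task_runs: list[dict[str, Any]]        # simplified stand-in for ExperimentRun
    evaluation_runs: list[dict[str, Any]]  # simplified stand-in for ExperimentEvaluationRun
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]


# Example value with illustrative IDs:
ran: RanExperimentSketch = {
    "experiment_id": "exp-123",
    "dataset_id": "ds-1",
    "dataset_version_id": "dsv-1",
    "task_runs": [],
    "evaluation_runs": [],
    "experiment_metadata": {"model": "gpt-4"},
    "project_name": None,
}
```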
Usage Examples
Basic Experiment
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return f"The answer is: {input['question']}"

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="basic-experiment",
)

print(f"Experiment ID: {experiment['experiment_id']}")
print(f"Total runs: {len(experiment['task_runs'])}")
Experiment with Evaluators
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def has_content(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy, has_content],
    experiment_name="evaluated-experiment",
)
Experiment with Named Evaluators
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators={
        "accuracy": accuracy_evaluator,
        "relevance": relevance_evaluator,
        "fluency": fluency_evaluator,
    },
    experiment_name="multi-eval-experiment",
)
Dry Run for Development
from phoenix.client.experiments import run_experiment

# Run on 1 random example (results not persisted)
quick_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=True,
)

# Run on 5 random examples (results not persisted)
sample_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=5,
)
Experiment with Repetitions and Retries
import openai
from phoenix.client.experiments import run_experiment

def llm_task(input):
    response = openai.Client().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

# Run each example 3 times to measure variance, retry up to 5 times on failure
experiment = run_experiment(
    dataset=dataset,
    task=llm_task,
    evaluators=[accuracy],
    experiment_name="variance-experiment",
    repetitions=3,
    retries=5,
    rate_limit_errors=(openai.RateLimitError,),
    timeout=120,
)
Async Experiment with Concurrency
import openai
from phoenix.client.experiments import async_run_experiment

async_client = openai.AsyncOpenAI()

async def async_task(input):
    response = await async_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

experiment = await async_run_experiment(
    dataset=dataset,
    task=async_task,
    evaluators=[accuracy],
    experiment_name="async-experiment",
    concurrency=10,
    rate_limit_errors=(openai.RateLimitError,),
)
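The concurrency parameter bounds how many task executions are in flight at once. The general mechanism can be sketched with an asyncio.Semaphore; this is a sketch of the technique, not Phoenix's actual scheduler:

```python
import asyncio


async def run_all(factories, concurrency: int = 3):
    """Run coroutine factories with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(factory):
        async with sem:
            return await factory()

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(f) for f in factories))


async def main():
    async def work(i):
        await asyncio.sleep(0.01)  # simulate an I/O-bound call (e.g. an LLM API)
        return i * 2

    factories = [lambda i=i: work(i) for i in range(5)]
    return await run_all(factories, concurrency=2)


results = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8]
```

A semaphore-bounded gather is the standard way to cap parallelism for I/O-bound work without serializing it entirely.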
Experiment with Explicit Client
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

# Configure client with specific endpoint
client = Client(endpoint="https://phoenix.example.com")

experiment = run_experiment(
    client=client,
    dataset=dataset,
    task=my_task,
    experiment_name="remote-experiment",
    experiment_description="Testing against production Phoenix instance",
    experiment_metadata={"model": "gpt-4", "temperature": 0.7},
)
Using Dataset Splits
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()

# Run experiment only on the test split
test_dataset = client.datasets.get_dataset(
    dataset="qa-benchmark",
    splits=["test"],
)

experiment = run_experiment(
    dataset=test_dataset,
    task=my_task,
    evaluators=[accuracy],
    experiment_name="test-split-experiment",
)