Implementation: Arize AI Phoenix Run Experiment
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Experiment Execution, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the Phoenix Client library, for executing reproducible experiments: it orchestrates task execution across dataset examples with optional evaluation, retry logic, and result persistence.
Description
The run_experiment and async_run_experiment functions are the primary entry points for executing experiments in Phoenix. They orchestrate the complete experiment lifecycle: creating an experiment record on the server, executing the task function against each dataset example, optionally running evaluators on the results, and persisting everything to the Phoenix database.
Both functions are module-level convenience wrappers that create a Client (or AsyncClient) if one is not provided, then delegate to client.experiments.run_experiment(). This design allows experiments to be run with minimal setup while still supporting explicit client configuration for advanced use cases.
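The wrapper design described above can be sketched in plain Python. The classes below are stand-ins for illustration only, not Phoenix's actual internals:

```python
from typing import Any, Optional


class _StubExperiments:
    """Stand-in for client.experiments; records that delegation happened."""

    def run_experiment(self, **kwargs: Any) -> dict[str, Any]:
        return {"delegated": True, **kwargs}


class _StubClient:
    """Stand-in for phoenix.client.Client."""

    def __init__(self) -> None:
        self.experiments = _StubExperiments()


def run_experiment_wrapper(
    *, client: Optional[_StubClient] = None, **kwargs: Any
) -> dict[str, Any]:
    # Create a default client when none is supplied, then delegate.
    client = client or _StubClient()
    return client.experiments.run_experiment(**kwargs)


result = run_experiment_wrapper(experiment_name="demo")
```

The point of the pattern is that callers get a one-line entry point, while advanced users can still pass a preconfigured client.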
Key features include:
- Automatic retry: Failed task executions are retried up to a configurable number of times (default: 3).
- Rate limit handling: When rate_limit_errors is specified, the framework adaptively throttles task execution upon encountering those exception types.
- Dry run mode: When enabled, results are not persisted. Boolean True runs on 1 random example; an integer N runs on N random examples.
- Repetitions: Each example can be processed multiple times to measure output variance.
- Summary printing: By default, a summary of experiment and evaluation results is printed to stdout.
- Async concurrency: The async variant supports a concurrency parameter for parallel task execution.
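The retry behavior in the first bullet can be illustrated with a minimal loop. This is a sketch of the general technique, not Phoenix's actual implementation:

```python
def run_with_retries(task, retries=3):
    """Run task; retry up to `retries` additional times on any exception."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error


calls = {"n": 0}


def flaky():
    # Fails on the first two attempts, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky, retries=3)  # succeeds on the third attempt
```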
Usage
Use run_experiment when you need to systematically evaluate a task function against a dataset. Use the async variant when your task involves I/O-bound operations (such as LLM API calls) that benefit from concurrent execution.
Code Reference
Source Location
- Repository: Phoenix
- File: packages/phoenix-client/src/phoenix/client/experiments/__init__.py
- run_experiment: Lines 17-204
- async_run_experiment: Lines 207-400
Signature (Sync)
def run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment
Signature (Async)
async def async_run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    concurrency: int = 3,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["AsyncClient"] = None,
) -> RanExperiment
Import
from phoenix.client.experiments import run_experiment, async_run_experiment
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | The dataset on which to run the experiment. Obtained from client.datasets.get_dataset() or client.datasets.create_dataset(). |
| task | ExperimentTask | Yes | The task function to run on each example. Can be sync or async. Parameters are dynamically bound to example fields. |
| evaluators | Optional[ExperimentEvaluators] | No | Single evaluator, list of evaluators, or dict mapping names to evaluators. Applied to each task run after execution. Default: None. |
| experiment_name | Optional[str] | No | Human-readable name for the experiment. Default: None (auto-generated). |
| experiment_description | Optional[str] | No | Description of the experiment. Default: None. |
| experiment_metadata | Optional[Mapping[str, Any]] | No | Arbitrary metadata to associate with the experiment record. Default: None. |
| rate_limit_errors | Optional[RateLimitErrors] | No | Exception type or sequence of exception types to adaptively throttle on. Default: None. |
| dry_run | Union[bool, int] | No | If True, runs on 1 random example without persisting. If int, runs on that many random examples. Default: False. |
| print_summary | bool | No | Whether to print a summary of results to stdout. Default: True. |
| timeout | Optional[int] | No | Timeout for task execution in seconds. Default: 60. |
| repetitions | int | No | Number of times to run the task on each example. Default: 1. |
| retries | int | No | Number of retry attempts for failed task executions. Default: 3. |
| client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables. Default: None. |
| concurrency | int | No | (Async only) Number of concurrent task executions. Default: 3. |
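The "dynamically bound" behavior noted for task in the table above means the framework inspects which parameter names a task declares and passes only the matching example fields. The helper below is a sketch of that technique using inspect.signature, not Phoenix's actual binding code; the field names are illustrative:

```python
import inspect
from typing import Any, Callable


def call_with_bound_fields(task: Callable[..., Any], example: dict[str, Any]) -> Any:
    """Pass only the example fields that the task's signature declares."""
    params = inspect.signature(task).parameters
    kwargs = {name: example[name] for name in params if name in example}
    return task(**kwargs)


example = {
    "input": {"question": "2 + 2?"},
    "expected": {"answer": "4"},
    "metadata": {},
}


def task_input_only(input):
    # Declares only `input`, so only that field is bound.
    return f"Q: {input['question']}"


def task_with_expected(input, expected):
    # Declares `input` and `expected`; both fields are bound.
    return (input["question"], expected["answer"])
```

This is why the tasks in the examples below can declare as many or as few parameters as they need.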
Outputs
| Name | Type | Description |
|---|---|---|
| RanExperiment | RanExperiment (TypedDict) | A completed experiment record containing experiment_id, dataset_id, dataset_version_id, task_runs, evaluation_runs, experiment_metadata, and project_name. |
RanExperiment Structure
| Field | Type | Description |
|---|---|---|
| experiment_id | str | Unique identifier for the experiment. |
| dataset_id | str | ID of the dataset used. |
| dataset_version_id | str | Pinned version ID of the dataset for reproducibility. |
| task_runs | list[ExperimentRun] | List of task execution results, one per (example, repetition) pair. |
| evaluation_runs | list[ExperimentEvaluationRun] | List of evaluation results from all evaluators applied to all runs. |
| experiment_metadata | Mapping[str, Any] | Metadata associated with the experiment. |
| project_name | Optional[str] | Name of the Phoenix project for trace organization. |
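The structure above can be expressed as a TypedDict sketch. Field names follow the table; the run entry types are simplified here to plain dicts rather than Phoenix's ExperimentRun and ExperimentEvaluationRun types:

```python
from typing import Any, Mapping, Optional, TypedDict


class RanExperimentSketch(TypedDict):
    experiment_id: str
    dataset_id: str
    dataset_version_id: str
    task_runs: list[dict[str, Any]]        # simplified stand-in for ExperimentRun
    evaluation_runs: list[dict[str, Any]]  # simplified stand-in for ExperimentEvaluationRun
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]


# Example value with illustrative IDs:
ran: RanExperimentSketch = {
    "experiment_id": "exp-123",
    "dataset_id": "ds-1",
    "dataset_version_id": "dsv-1",
    "task_runs": [],
    "evaluation_runs": [],
    "experiment_metadata": {"model": "gpt-4"},
    "project_name": None,
}
```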
Usage Examples
Basic Experiment
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return f"The answer is: {input['question']}"

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="basic-experiment",
)

print(f"Experiment ID: {experiment['experiment_id']}")
print(f"Total runs: {len(experiment['task_runs'])}")
Experiment with Evaluators
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def has_content(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy, has_content],
    experiment_name="evaluated-experiment",
)
Experiment with Named Evaluators
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators={
        "accuracy": accuracy_evaluator,
        "relevance": relevance_evaluator,
        "fluency": fluency_evaluator,
    },
    experiment_name="multi-eval-experiment",
)
Dry Run for Development
from phoenix.client.experiments import run_experiment

# Run on 1 random example (results not persisted)
quick_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=True,
)

# Run on 5 random examples (results not persisted)
sample_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=5,
)
Experiment with Repetitions and Retries
import openai
from phoenix.client.experiments import run_experiment

def llm_task(input):
    response = openai.Client().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

# Run each example 3 times to measure variance, retry up to 5 times on failure
experiment = run_experiment(
    dataset=dataset,
    task=llm_task,
    evaluators=[accuracy],
    experiment_name="variance-experiment",
    repetitions=3,
    retries=5,
    rate_limit_errors=(openai.RateLimitError,),
    timeout=120,
)
Async Experiment with Concurrency
import openai
from phoenix.client.experiments import async_run_experiment

async_client = openai.AsyncOpenAI()

async def async_task(input):
    response = await async_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

experiment = await async_run_experiment(
    dataset=dataset,
    task=async_task,
    evaluators=[accuracy],
    experiment_name="async-experiment",
    concurrency=10,
    rate_limit_errors=(openai.RateLimitError,),
)
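The concurrency parameter bounds how many task executions are in flight at once. The general mechanism can be sketched with an asyncio.Semaphore; this is a sketch of the technique, not Phoenix's actual scheduler:

```python
import asyncio


async def run_all(factories, concurrency: int = 3):
    """Run coroutine factories with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(factory):
        async with sem:
            return await factory()

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(f) for f in factories))


async def main():
    async def work(i):
        await asyncio.sleep(0.01)  # simulate an I/O-bound call (e.g. an LLM API)
        return i * 2

    factories = [lambda i=i: work(i) for i in range(5)]
    return await run_all(factories, concurrency=2)


results = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8]
```

A semaphore-bounded gather is the standard way to cap parallelism for I/O-bound work without serializing it entirely.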
Experiment with Explicit Client
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

# Configure client with specific endpoint
client = Client(endpoint="https://phoenix.example.com")

experiment = run_experiment(
    client=client,
    dataset=dataset,
    task=my_task,
    experiment_name="remote-experiment",
    experiment_description="Testing against production Phoenix instance",
    experiment_metadata={"model": "gpt-4", "temperature": 0.7},
)
Using Dataset Splits
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()

# Run experiment only on the test split
test_dataset = client.datasets.get_dataset(
    dataset="qa-benchmark",
    splits=["test"],
)

experiment = run_experiment(
    dataset=test_dataset,
    task=my_task,
    evaluators=[accuracy],
    experiment_name="test-split-experiment",
)