
Implementation:Arize AI Phoenix Run Experiment

From Leeroopedia
Knowledge Sources
Domains AI Observability, Experiment Execution, Evaluation Infrastructure
Last Updated 2026-02-14 00:00 GMT

Overview

run_experiment is a concrete tool for executing reproducible experiments, provided by the Phoenix Client library. It orchestrates task execution across dataset examples with optional evaluation, retry logic, and result persistence.

Description

The run_experiment and async_run_experiment functions are the primary entry points for executing experiments in Phoenix. They orchestrate the complete experiment lifecycle: creating an experiment record on the server, executing the task function against each dataset example, optionally running evaluators on the results, and persisting everything to the Phoenix database.

Both functions are module-level convenience wrappers that create a Client (or AsyncClient) if one is not provided, then delegate to client.experiments.run_experiment(). This design allows experiments to be run with minimal setup while still supporting explicit client configuration for advanced use cases.
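The convenience-wrapper pattern described above can be sketched in a few lines. The classes and return values below are illustrative stand-ins, not Phoenix's actual code:

```python
# Illustrative sketch of the module-level wrapper pattern, not Phoenix internals.

class Experiments:
    def run_experiment(self, *, dataset, task, **kwargs):
        # In Phoenix, this is where the full experiment lifecycle runs.
        return {"experiment_id": "exp-1", "task_runs": [task(ex) for ex in dataset]}

class PhoenixClient:
    def __init__(self):
        # A real client would read endpoint/credentials from environment variables.
        self.experiments = Experiments()

def run_experiment(*, dataset, task, client=None, **kwargs):
    # Module-level wrapper: create a client if one was not provided,
    # then delegate to client.experiments.run_experiment().
    client = client or PhoenixClient()
    return client.experiments.run_experiment(dataset=dataset, task=task, **kwargs)
```

The wrapper keeps the common case (default client from environment) to a single call while still accepting an explicitly configured client.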

Key features include:

  • Automatic retry: Failed task executions are retried up to a configurable number of times (default: 3).
  • Rate limit handling: When rate_limit_errors is specified, the framework adaptively throttles task execution upon encountering those exception types.
  • Dry run mode: When enabled, results are not persisted. Boolean True runs on 1 random example; an integer N runs on N random examples.
  • Repetitions: Each example can be processed multiple times to measure output variance.
  • Summary printing: By default, a summary of experiment and evaluation results is printed to stdout.
  • Async concurrency: The async variant supports a concurrency parameter for parallel task execution.
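The interplay of dry-run sampling, repetitions, and retries can be sketched as a plain loop. This is a stdlib illustration of the behavior described above, not Phoenix's implementation:

```python
import random

def run_examples(examples, task, *, dry_run=False, repetitions=1, retries=3):
    """Illustrative orchestration loop (not Phoenix internals): sample examples
    for a dry run, repeat each example, and retry failed task calls."""
    if dry_run:
        # True samples 1 example; an integer N samples N examples.
        n = 1 if dry_run is True else int(dry_run)
        examples = random.sample(examples, min(n, len(examples)))
    runs = []
    for example in examples:
        for _ in range(repetitions):
            for attempt in range(retries + 1):
                try:
                    runs.append({"example": example, "output": task(example)})
                    break
                except Exception:
                    if attempt == retries:
                        # Out of retries: record the failure and move on.
                        runs.append({"example": example, "output": None})
    return runs
```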

Usage

Use run_experiment when you need to systematically evaluate a task function against a dataset. Use the async variant when your task involves I/O-bound operations (such as LLM API calls) that benefit from concurrent execution.

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-client/src/phoenix/client/experiments/__init__.py
  • run_experiment: Lines 17-204
  • async_run_experiment: Lines 207-400

Signature (Sync)

def run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment

Signature (Async)

async def async_run_experiment(
    *,
    dataset: Dataset,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    experiment_name: Optional[str] = None,
    experiment_description: Optional[str] = None,
    experiment_metadata: Optional[Mapping[str, Any]] = None,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    dry_run: Union[bool, int] = False,
    print_summary: bool = True,
    concurrency: int = 3,
    timeout: Optional[int] = 60,
    repetitions: int = 1,
    retries: int = 3,
    client: Optional["AsyncClient"] = None,
) -> RanExperiment

Import

from phoenix.client.experiments import run_experiment, async_run_experiment

I/O Contract

Inputs

  • dataset (Dataset, required): The dataset on which to run the experiment. Obtained from client.datasets.get_dataset() or client.datasets.create_dataset().
  • task (ExperimentTask, required): The task function to run on each example. Can be sync or async. Parameters are dynamically bound to example fields.
  • evaluators (Optional[ExperimentEvaluators]): A single evaluator, a list of evaluators, or a dict mapping names to evaluators. Applied to each task run after execution. Default: None.
  • experiment_name (Optional[str]): Human-readable name for the experiment. Default: None (auto-generated).
  • experiment_description (Optional[str]): Description of the experiment. Default: None.
  • experiment_metadata (Optional[Mapping[str, Any]]): Arbitrary metadata to associate with the experiment record. Default: None.
  • rate_limit_errors (Optional[RateLimitErrors]): Exception type or sequence of exception types to adaptively throttle on. Default: None.
  • dry_run (Union[bool, int]): If True, runs on 1 random example without persisting. If an int, runs on that many random examples. Default: False.
  • print_summary (bool): Whether to print a summary of results to stdout. Default: True.
  • timeout (Optional[int]): Timeout for task execution in seconds. Default: 60.
  • repetitions (int): Number of times to run the task on each example. Default: 1.
  • retries (int): Number of retry attempts for failed task executions. Default: 3.
  • client (Optional[Client]): Phoenix client instance. If None, a new client is created from environment variables. Default: None.
  • concurrency (int, async only): Number of concurrent task executions. Default: 3.
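The note that task parameters are "dynamically bound to example fields" can be illustrated with a stdlib sketch. The bindable names used here (input, expected, output) are taken from the examples on this page; the exact set of names Phoenix supports may differ, and this is not Phoenix's implementation:

```python
import inspect

def bind_and_call(fn, available_fields):
    """Call `fn` with only the arguments its signature asks for, drawn by
    name from `available_fields` (illustrative, not Phoenix's code)."""
    params = inspect.signature(fn).parameters
    kwargs = {name: available_fields[name] for name in params if name in available_fields}
    return fn(**kwargs)

# A task that only wants the example input:
def task_only_input(input):
    return input["question"]

# An evaluator that wants both the task output and the expected answer:
def eval_with_expected(output, expected):
    return output == expected["answer"]
```

Because binding is by parameter name, a task or evaluator declares only the fields it needs and ignores the rest.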

Outputs

  • RanExperiment (TypedDict): A completed experiment record containing experiment_id, dataset_id, dataset_version_id, task_runs, evaluation_runs, experiment_metadata, and project_name.

RanExperiment Structure

  • experiment_id (str): Unique identifier for the experiment.
  • dataset_id (str): ID of the dataset used.
  • dataset_version_id (str): Pinned version ID of the dataset for reproducibility.
  • task_runs (list[ExperimentRun]): List of task execution results, one per (example, repetition) pair.
  • evaluation_runs (list[ExperimentEvaluationRun]): List of evaluation results from all evaluators applied to all runs.
  • experiment_metadata (Mapping[str, Any]): Metadata associated with the experiment.
  • project_name (Optional[str]): Name of the Phoenix project for trace organization.

Usage Examples

Basic Experiment

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    return f"The answer is: {input['question']}"

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="basic-experiment",
)

print(f"Experiment ID: {experiment['experiment_id']}")
print(f"Total runs: {len(experiment['task_runs'])}")

Experiment with Evaluators

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

def my_task(input):
    # generate_answer is a placeholder for your own model-calling function
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def has_content(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy, has_content],
    experiment_name="evaluated-experiment",
)

Experiment with Named Evaluators

from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators={
        "accuracy": accuracy_evaluator,
        "relevance": relevance_evaluator,
        "fluency": fluency_evaluator,
    },
    experiment_name="multi-eval-experiment",
)

Dry Run for Development

from phoenix.client.experiments import run_experiment

# Run on 1 random example (results not persisted)
quick_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=True,
)

# Run on 5 random examples (results not persisted)
sample_test = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy],
    dry_run=5,
)

Experiment with Repetitions and Retries

import openai
from phoenix.client.experiments import run_experiment

# Create the OpenAI client once, outside the task, rather than per call
openai_client = openai.OpenAI()

def llm_task(input):
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

# Run each example 3 times to measure variance, retry up to 5 times on failure
experiment = run_experiment(
    dataset=dataset,
    task=llm_task,
    evaluators=[accuracy],
    experiment_name="variance-experiment",
    repetitions=3,
    retries=5,
    rate_limit_errors=(openai.RateLimitError,),
    timeout=120,
)

Async Experiment with Concurrency

import openai
from phoenix.client.experiments import async_run_experiment

async_client = openai.AsyncOpenAI()

async def async_task(input):
    response = await async_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content

# await requires an async context (e.g. an async function or an async-aware notebook)
experiment = await async_run_experiment(
    dataset=dataset,
    task=async_task,
    evaluators=[accuracy],
    experiment_name="async-experiment",
    concurrency=10,
    rate_limit_errors=(openai.RateLimitError,),
)
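The concurrency parameter bounds how many task executions are in flight at once. Conceptually this is the same as gating coroutines behind an asyncio.Semaphore; the sketch below illustrates the mechanism and is not Phoenix's implementation:

```python
import asyncio

async def run_with_concurrency(examples, task, *, concurrency=3):
    """Run `task` over `examples`, allowing at most `concurrency` calls
    in flight at a time (illustrative, not Phoenix internals)."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(example):
        async with semaphore:
            return await task(example)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(ex) for ex in examples))
```

Higher concurrency shortens wall-clock time for I/O-bound tasks such as LLM calls, at the cost of hitting provider rate limits sooner, which is why it pairs naturally with rate_limit_errors.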

Experiment with Explicit Client

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

# Configure client with specific endpoint
client = Client(endpoint="https://phoenix.example.com")

experiment = run_experiment(
    client=client,
    dataset=dataset,
    task=my_task,
    experiment_name="remote-experiment",
    experiment_description="Testing against production Phoenix instance",
    experiment_metadata={"model": "gpt-4", "temperature": 0.7},
)

Using Dataset Splits

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()

# Run experiment only on the test split
test_dataset = client.datasets.get_dataset(
    dataset="qa-benchmark",
    splits=["test"],
)

experiment = run_experiment(
    dataset=test_dataset,
    task=my_task,
    evaluators=[accuracy],
    experiment_name="test-split-experiment",
)
