Workflow:Arize AI Phoenix Dataset and Experiment Lifecycle

From Leeroopedia
Knowledge Sources
Domains: AI_Observability, Experimentation, LLM_Evaluation
Last Updated: 2026-02-14 06:00 GMT

Overview

End-to-end process for creating versioned datasets, running experiments with tasks and evaluators, and analyzing results in Phoenix.

Description

This workflow covers the complete dataset and experiment lifecycle in Phoenix. It begins with creating versioned datasets from various sources (DataFrames, exported spans, or manual construction), then runs experiments that execute a user-defined task function against each dataset example. Experiments are evaluated using one or more evaluators that produce scores, labels, and explanations. All results are persisted in Phoenix for comparison, analysis, and iterative improvement of LLM applications.

Key capabilities:

  • Versioned datasets with input/output/metadata columns and optional span associations
  • Flexible task functions that receive dataset examples and produce JSON-serializable output
  • Multiple evaluator types: functions, decorated evaluators, or phoenix-evals evaluators
  • Concurrent execution with configurable parallelism and rate limiting
  • Dry-run mode for testing experiments before full execution
  • Side-by-side comparison of experiment results in the Phoenix UI

Usage

Execute this workflow when you want to systematically test and compare different approaches to an LLM task. Common scenarios include: comparing prompt variations by running the same dataset through different prompts, benchmarking model changes (e.g., GPT-4 vs Claude), evaluating RAG pipeline modifications, or establishing baseline metrics for regression testing. This workflow requires a running Phoenix server and the arize-phoenix-client package.

Execution Steps

Step 1: Create a Dataset

Create a versioned dataset in Phoenix containing examples with input, output, and metadata fields. Datasets can be created from pandas DataFrames, from exported trace spans, or programmatically via the client API. Each example can optionally be linked to a source span via span_id.

Key considerations:

  • Specify input_keys, output_keys, and metadata_keys to map DataFrame columns to dataset fields
  • Use span_id_key to link examples back to their originating spans
  • Datasets are versioned: each modification creates a new version
  • Examples contain input (the task input), expected (reference output), and metadata (additional context)
  • Datasets can be retrieved by name or ID using client.datasets.get_dataset()
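As a sketch of Step 1, the example below assembles a small question-answering dataset as plain records and shows (commented out, since it needs a running Phoenix server) how an upload might look. The dataset name, column names, and the exact `upload_dataset` signature are illustrative assumptions and may differ across client versions.

```python
# Sketch: assembling examples for a Phoenix dataset.
# Each record maps onto input/output/metadata columns.
records = [
    {"question": "What is Phoenix?",
     "answer": "An LLM observability platform.",
     "topic": "overview"},
    {"question": "What does run_experiment do?",
     "answer": "Runs a task and evaluators over a dataset.",
     "topic": "experiments"},
]

# Hypothetical upload, assuming the Phoenix Python client
# (exact method names may differ in your client version):
#
# import pandas as pd
# import phoenix as px
#
# df = pd.DataFrame(records)
# dataset = px.Client().upload_dataset(
#     dataset_name="qa-examples",     # hypothetical dataset name
#     dataframe=df,
#     input_keys=["question"],        # columns -> example input
#     output_keys=["answer"],         # columns -> expected output
#     metadata_keys=["topic"],        # columns -> extra context
# )

# The key-to-column mapping the upload would use:
input_keys, output_keys, metadata_keys = ["question"], ["answer"], ["topic"]
```

Because each upload creates a new dataset version, re-running this with modified records would add a version rather than overwrite the original examples.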

Step 2: Define the Task Function

Write a task function that takes a dataset example and produces output. The function receives the example's input (and optionally expected output, metadata, or the full example object) and returns a JSON-serializable result. This function represents the LLM application behavior being tested.

Key considerations:

  • Single-argument tasks receive the example's input field
  • Multi-argument tasks can request input, expected, reference, metadata, or example by parameter name
  • The task function should be deterministic or controlled (e.g., temperature=0) for reproducible results
  • Both synchronous and asynchronous task functions are supported
  • The task output is stored and passed to evaluators
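The task shapes described above can be sketched with deterministic stubs. In a real experiment the body would call an LLM; here a canned response stands in so the parameter-binding behavior is visible. The function and field names are illustrative assumptions.

```python
# Sketch: single-argument task. Phoenix passes the example's input
# (here assumed to be a dict with a "question" field).
def answer_task(input):
    question = input["question"]
    # a real task would call a model here, e.g. with temperature=0
    # for reproducible results
    return {"answer": f"(stub) {question}"}

# Sketch: multi-argument task. Parameters are bound by name, so a task
# can also request expected, reference, metadata, or example.
def task_with_context(input, metadata):
    return {
        "answer": f"(stub) {input['question']}",
        "topic": metadata.get("topic"),
    }
```

Either form returns a JSON-serializable dict, which Phoenix stores as the run output and passes to each evaluator.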

Step 3: Define Evaluators

Create one or more evaluator functions that assess the quality of task outputs. Evaluators receive the task output and optionally the original input, expected output, and metadata. They return evaluation results as scores, labels, booleans, or structured EvaluationResult dictionaries.

Key considerations:

  • Evaluators can be plain functions, decorated with @create_evaluator, or phoenix-evals evaluators
  • Return types: bool (mapped to 0/1 score), float (score), str (label), tuple (score, explanation), or EvaluationResult dict
  • Evaluator parameter names determine what data they receive: output, input, expected, reference, metadata, example
  • Multiple evaluators can be passed as a list or a dict mapping names to functions
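The return-type conventions above can be sketched with plain-function evaluators. These are hypothetical examples assuming dict-shaped task outputs and expected values with an "answer" field; parameter names control what Phoenix injects.

```python
# Sketch: three evaluator shapes using the return types listed above.

def exact_match(output, expected):
    # bool -> mapped to a 0/1 score
    return output["answer"] == expected["answer"]

def answer_length(output):
    # float -> used directly as a score (capped at 1.0 for illustration)
    return min(len(output["answer"]) / 100.0, 1.0)

def verdict(output, expected):
    # tuple -> (score, explanation)
    ok = expected["answer"].lower() in output["answer"].lower()
    explanation = "expected answer found" if ok else "expected answer missing"
    return (1.0 if ok else 0.0, explanation)

# Passing evaluators as a dict names them in the experiment results:
evaluators = {
    "exact_match": exact_match,
    "length": answer_length,
    "verdict": verdict,
}
```

A `@create_evaluator`-decorated function or a phoenix-evals evaluator could be dropped into the same dict; the plain-function form is shown here because it is the simplest to test.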

Step 4: Run the Experiment

Execute the experiment using run_experiment(), which runs the task function against each dataset example, then runs all evaluators against each task output. Results are stored in Phoenix for analysis.

Key considerations:

  • Use experiment_name and experiment_description for organization
  • Set dry_run=True to test on a single random example, or dry_run=N for N examples
  • Configure repetitions to run each example multiple times (useful for non-deterministic tasks)
  • Set retries for automatic retry on transient failures
  • Use rate_limit_errors to specify exception types that trigger adaptive throttling
  • The function prints a summary table of results by default
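Wiring the options above together might look like the sketch below. The `run_experiment` call itself is commented out because it requires a running Phoenix server and a dataset; the keyword names follow the options described in this step but are assumptions about the client API, and the task/evaluator names are hypothetical.

```python
# Sketch: configuring an experiment run. Keyword names mirror the
# options described above; verify them against your client version.
experiment_config = dict(
    experiment_name="prompt-v2-baseline",                # hypothetical
    experiment_description="Prompt v2 vs. v1 on qa-examples",
    dry_run=3,        # smoke-test on 3 random examples before a full run
    repetitions=2,    # run each example twice for non-deterministic tasks
)

# Hypothetical invocation, assuming the Phoenix experiments module:
#
# from phoenix.experiments import run_experiment
#
# experiment = run_experiment(
#     dataset,                   # from Step 1
#     task=answer_task,          # from Step 2
#     evaluators=evaluators,     # from Step 3
#     **experiment_config,
# )
```

Starting with a small `dry_run` and then removing it for the full run keeps the same config object usable for both passes.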

Step 5: Analyze Experiment Results

Review experiment results in the Phoenix UI or via the client API. Compare results across multiple experiments to identify improvements or regressions. Use evaluation scores and explanations to understand where the task performs well or poorly.

Key considerations:

  • The RanExperiment object contains all experiment runs and evaluation results
  • Results are visible in the Phoenix UI under the dataset's experiments tab
  • Multiple experiments can be compared side-by-side on the same dataset
  • Filter and sort by evaluator scores to identify problematic examples
  • Use explanations from LLM-based evaluators to understand failure modes
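The kind of filtering and aggregation described above can be sketched client-side. The `runs` list below only mimics the shape of per-run evaluation records; the field names are illustrative, not the actual RanExperiment schema.

```python
# Sketch: summarizing evaluation scores from experiment runs.
# The record shape here is an illustrative stand-in for real results.
runs = [
    {"example_id": "ex1", "evaluations": {"exact_match": 1.0, "verdict": 1.0}},
    {"example_id": "ex2", "evaluations": {"exact_match": 0.0, "verdict": 1.0}},
    {"example_id": "ex3", "evaluations": {"exact_match": 0.0, "verdict": 0.0}},
]

def mean_scores(runs):
    """Average each evaluator's score across all runs."""
    totals, counts = {}, {}
    for run in runs:
        for name, score in run["evaluations"].items():
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

def failing_examples(runs, evaluator, threshold=0.5):
    """List example IDs scoring below threshold, to inspect in the UI."""
    return [r["example_id"] for r in runs
            if r["evaluations"][evaluator] < threshold]
```

Comparing `mean_scores` output across two experiments on the same dataset gives a quick regression signal before drilling into individual examples in the Phoenix UI.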

Execution Diagram

GitHub URL

Workflow Repository