Workflow:Arize AI Phoenix Dataset and Experiment Lifecycle

From Leeroopedia
Knowledge Sources
Domains: AI_Observability, Experimentation, LLM_Evaluation
Last Updated: 2026-02-14 06:00 GMT

Overview

End-to-end process for creating versioned datasets, running experiments with tasks and evaluators, and analyzing results in Phoenix.

Description

This workflow covers the complete dataset and experiment lifecycle in Phoenix. It begins with creating versioned datasets from various sources (DataFrames, exported spans, or manual construction), then runs experiments that execute a user-defined task function against each dataset example. Experiments are evaluated using one or more evaluators that produce scores, labels, and explanations. All results are persisted in Phoenix for comparison, analysis, and iterative improvement of LLM applications.

Key capabilities:

  • Versioned datasets with input/output/metadata columns and optional span associations
  • Flexible task functions that receive dataset examples and produce JSON-serializable output
  • Multiple evaluator types: functions, decorated evaluators, or phoenix-evals evaluators
  • Concurrent execution with configurable parallelism and rate limiting
  • Dry-run mode for testing experiments before full execution
  • Side-by-side comparison of experiment results in the Phoenix UI

Usage

Execute this workflow when you want to systematically test and compare different approaches to an LLM task. Common scenarios include: comparing prompt variations by running the same dataset through different prompts, benchmarking model changes (e.g., GPT-4 vs Claude), evaluating RAG pipeline modifications, or establishing baseline metrics for regression testing. This workflow requires a running Phoenix server and the arize-phoenix-client package.

Execution Steps

Step 1: Create a Dataset

Create a versioned dataset in Phoenix containing examples with input, output, and metadata fields. Datasets can be created from pandas DataFrames, from exported trace spans, or programmatically via the client API. Each example can optionally be linked to a source span via span_id.

Key considerations:

  • Specify input_keys, output_keys, and metadata_keys to map DataFrame columns to dataset fields
  • Use span_id_key to link examples back to their originating spans
  • Datasets are versioned: each modification creates a new version
  • Examples contain input (the task input), expected (reference output), and metadata (additional context)
  • Datasets can be retrieved by name or ID using client.datasets.get_dataset()
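As a sketch of Step 1, the example below assembles a small question-answering dataset as plain records and shows (commented out, since it needs a running Phoenix server) how an upload might look. The dataset name, column names, and the exact `upload_dataset` signature are illustrative assumptions and may differ across client versions.

```python
# Sketch: assembling examples for a Phoenix dataset.
# Each record maps onto input/output/metadata columns.
records = [
    {"question": "What is Phoenix?",
     "answer": "An LLM observability platform.",
     "topic": "overview"},
    {"question": "What does run_experiment do?",
     "answer": "Runs a task and evaluators over a dataset.",
     "topic": "experiments"},
]

# Hypothetical upload, assuming the Phoenix Python client
# (exact method names may differ in your client version):
#
# import pandas as pd
# import phoenix as px
#
# df = pd.DataFrame(records)
# dataset = px.Client().upload_dataset(
#     dataset_name="qa-examples",     # hypothetical dataset name
#     dataframe=df,
#     input_keys=["question"],        # columns -> example input
#     output_keys=["answer"],         # columns -> expected output
#     metadata_keys=["topic"],        # columns -> extra context
# )

# The key-to-column mapping the upload would use:
input_keys, output_keys, metadata_keys = ["question"], ["answer"], ["topic"]
```

Because each upload creates a new dataset version, re-running this with modified records would add a version rather than overwrite the original examples.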

Step 2: Define the Task Function

Write a task function that takes a dataset example and produces output. The function receives the example's input (and optionally expected output, metadata, or the full example object) and returns a JSON-serializable result. This function represents the LLM application behavior being tested.

Key considerations:

  • Single-argument tasks receive the example's input field
  • Multi-argument tasks can request input, expected, reference, metadata, or example by parameter name
  • The task function should be deterministic or controlled (e.g., temperature=0) for reproducible results
  • Both synchronous and asynchronous task functions are supported
  • The task output is stored and passed to evaluators
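The task shapes described above can be sketched with deterministic stubs. In a real experiment the body would call an LLM; here a canned response stands in so the parameter-binding behavior is visible. The function and field names are illustrative assumptions.

```python
# Sketch: single-argument task. Phoenix passes the example's input
# (here assumed to be a dict with a "question" field).
def answer_task(input):
    question = input["question"]
    # a real task would call a model here, e.g. with temperature=0
    # for reproducible results
    return {"answer": f"(stub) {question}"}

# Sketch: multi-argument task. Parameters are bound by name, so a task
# can also request expected, reference, metadata, or example.
def task_with_context(input, metadata):
    return {
        "answer": f"(stub) {input['question']}",
        "topic": metadata.get("topic"),
    }
```

Either form returns a JSON-serializable dict, which Phoenix stores as the run output and passes to each evaluator.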

Step 3: Define Evaluators

Create one or more evaluator functions that assess the quality of task outputs. Evaluators receive the task output and optionally the original input, expected output, and metadata. They return evaluation results as scores, labels, booleans, or structured EvaluationResult dictionaries.

Key considerations:

  • Evaluators can be plain functions, decorated with @create_evaluator, or phoenix-evals evaluators
  • Return types: bool (mapped to 0/1 score), float (score), str (label), tuple (score, explanation), or EvaluationResult dict
  • Evaluator parameter names determine what data they receive: output, input, expected, reference, metadata, example
  • Multiple evaluators can be passed as a list or a dict mapping names to functions
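The return-type conventions above can be sketched with plain-function evaluators. These are hypothetical examples assuming dict-shaped task outputs and expected values with an "answer" field; parameter names control what Phoenix injects.

```python
# Sketch: three evaluator shapes using the return types listed above.

def exact_match(output, expected):
    # bool -> mapped to a 0/1 score
    return output["answer"] == expected["answer"]

def answer_length(output):
    # float -> used directly as a score (capped at 1.0 for illustration)
    return min(len(output["answer"]) / 100.0, 1.0)

def verdict(output, expected):
    # tuple -> (score, explanation)
    ok = expected["answer"].lower() in output["answer"].lower()
    explanation = "expected answer found" if ok else "expected answer missing"
    return (1.0 if ok else 0.0, explanation)

# Passing evaluators as a dict names them in the experiment results:
evaluators = {
    "exact_match": exact_match,
    "length": answer_length,
    "verdict": verdict,
}
```

A `@create_evaluator`-decorated function or a phoenix-evals evaluator could be dropped into the same dict; the plain-function form is shown here because it is the simplest to test.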

Step 4: Run the Experiment

Execute the experiment using run_experiment(), which runs the task function against each dataset example, then runs all evaluators against each task output. Results are stored in Phoenix for analysis.

Key considerations:

  • Use experiment_name and experiment_description for organization
  • Set dry_run=True to test on a single random example, or dry_run=N for N examples
  • Configure repetitions to run each example multiple times (useful for non-deterministic tasks)
  • Set retries for automatic retry on transient failures
  • Use rate_limit_errors to specify exception types that trigger adaptive throttling
  • The function prints a summary table of results by default
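Wiring the options above together might look like the sketch below. The `run_experiment` call itself is commented out because it requires a running Phoenix server and a dataset; the keyword names follow the options described in this step but are assumptions about the client API, and the task/evaluator names are hypothetical.

```python
# Sketch: configuring an experiment run. Keyword names mirror the
# options described above; verify them against your client version.
experiment_config = dict(
    experiment_name="prompt-v2-baseline",                # hypothetical
    experiment_description="Prompt v2 vs. v1 on qa-examples",
    dry_run=3,        # smoke-test on 3 random examples before a full run
    repetitions=2,    # run each example twice for non-deterministic tasks
)

# Hypothetical invocation, assuming the Phoenix experiments module:
#
# from phoenix.experiments import run_experiment
#
# experiment = run_experiment(
#     dataset,                   # from Step 1
#     task=answer_task,          # from Step 2
#     evaluators=evaluators,     # from Step 3
#     **experiment_config,
# )
```

Starting with a small `dry_run` and then removing it for the full run keeps the same config object usable for both passes.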

Step 5: Analyze Experiment Results

Review experiment results in the Phoenix UI or via the client API. Compare results across multiple experiments to identify improvements or regressions. Use evaluation scores and explanations to understand where the task performs well or poorly.

Key considerations:

  • The RanExperiment object contains all experiment runs and evaluation results
  • Results are visible in the Phoenix UI under the dataset's experiments tab
  • Multiple experiments can be compared side-by-side on the same dataset
  • Filter and sort by evaluator scores to identify problematic examples
  • Use explanations from LLM-based evaluators to understand failure modes
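The kind of filtering and aggregation described above can be sketched client-side. The `runs` list below only mimics the shape of per-run evaluation records; the field names are illustrative, not the actual RanExperiment schema.

```python
# Sketch: summarizing evaluation scores from experiment runs.
# The record shape here is an illustrative stand-in for real results.
runs = [
    {"example_id": "ex1", "evaluations": {"exact_match": 1.0, "verdict": 1.0}},
    {"example_id": "ex2", "evaluations": {"exact_match": 0.0, "verdict": 1.0}},
    {"example_id": "ex3", "evaluations": {"exact_match": 0.0, "verdict": 0.0}},
]

def mean_scores(runs):
    """Average each evaluator's score across all runs."""
    totals, counts = {}, {}
    for run in runs:
        for name, score in run["evaluations"].items():
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

def failing_examples(runs, evaluator, threshold=0.5):
    """List example IDs scoring below threshold, to inspect in the UI."""
    return [r["example_id"] for r in runs
            if r["evaluations"][evaluator] < threshold]
```

Comparing `mean_scores` output across two experiments on the same dataset gives a quick regression signal before drilling into individual examples in the Phoenix UI.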

Execution Diagram

GitHub URL

Workflow Repository