Implementation:Arize ai Phoenix RanExperiment Analysis
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Experiment Analysis, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tools in the Phoenix Client library for analyzing experiment results: the RanExperiment object, experiment retrieval, retroactive evaluation, and resumption of interrupted experiments and evaluations.
Description
After an experiment has been executed via run_experiment, the results are captured in a RanExperiment TypedDict that serves as the central data structure for result analysis. The Phoenix Client provides several functions to work with completed experiments:
- RanExperiment object: A TypedDict containing the experiment ID, dataset references, task runs, evaluation runs, and metadata. This is the return type of both run_experiment and evaluate_experiment.
- get_experiment(): Retrieves a completed experiment by its ID, loading all task runs and evaluation runs. Enables retrospective analysis of previously executed experiments.
- evaluate_experiment(): Runs additional evaluators against a completed experiment without re-executing the task. This is the primary mechanism for iterative evaluation refinement.
- resume_experiment(): Resumes an interrupted experiment by identifying and re-running only the missing or failed (example, repetition) pairs.
- resume_evaluation(): Resumes incomplete evaluations for an experiment, running evaluators only for runs that are missing evaluations.
These functions collectively support the full analysis and iteration workflow: retrieve an experiment, inspect its results, apply new evaluators, and recover from interruptions.
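The resumption behavior described for resume_experiment can be pictured as a set difference over (example, repetition) pairs. The following is a minimal sketch, not the Phoenix implementation: `RunRecord`, `pending_pairs`, and the 1-based repetition numbering are assumptions made for illustration.

```python
from typing import Iterable, Optional, TypedDict


class RunRecord(TypedDict):
    # Hypothetical minimal shape of a recorded task run; not the library's type.
    example_id: str
    repetition: int
    error: Optional[str]


def pending_pairs(
    example_ids: Iterable[str],
    repetitions: int,
    runs: Iterable[RunRecord],
) -> list[tuple[str, int]]:
    """Return the (example_id, repetition) pairs that have no successful run."""
    succeeded = {
        (run["example_id"], run["repetition"])
        for run in runs
        if not run.get("error")
    }
    return [
        (example_id, rep)
        for example_id in example_ids
        for rep in range(1, repetitions + 1)  # assumption: repetitions count from 1
        if (example_id, rep) not in succeeded
    ]


runs: list[RunRecord] = [
    {"example_id": "ex-1", "repetition": 1, "error": None},
    {"example_id": "ex-1", "repetition": 2, "error": "timeout"},
    {"example_id": "ex-2", "repetition": 1, "error": None},
]
todo = pending_pairs(["ex-1", "ex-2"], repetitions=2, runs=runs)
print(todo)  # [('ex-1', 2), ('ex-2', 2)]
```

Here the failed second repetition of ex-1 and the never-attempted second repetition of ex-2 are the only pairs a resume would need to run.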
Usage
Use these tools after running an experiment to inspect results programmatically, add new evaluation criteria without re-running tasks, recover from interrupted experiments, and prepare data for cross-experiment comparison.
Code Reference
Source Location
- Repository: Phoenix
- RanExperiment type: packages/phoenix-client/src/phoenix/client/resources/experiments/types.py (lines 362-378)
- get_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 403-461)
- evaluate_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 684-797)
- resume_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 464-539)
- resume_evaluation: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 800-874)
Signatures
get_experiment:
def get_experiment(
    *,
    experiment_id: str,
    client: Optional["Client"] = None,
) -> RanExperiment
evaluate_experiment:
def evaluate_experiment(
    *,
    experiment: RanExperiment,
    evaluators: ExperimentEvaluators,
    dry_run: bool = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment
resume_experiment:
def resume_experiment(
    *,
    experiment_id: str,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None
resume_evaluation:
def resume_evaluation(
    *,
    experiment_id: str,
    evaluators: ExperimentEvaluators,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None
Import
from phoenix.client.experiments import (
    get_experiment,
    evaluate_experiment,
    resume_experiment,
    resume_evaluation,
)
I/O Contract
Inputs (get_experiment)
| Name | Type | Required | Description |
|---|---|---|---|
| experiment_id | str | Yes | The unique ID of the experiment to retrieve. |
| client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables. |
Inputs (evaluate_experiment)
| Name | Type | Required | Description |
|---|---|---|---|
| experiment | RanExperiment | Yes | The completed experiment to evaluate, as returned by run_experiment or get_experiment. |
| evaluators | ExperimentEvaluators | Yes | Single evaluator, list of evaluators, or dict mapping names to evaluators. |
| dry_run | bool | No | If True, evaluation results are not persisted. Default: False. |
| print_summary | bool | No | Whether to print a summary of evaluation results. Default: True. |
| timeout | Optional[int] | No | Timeout for evaluation execution in seconds. Default: 60. |
| rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to adaptively throttle on. Default: None. |
| retries | int | No | Number of retry attempts for failed evaluations. Default: 3. |
| client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables. |
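The evaluators parameter accepts three shapes: a single evaluator, a list, or a dict of names to evaluators. One way to reason about that is to coerce all three into a name-to-callable dict. The sketch below is an illustration only: `normalize_evaluators` is a hypothetical helper, and naming unnamed evaluators after their `__name__` is an assumption, not documented library behavior.

```python
from collections.abc import Mapping, Sequence
from typing import Callable, Union

Evaluator = Callable[..., object]
Evaluators = Union[Evaluator, Sequence[Evaluator], Mapping[str, Evaluator]]


def normalize_evaluators(evaluators: Evaluators) -> dict[str, Evaluator]:
    """Coerce a single callable, a sequence, or a mapping into name -> callable."""
    if isinstance(evaluators, Mapping):
        return dict(evaluators)  # names were given explicitly as dict keys
    if callable(evaluators):
        evaluators = [evaluators]  # wrap a single evaluator in a list
    named: dict[str, Evaluator] = {}
    for fn in evaluators:
        # Assumption: fall back to the function name when no key is supplied.
        name = getattr(fn, "__name__", type(fn).__name__)
        if name in named:
            raise ValueError(f"duplicate evaluator name: {name}")
        named[name] = fn
    return named


def conciseness(output):
    return 1.0 if len(str(output)) < 100 else 0.0


def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0


print(sorted(normalize_evaluators([conciseness, accuracy])))  # ['accuracy', 'conciseness']
```

Passing a dict keeps control over the names under which results are recorded, which matters when the same experiment is later resumed or re-evaluated by name.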
Inputs (resume_experiment)
| Name | Type | Required | Description |
|---|---|---|---|
| experiment_id | str | Yes | The ID of the experiment to resume. |
| task | ExperimentTask | Yes | The task function to run on incomplete examples. |
| evaluators | Optional[ExperimentEvaluators] | No | Optional evaluators to run on completed task runs. |
| print_summary | bool | No | Whether to print a summary. Default: True. |
| timeout | Optional[int] | No | Timeout for task execution in seconds. Default: 60. |
| rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to adaptively throttle on. Default: None. |
| retries | int | No | Number of retry attempts. Default: 3. |
| client | Optional[Client] | No | Phoenix client instance. |
Inputs (resume_evaluation)
| Name | Type | Required | Description |
|---|---|---|---|
| experiment_id | str | Yes | The ID of the experiment to resume evaluations for. |
| evaluators | ExperimentEvaluators | Yes | Evaluators to run. Names are matched to evaluator dict keys. |
| print_summary | bool | No | Whether to print a summary. Default: True. |
| timeout | Optional[int] | No | Timeout in seconds. Default: 60. |
| rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to throttle on. |
| retries | int | No | Number of retry attempts. Default: 3. |
| client | Optional[Client] | No | Phoenix client instance. |
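The idea of running evaluators only where an evaluation is missing can be sketched as a lookup over (run, evaluator name) pairs. This is a hedged illustration rather than the library's logic: `missing_evaluations` is a hypothetical helper, and the `experiment_run_id` and `name` keys are assumed field names.

```python
from typing import Iterable, Mapping


def missing_evaluations(
    run_ids: Iterable[str],
    evaluation_runs: Iterable[Mapping[str, str]],
    evaluator_names: Iterable[str],
) -> dict[str, list[str]]:
    """Map each evaluator name to the run IDs that lack an evaluation by it."""
    # Assumed keys: "experiment_run_id" and "name" on each evaluation record.
    done = {(ev["experiment_run_id"], ev["name"]) for ev in evaluation_runs}
    run_ids = list(run_ids)  # materialize so it can be scanned once per evaluator
    return {
        name: [rid for rid in run_ids if (rid, name) not in done]
        for name in evaluator_names
    }


evaluation_runs = [
    {"experiment_run_id": "run-1", "name": "accuracy"},
    {"experiment_run_id": "run-2", "name": "accuracy"},
    {"experiment_run_id": "run-1", "name": "relevance"},
]
gaps = missing_evaluations(
    ["run-1", "run-2", "run-3"], evaluation_runs, ["accuracy", "relevance"]
)
print(gaps)  # {'accuracy': ['run-3'], 'relevance': ['run-2', 'run-3']}
```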
Outputs
| Function | Return Type | Description |
|---|---|---|
| get_experiment | RanExperiment | The complete experiment record with all task runs and evaluation runs. |
| evaluate_experiment | RanExperiment | Updated experiment record including the new evaluation runs. |
| resume_experiment | None | Resumes in-place; modifies the experiment on the server. |
| resume_evaluation | None | Resumes in-place; modifies the experiment on the server. |
RanExperiment Structure
| Field | Type | Description |
|---|---|---|
| experiment_id | str | Unique identifier for the experiment. |
| dataset_id | str | ID of the source dataset. |
| dataset_version_id | str | Pinned version ID of the dataset. |
| task_runs | list[ExperimentRun] | List of task execution results with output, error, and trace information. |
| evaluation_runs | list[ExperimentEvaluationRun] | List of evaluation results with score, label, explanation, and evaluator metadata. |
| experiment_metadata | Mapping[str, Any] | Arbitrary metadata associated with the experiment. |
| project_name | Optional[str] | Phoenix project name for trace organization. |
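Because the result is a plain dict-like structure, summary statistics can be computed with ordinary Python. The sketch below averages scores per evaluator over a hand-built dict shaped like the table above. It assumes each evaluation run exposes flat `name` and `score` keys, which may not match the real ExperimentEvaluationRun layout; `mean_scores` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Mapping


def mean_scores(experiment: Mapping[str, Any]) -> dict[str, float]:
    """Average the numeric scores of each evaluator, skipping missing scores."""
    scores: dict[str, list[float]] = defaultdict(list)
    for ev in experiment["evaluation_runs"]:
        score = ev.get("score")  # assumed key; label-only evaluations have none
        if score is not None:
            scores[ev["name"]].append(float(score))  # assumed key: "name"
    return {name: mean(values) for name, values in scores.items()}


# Hand-built stand-in for a RanExperiment; all values are made up.
experiment = {
    "experiment_id": "exp-abc123",
    "dataset_id": "ds-1",
    "dataset_version_id": "dv-1",
    "task_runs": [],
    "evaluation_runs": [
        {"name": "accuracy", "score": 1.0},
        {"name": "accuracy", "score": 0.0},
        {"name": "conciseness", "score": 0.5},
        {"name": "conciseness", "score": None},
    ],
    "experiment_metadata": {},
    "project_name": None,
}
print(mean_scores(experiment))  # {'accuracy': 0.5, 'conciseness': 0.5}
```

Running the same helper over two experiments on the same dataset version gives per-evaluator means that can be compared side by side.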
Usage Examples
Retrieve and Inspect a Completed Experiment
from phoenix.client.experiments import get_experiment

# Load a previously completed experiment
experiment = get_experiment(experiment_id="exp-abc123")
print(f"Experiment: {experiment['experiment_id']}")
print(f"Dataset: {experiment['dataset_id']}")
print(f"Version: {experiment['dataset_version_id']}")
print(f"Task runs: {len(experiment['task_runs'])}")
print(f"Evaluation runs: {len(experiment['evaluation_runs'])}")

# Inspect individual task runs
for run in experiment["task_runs"]:
    if run.get("error"):
        print(f"  Run {run['id']}: ERROR - {run['error']}")
    else:
        print(f"  Run {run['id']}: output = {run.get('output')}")
Add New Evaluators to an Existing Experiment
from phoenix.client.experiments import get_experiment, evaluate_experiment

# Retrieve experiment
experiment = get_experiment(experiment_id="exp-abc123")

# Define new evaluators
def conciseness(output):
    """Score based on output brevity."""
    if not output:
        return 0.0
    length = len(str(output))
    return max(0.0, 1.0 - length / 1000.0)

def factual_overlap(output, expected):
    """Check if key terms from expected appear in output."""
    expected_terms = set(str(expected.get("answer", "")).lower().split())
    output_terms = set(str(output).lower().split())
    overlap = len(expected_terms & output_terms)
    total = len(expected_terms) if expected_terms else 1
    return overlap / total

# Apply new evaluators without re-running the task
evaluated = evaluate_experiment(
    experiment=experiment,
    evaluators={"conciseness": conciseness, "factual_overlap": factual_overlap},
)
print(f"Evaluation runs after: {len(evaluated['evaluation_runs'])}")
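The two evaluators above declare different parameters (`output` alone versus `output` and `expected`). One plausible way a runner could support that is to inspect each function's signature and pass only the arguments it names. The sketch below illustrates the technique with the standard library; it is not confirmed to be how Phoenix binds arguments, and `call_with_known_args` is a hypothetical helper.

```python
import inspect
from typing import Any, Callable, Mapping


def call_with_known_args(fn: Callable[..., Any], available: Mapping[str, Any]) -> Any:
    """Call fn with only the keyword arguments its signature declares."""
    params = inspect.signature(fn).parameters
    unknown = [name for name in params if name not in available]
    if unknown:
        raise TypeError(f"{fn.__name__} asks for unsupported arguments: {unknown}")
    return fn(**{name: available[name] for name in params})


def conciseness(output):
    return max(0.0, 1.0 - len(str(output)) / 1000.0)


def exact_match(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0


# Made-up values standing in for one example and its task output.
available = {
    "input": {"question": "Capital of France?"},
    "output": "Paris",
    "expected": {"answer": "Paris"},
}
match_score = call_with_known_args(exact_match, available)
brevity_score = call_with_known_args(conciseness, available)
print(match_score)  # 1.0
```

Binding by name lets simple evaluators stay simple: a function that only cares about the output never has to accept arguments it ignores.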
Dry Run Evaluation for Testing
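from phoenix.client.experiments import get_experiment, evaluate_experiment

experiment = get_experiment(experiment_id="exp-abc123")

# Hypothetical evaluator under test; substitute your own.
def new_evaluator(output):
    """Pass when the task produced a non-empty output."""
    return bool(output)

# Test evaluators without persisting results
test_result = evaluate_experiment(
    experiment=experiment,
    evaluators=[new_evaluator],
    dry_run=True,
    print_summary=True,
)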
Resume an Interrupted Experiment
from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call;
# define it before running this example.
def my_task(input):
    return generate_answer(input["question"])

# Re-run only missing or failed (example, repetition) pairs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    retries=5,
)
Resume with Evaluators
from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call;
# define it before running this example.
def my_task(input):
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Resume and also run evaluators on completed runs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    evaluators={"accuracy": accuracy},
)
Resume Incomplete Evaluations
from phoenix.client.experiments import resume_evaluation

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Run evaluators only on runs missing the "accuracy" evaluation
resume_evaluation(
    experiment_id="exp-abc123",
    evaluators={"accuracy": accuracy},
)
Complete Iterative Workflow
from phoenix.client import Client
from phoenix.client.experiments import (
    run_experiment,
    get_experiment,
    evaluate_experiment,
)

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

# Step 1: Run initial experiment with basic evaluator
# generate_answer is a placeholder for your own model or pipeline call;
# define it before running this example.
def my_task(input):
    return generate_answer(input["question"])

def basic_check(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[basic_check],
    experiment_name="iterative-experiment",
)

# Step 2: Analyze results and identify need for more evaluators
experiment_id = experiment["experiment_id"]

# Step 3: Define and apply additional evaluators
def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def relevance(output, input):
    question_words = set(input["question"].lower().split())
    output_words = set(str(output).lower().split())
    return len(question_words & output_words) / max(len(question_words), 1)

# Reload the experiment (or reuse the one returned by run_experiment)
experiment = get_experiment(experiment_id=experiment_id)

# Apply new evaluators without re-running the task
final = evaluate_experiment(
    experiment=experiment,
    evaluators={"accuracy": accuracy, "relevance": relevance},
)
print(f"Final evaluation runs: {len(final['evaluation_runs'])}")