Implementation:Arize ai Phoenix RanExperiment Analysis

Knowledge Sources	Phoenix
Domains	AI Observability, Experiment Analysis, Evaluation Infrastructure
Last Updated	2026-02-14 00:00 GMT

Overview

Concrete tools in the Phoenix Client library for analyzing experiment results: the RanExperiment object, experiment retrieval, retroactive evaluation, and experiment resumption.

Description

After an experiment has been executed via run_experiment, the results are captured in a RanExperiment TypedDict that serves as the central data structure for result analysis. The Phoenix Client provides the following for working with completed or interrupted experiments:

RanExperiment object: A TypedDict containing the experiment ID, dataset references, task runs, evaluation runs, and metadata. This is the return type of run_experiment, get_experiment, and evaluate_experiment.
get_experiment(): Retrieves a completed experiment by its ID, loading all task runs and evaluation runs. Enables retrospective analysis of previously executed experiments.
evaluate_experiment(): Runs additional evaluators against a completed experiment without re-executing the task. This is the primary mechanism for iterative evaluation refinement.
resume_experiment(): Resumes an interrupted experiment by identifying and re-running only the missing or failed (example, repetition) pairs.
resume_evaluation(): Resumes incomplete evaluations for an experiment, running evaluators only for runs that are missing evaluations.

These functions collectively support the full analysis and iteration workflow: retrieve an experiment, inspect its results, apply new evaluators, and recover from interruptions.
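
The resumption behavior described above can be pictured as a set difference over (example, repetition) pairs. The following is a minimal sketch on plain dictionaries, not the Phoenix implementation: the helper pairs_to_rerun and the run keys dataset_example_id, repetition_number, and error are assumptions made for illustration.

```python
from itertools import product
from typing import Any, Iterable, Mapping


def pairs_to_rerun(
    example_ids: Iterable[str],
    repetitions: int,
    task_runs: Iterable[Mapping[str, Any]],
) -> list[tuple[str, int]]:
    """Return the (example_id, repetition) pairs that have no successful run."""
    # A pair counts as done only if some run for it finished without an error.
    done = {
        (run["dataset_example_id"], run["repetition_number"])
        for run in task_runs
        if not run.get("error")
    }
    expected = product(example_ids, range(1, repetitions + 1))
    return [pair for pair in expected if pair not in done]


# Hypothetical run records: one failure and one pair that never ran.
runs = [
    {"dataset_example_id": "ex-1", "repetition_number": 1, "output": "ok"},
    {"dataset_example_id": "ex-1", "repetition_number": 2, "error": "timeout"},
    {"dataset_example_id": "ex-2", "repetition_number": 1, "output": "ok"},
]
todo = pairs_to_rerun(["ex-1", "ex-2"], repetitions=2, task_runs=runs)
print(todo)
```

In this sketch both the failed pair and the pair that was never attempted end up in the re-run list, which matches the "missing or failed" wording used for resume_experiment.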

Usage

Use these tools after running an experiment to inspect results programmatically, add new evaluation criteria without re-running tasks, recover from interrupted experiments, and prepare data for cross-experiment comparison.
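
One way to prepare data for cross-experiment comparison is to reduce each experiment's evaluation runs to a mean score per evaluator. The sketch below works on plain dictionaries; the helper mean_scores and the keys name and result["score"] are assumptions about the shape of an evaluation run and should be checked against the actual records.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Iterable, Mapping


def mean_scores(evaluation_runs: Iterable[Mapping[str, Any]]) -> dict[str, float]:
    """Average the numeric scores of each evaluator, skipping runs without one."""
    scores: dict[str, list[float]] = defaultdict(list)
    for run in evaluation_runs:
        result = run.get("result") or {}
        score = result.get("score")
        if score is not None:
            scores[run["name"]].append(float(score))
    return {name: mean(values) for name, values in scores.items()}


# Hypothetical evaluation runs from two experiments on the same dataset.
baseline = [
    {"name": "accuracy", "result": {"score": 1.0}},
    {"name": "accuracy", "result": {"score": 0.0}},
    {"name": "accuracy", "result": None},  # evaluator failed, no score recorded
]
candidate = [
    {"name": "accuracy", "result": {"score": 1.0}},
    {"name": "accuracy", "result": {"score": 1.0}},
]
delta = mean_scores(candidate)["accuracy"] - mean_scores(baseline)["accuracy"]
print(f"accuracy change: {delta:+.2f}")
```

Runs without a score are skipped rather than counted as zero, so a failed evaluator does not silently lower the average.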

Code Reference

Source Location

Repository: Phoenix
RanExperiment type: packages/phoenix-client/src/phoenix/client/resources/experiments/types.py (lines 362-378)
get_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 403-461)
evaluate_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 684-797)
resume_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 464-539)
resume_evaluation: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 800-874)

Signatures

get_experiment:

def get_experiment(
    *,
    experiment_id: str,
    client: Optional["Client"] = None,
) -> RanExperiment

evaluate_experiment:

def evaluate_experiment(
    *,
    experiment: RanExperiment,
    evaluators: ExperimentEvaluators,
    dry_run: bool = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment

resume_experiment:

def resume_experiment(
    *,
    experiment_id: str,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None

resume_evaluation:

def resume_evaluation(
    *,
    experiment_id: str,
    evaluators: ExperimentEvaluators,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None

Import

from phoenix.client.experiments import (
    get_experiment,
    evaluate_experiment,
    resume_experiment,
    resume_evaluation,
)

I/O Contract

Inputs (get_experiment)

Name	Type	Required	Description
experiment_id	str	Yes	The unique ID of the experiment to retrieve.
client	Optional[Client]	No	Phoenix client instance. If None, a new client is created from environment variables.

Inputs (evaluate_experiment)

Name	Type	Required	Description
experiment	RanExperiment	Yes	The completed experiment to evaluate, as returned by run_experiment or get_experiment.
evaluators	ExperimentEvaluators	Yes	Single evaluator, list of evaluators, or dict mapping names to evaluators.
dry_run	bool	No	If True, evaluation results are not persisted. Default: False.
print_summary	bool	No	Whether to print a summary of evaluation results. Default: True.
timeout	Optional[int]	No	Timeout for evaluation execution in seconds. Default: 60.
rate_limit_errors	Optional[RateLimitErrors]	No	Exception types to adaptively throttle on. Default: None.
retries	int	No	Number of retry attempts for failed evaluations. Default: 3.
client	Optional[Client]	No	Phoenix client instance. If None, a new client is created from environment variables.

Inputs (resume_experiment)

Name	Type	Required	Description
experiment_id	str	Yes	The ID of the experiment to resume.
task	ExperimentTask	Yes	The task function to run on incomplete examples.
evaluators	Optional[ExperimentEvaluators]	No	Optional evaluators to run on completed task runs.
print_summary	bool	No	Whether to print a summary. Default: True.
timeout	Optional[int]	No	Timeout for task execution in seconds. Default: 60.
rate_limit_errors	Optional[RateLimitErrors]	No	Exception types to adaptively throttle on. Default: None.
retries	int	No	Number of retry attempts. Default: 3.
client	Optional[Client]	No	Phoenix client instance. If None, a new client is created from environment variables.

Inputs (resume_evaluation)

Name	Type	Required	Description
experiment_id	str	Yes	The ID of the experiment to resume evaluations for.
evaluators	ExperimentEvaluators	Yes	Evaluators to run. When a dict is passed, its keys are the evaluator names used to determine which runs are missing an evaluation.
print_summary	bool	No	Whether to print a summary. Default: True.
timeout	Optional[int]	No	Timeout in seconds. Default: 60.
rate_limit_errors	Optional[RateLimitErrors]	No	Exception types to adaptively throttle on. Default: None.
retries	int	No	Number of retry attempts. Default: 3.
client	Optional[Client]	No	Phoenix client instance. If None, a new client is created from environment variables.

Outputs

Function	Return Type	Description
get_experiment	RanExperiment	The complete experiment record with all task runs and evaluation runs.
evaluate_experiment	RanExperiment	Updated experiment record including the new evaluation runs.
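resume_experiment	None	Returns nothing; missing or failed runs are re-executed and recorded on the server. Call get_experiment to load the updated record.
resume_evaluation	None	Returns nothing; missing evaluations are run and recorded on the server. Call get_experiment to load the updated record.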

RanExperiment Structure

Field	Type	Description
experiment_id	str	Unique identifier for the experiment.
dataset_id	str	ID of the source dataset.
dataset_version_id	str	Pinned version ID of the dataset.
task_runs	list[ExperimentRun]	List of task execution results with output, error, and trace information.
evaluation_runs	list[ExperimentEvaluationRun]	List of evaluation results with score, label, explanation, and evaluator metadata.
experiment_metadata	Mapping[str, Any]	Arbitrary metadata associated with the experiment.
project_name	Optional[str]	Phoenix project name for trace organization.
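
To make the table above concrete, the fields can be written out as a TypedDict. This is a simplified stand-in, not the library's definition: RanExperimentSketch and describe are hypothetical names, the run records are left as untyped mappings, and the sample values are invented.

```python
from typing import Any, Mapping, Optional, TypedDict


class RanExperimentSketch(TypedDict):
    """Illustrative stand-in for RanExperiment; run records are left untyped."""

    experiment_id: str
    dataset_id: str
    dataset_version_id: str
    task_runs: list[Mapping[str, Any]]
    evaluation_runs: list[Mapping[str, Any]]
    experiment_metadata: Mapping[str, Any]
    project_name: Optional[str]


def describe(experiment: RanExperimentSketch) -> str:
    """Build a one-line summary using only the fields listed in the table."""
    return (
        f"{experiment['experiment_id']}: "
        f"{len(experiment['task_runs'])} task runs, "
        f"{len(experiment['evaluation_runs'])} evaluation runs "
        f"(dataset {experiment['dataset_id']}@{experiment['dataset_version_id']})"
    )


# Invented sample record; a TypedDict is an ordinary dict at runtime.
sample: RanExperimentSketch = {
    "experiment_id": "exp-abc123",
    "dataset_id": "ds-1",
    "dataset_version_id": "v-1",
    "task_runs": [{"id": "run-1", "output": "42"}],
    "evaluation_runs": [],
    "experiment_metadata": {"model": "demo"},
    "project_name": None,
}
summary = describe(sample)
print(summary)
```

Because a TypedDict is a plain dict at runtime, fields are read with key access (experiment["task_runs"]) rather than attribute access.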

Usage Examples

Retrieve and Inspect a Completed Experiment

from phoenix.client.experiments import get_experiment

# Load a previously completed experiment
experiment = get_experiment(experiment_id="exp-abc123")

print(f"Experiment: {experiment['experiment_id']}")
print(f"Dataset: {experiment['dataset_id']}")
print(f"Version: {experiment['dataset_version_id']}")
print(f"Task runs: {len(experiment['task_runs'])}")
print(f"Evaluation runs: {len(experiment['evaluation_runs'])}")

# Inspect individual task runs
for run in experiment["task_runs"]:
    if run.get("error"):
        print(f"  Run {run['id']}: ERROR - {run['error']}")
    else:
        print(f"  Run {run['id']}: output = {run.get('output')}")
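
When there are many runs, printing each one is less useful than an aggregate view. The sketch below computes a failure rate and groups failures by error message. It operates on plain dictionaries with the same id, output, and error keys used in the loop above; error_summary is a hypothetical helper and the sample runs are invented.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def error_summary(task_runs: Iterable[Mapping[str, Any]]) -> tuple[float, Counter]:
    """Return the failure rate and a count of each distinct error message."""
    runs = list(task_runs)
    errors = Counter(str(run["error"]) for run in runs if run.get("error"))
    rate = sum(errors.values()) / len(runs) if runs else 0.0
    return rate, errors


# Invented task runs standing in for experiment["task_runs"].
runs = [
    {"id": "run-1", "output": "Paris"},
    {"id": "run-2", "error": "TimeoutError"},
    {"id": "run-3", "error": "TimeoutError"},
    {"id": "run-4", "error": "ValueError: bad input"},
]
rate, errors = error_summary(runs)
print(f"{rate:.0%} of runs failed")
for message, count in errors.most_common():
    print(f"  {count} x {message}")
```

A high share of one error type (for example timeouts) suggests adjusting timeout or retries before resuming, rather than changing the task itself.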

Add New Evaluators to an Existing Experiment

from phoenix.client.experiments import get_experiment, evaluate_experiment

# Retrieve experiment
experiment = get_experiment(experiment_id="exp-abc123")

# Define new evaluators
def conciseness(output):
    """Score based on output brevity."""
    if not output:
        return 0.0
    length = len(str(output))
    return max(0.0, 1.0 - length / 1000.0)

def factual_overlap(output, expected):
    """Check if key terms from expected appear in output."""
    expected_terms = set(str(expected.get("answer", "")).lower().split())
    output_terms = set(str(output).lower().split())
    overlap = len(expected_terms & output_terms)
    total = len(expected_terms) if expected_terms else 1
    return overlap / total

# Apply new evaluators without re-running the task
evaluated = evaluate_experiment(
    experiment=experiment,
    evaluators={"conciseness": conciseness, "factual_overlap": factual_overlap},
)

print(f"Evaluation runs after: {len(evaluated['evaluation_runs'])}")
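
Because evaluators are ordinary Python functions, they can be called directly on a few hand-written values before being applied to a stored experiment. The sketch below repeats the two evaluators from the example above so that it runs on its own; the sample strings and the cases list are made up for illustration.

```python
def conciseness(output):
    """Score based on output brevity (same logic as the evaluator above)."""
    if not output:
        return 0.0
    return max(0.0, 1.0 - len(str(output)) / 1000.0)


def factual_overlap(output, expected):
    """Fraction of expected answer terms that also appear in the output."""
    expected_terms = set(str(expected.get("answer", "")).lower().split())
    output_terms = set(str(output).lower().split())
    return len(expected_terms & output_terms) / max(len(expected_terms), 1)


# Hand-written cases: (evaluator, keyword arguments, score we expect)
cases = [
    (conciseness, {"output": ""}, 0.0),
    (conciseness, {"output": "x" * 250}, 0.75),
    (conciseness, {"output": "x" * 2000}, 0.0),
    (
        factual_overlap,
        {"output": "Paris is the capital", "expected": {"answer": "Paris France"}},
        0.5,
    ),
]
results = [fn(**kwargs) for fn, kwargs, _ in cases]
for (fn, _, want), got in zip(cases, results):
    print(f"{fn.__name__}: got {got}, expected {want}")
```

Checking boundary inputs such as an empty output or a very long one catches scoring mistakes before any results are written to the server.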

Dry Run Evaluation for Testing

from phoenix.client.experiments import get_experiment, evaluate_experiment

experiment = get_experiment(experiment_id="exp-abc123")

# Example evaluator under test (illustrative; substitute your own)
def new_evaluator(output):
    return 1.0 if output else 0.0

# Test evaluators without persisting results
test_result = evaluate_experiment(
    experiment=experiment,
    evaluators=[new_evaluator],
    dry_run=True,
    print_summary=True,
)

Resume an Interrupted Experiment

from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

# Re-run only missing or failed (example, repetition) pairs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    retries=5,
)

Resume with Evaluators

from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Resume and also run evaluators on completed runs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    evaluators={"accuracy": accuracy},
)

Resume Incomplete Evaluations

from phoenix.client.experiments import resume_evaluation

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Run evaluators only on runs missing the "accuracy" evaluation
resume_evaluation(
    experiment_id="exp-abc123",
    evaluators={"accuracy": accuracy},
)
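
The idea of running evaluators only where an evaluation is missing can be sketched as a lookup keyed by evaluator name and run ID. This is an illustration on plain dictionaries, not the Phoenix implementation: runs_missing_evaluation is a hypothetical helper, the keys name and experiment_run_id are assumed, and any recorded evaluation is treated as present regardless of whether it succeeded.

```python
from typing import Any, Iterable, Mapping


def runs_missing_evaluation(
    task_runs: Iterable[Mapping[str, Any]],
    evaluation_runs: Iterable[Mapping[str, Any]],
    evaluator_names: Iterable[str],
) -> dict[str, list[str]]:
    """Map each evaluator name to the task run IDs it has not scored yet."""
    scored = {(ev["name"], ev["experiment_run_id"]) for ev in evaluation_runs}
    run_ids = [run["id"] for run in task_runs]
    return {
        name: [run_id for run_id in run_ids if (name, run_id) not in scored]
        for name in evaluator_names
    }


# Invented records: "accuracy" skipped run-2, "relevance" only covered run-1.
task_runs = [{"id": "run-1"}, {"id": "run-2"}, {"id": "run-3"}]
evaluation_runs = [
    {"name": "accuracy", "experiment_run_id": "run-1"},
    {"name": "accuracy", "experiment_run_id": "run-3"},
    {"name": "relevance", "experiment_run_id": "run-1"},
]
missing = runs_missing_evaluation(task_runs, evaluation_runs, ["accuracy", "relevance"])
print(missing)
```

Keying the lookup on the evaluator name is why consistent names matter: renaming an evaluator between calls would make every run look unevaluated under the new name.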

Complete Iterative Workflow

from phoenix.client import Client
from phoenix.client.experiments import (
    run_experiment,
    get_experiment,
    evaluate_experiment,
)

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

# Step 1: Run initial experiment with basic evaluator
# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

def basic_check(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[basic_check],
    experiment_name="iterative-experiment",
)

# Step 2: Analyze results and identify need for more evaluators
experiment_id = experiment["experiment_id"]

# Step 3: Define and apply additional evaluators
def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def relevance(output, input):
    question_words = set(input["question"].lower().split())
    output_words = set(str(output).lower().split())
    return len(question_words & output_words) / max(len(question_words), 1)

# Re-load experiment (or use the one from run_experiment)
experiment = get_experiment(experiment_id=experiment_id)

# Apply new evaluators without re-running the task
final = evaluate_experiment(
    experiment=experiment,
    evaluators={"accuracy": accuracy, "relevance": relevance},
)

print(f"Final evaluation runs: {len(final['evaluation_runs'])}")

Related Pages

Implements Principle

Principle:Arize_ai_Phoenix_Experiment_Result_Analysis
