Implementation:Arize ai Phoenix RanExperiment Analysis

From Leeroopedia
Knowledge Sources
Domains: AI Observability, Experiment Analysis, Evaluation Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

Concrete tools in the Phoenix Client library for analyzing experiment results: the RanExperiment object, experiment retrieval, retroactive evaluation, and resumption of interrupted experiments and evaluations.

Description

After an experiment has been executed via run_experiment, the results are captured in a RanExperiment TypedDict, which serves as the central data structure for result analysis. The Phoenix Client provides the following tools for working with experiment results:

  • RanExperiment object: A TypedDict containing the experiment ID, dataset references, task runs, evaluation runs, and metadata. It is the return type of run_experiment, get_experiment, and evaluate_experiment.
  • get_experiment(): Retrieves a completed experiment by its ID, loading all task runs and evaluation runs. Enables retrospective analysis of previously executed experiments.
  • evaluate_experiment(): Runs additional evaluators against a completed experiment without re-executing the task. This is the primary mechanism for iterative evaluation refinement.
  • resume_experiment(): Resumes an interrupted experiment by identifying and re-running only the missing or failed (example, repetition) pairs.
  • resume_evaluation(): Resumes incomplete evaluations for an experiment, running evaluators only for runs that are missing evaluations.

These functions collectively support the full analysis and iteration workflow: retrieve an experiment, inspect its results, apply new evaluators, and recover from interruptions.
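
The resumption behavior described above can be pictured as a set difference over (example, repetition) pairs. The following is a minimal sketch of that idea, not Phoenix's actual implementation: the function name find_incomplete_pairs and the run keys dataset_example_id and repetition_number are assumptions made for illustration.

```python
from itertools import product
from typing import Any, Iterable, List, Mapping, Tuple

Pair = Tuple[str, int]


def find_incomplete_pairs(
    example_ids: Iterable[str],
    repetitions: int,
    task_runs: Iterable[Mapping[str, Any]],
) -> List[Pair]:
    """Return the (example_id, repetition) pairs that have no successful run."""
    # A pair counts as done only if some run for it finished without an error.
    # The key names used here are assumed, not taken from the Phoenix source.
    done = {
        (run["dataset_example_id"], run["repetition_number"])
        for run in task_runs
        if not run.get("error")
    }
    # Every example is expected once per repetition (1-based in this sketch).
    expected = product(example_ids, range(1, repetitions + 1))
    return [pair for pair in expected if pair not in done]


runs = [
    {"dataset_example_id": "ex-1", "repetition_number": 1, "error": None},
    {"dataset_example_id": "ex-1", "repetition_number": 2, "error": "timeout"},
    {"dataset_example_id": "ex-2", "repetition_number": 1, "error": None},
]
pending = find_incomplete_pairs(["ex-1", "ex-2"], 2, runs)
print(pending)
```

In this sample, the failed second repetition of ex-1 and the never-started second repetition of ex-2 are the only pairs left to run.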

Usage

Use these tools after running an experiment to inspect results programmatically, add new evaluation criteria without re-running tasks, recover from interrupted experiments, and prepare data for cross-experiment comparison.
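
As a small illustration of programmatic inspection, the sketch below summarizes the task runs of a RanExperiment-shaped dictionary. The helper name summarize_task_runs is hypothetical; it relies only on the id and error keys of each task run, and it is run here against hand-written sample data rather than a live Phoenix server.

```python
from typing import Any, Dict, Mapping


def summarize_task_runs(experiment: Mapping[str, Any]) -> Dict[str, Any]:
    """Count total and failed task runs in a RanExperiment-like mapping."""
    runs = experiment["task_runs"]
    # A run is treated as failed when its "error" field is set and non-empty.
    failed_ids = [run["id"] for run in runs if run.get("error")]
    total = len(runs)
    return {
        "total": total,
        "failed": len(failed_ids),
        "failed_run_ids": failed_ids,
        "error_rate": len(failed_ids) / total if total else 0.0,
    }


# Hand-written sample data standing in for the result of get_experiment().
sample = {
    "experiment_id": "exp-abc123",
    "task_runs": [
        {"id": "run-1", "output": "Paris", "error": None},
        {"id": "run-2", "output": None, "error": "TimeoutError"},
    ],
}
summary = summarize_task_runs(sample)
print(summary)
```

A non-zero error rate is a reasonable signal to call resume_experiment before adding further evaluators.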

Code Reference

Source Location

  • Repository: Phoenix
  • RanExperiment type: packages/phoenix-client/src/phoenix/client/resources/experiments/types.py (lines 362-378)
  • get_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 403-461)
  • evaluate_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 684-797)
  • resume_experiment: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 464-539)
  • resume_evaluation: packages/phoenix-client/src/phoenix/client/experiments/__init__.py (lines 800-874)

Signatures

get_experiment:

def get_experiment(
    *,
    experiment_id: str,
    client: Optional["Client"] = None,
) -> RanExperiment

evaluate_experiment:

def evaluate_experiment(
    *,
    experiment: RanExperiment,
    evaluators: ExperimentEvaluators,
    dry_run: bool = False,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> RanExperiment

resume_experiment:

def resume_experiment(
    *,
    experiment_id: str,
    task: ExperimentTask,
    evaluators: Optional[ExperimentEvaluators] = None,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None

resume_evaluation:

def resume_evaluation(
    *,
    experiment_id: str,
    evaluators: ExperimentEvaluators,
    print_summary: bool = True,
    timeout: Optional[int] = 60,
    rate_limit_errors: Optional[RateLimitErrors] = None,
    retries: int = 3,
    client: Optional["Client"] = None,
) -> None

Import

from phoenix.client.experiments import (
    get_experiment,
    evaluate_experiment,
    resume_experiment,
    resume_evaluation,
)

I/O Contract

Inputs (get_experiment)

Name | Type | Required | Description
experiment_id | str | Yes | The unique ID of the experiment to retrieve.
client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables.

Inputs (evaluate_experiment)

Name | Type | Required | Description
experiment | RanExperiment | Yes | The completed experiment to evaluate, as returned by run_experiment or get_experiment.
evaluators | ExperimentEvaluators | Yes | Single evaluator, list of evaluators, or dict mapping names to evaluators.
dry_run | bool | No | If True, evaluation results are not persisted. Default: False.
print_summary | bool | No | Whether to print a summary of evaluation results. Default: True.
timeout | Optional[int] | No | Timeout for evaluation execution in seconds. Default: 60.
rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to adaptively throttle on. Default: None.
retries | int | No | Number of retry attempts for failed evaluations. Default: 3.
client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables.
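
The evaluators argument accepts three shapes: a single evaluator, a list, or a dict of names to evaluators. The sketch below shows one way such input could be normalized into a name-to-callable dict. It is an illustration only: the helper name_evaluators is hypothetical, and deriving names from function names is an assumption, not confirmed Phoenix behavior.

```python
from typing import Any, Callable, Dict, Mapping, Sequence, Union

Evaluator = Callable[..., Any]
Evaluators = Union[Evaluator, Sequence[Evaluator], Mapping[str, Evaluator]]


def name_evaluators(evaluators: Evaluators) -> Dict[str, Evaluator]:
    """Normalize the three accepted shapes into a {name: evaluator} dict."""
    if isinstance(evaluators, Mapping):
        # Explicit names win: the dict keys are used as given.
        return dict(evaluators)
    if callable(evaluators):
        # A single evaluator is treated as a one-element list.
        evaluators = [evaluators]
    # Assumption: unnamed evaluators fall back to their function name.
    return {
        getattr(fn, "__name__", f"evaluator_{i}"): fn
        for i, fn in enumerate(evaluators)
    }


def is_nonempty(output):
    return bool(output)


def is_short(output):
    return len(str(output)) < 100


print(list(name_evaluators(is_nonempty)))
print(list(name_evaluators([is_nonempty, is_short])))
print(list(name_evaluators({"brevity": is_short})))
```

Passing a dict is the most explicit option, since the chosen keys stay stable even if the underlying functions are renamed.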

Inputs (resume_experiment)

Name | Type | Required | Description
experiment_id | str | Yes | The ID of the experiment to resume.
task | ExperimentTask | Yes | The task function to run on incomplete examples.
evaluators | Optional[ExperimentEvaluators] | No | Optional evaluators to run on completed task runs. Default: None.
print_summary | bool | No | Whether to print a summary. Default: True.
timeout | Optional[int] | No | Timeout for task execution in seconds. Default: 60.
rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to adaptively throttle on. Default: None.
retries | int | No | Number of retry attempts. Default: 3.
client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables.

Inputs (resume_evaluation)

Name | Type | Required | Description
experiment_id | str | Yes | The ID of the experiment to resume evaluations for.
evaluators | ExperimentEvaluators | Yes | Evaluators to run. When a dict is passed, its keys serve as the evaluator names.
print_summary | bool | No | Whether to print a summary. Default: True.
timeout | Optional[int] | No | Timeout for evaluation execution in seconds. Default: 60.
rate_limit_errors | Optional[RateLimitErrors] | No | Exception types to adaptively throttle on. Default: None.
retries | int | No | Number of retry attempts. Default: 3.
client | Optional[Client] | No | Phoenix client instance. If None, a new client is created from environment variables.

Outputs

Function | Return Type | Description
get_experiment | RanExperiment | The complete experiment record with all task runs and evaluation runs.
evaluate_experiment | RanExperiment | Updated experiment record including the new evaluation runs.
resume_experiment | None | Resumes in-place; modifies the experiment on the server.
resume_evaluation | None | Resumes in-place; modifies the experiment on the server.
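
Because the two resume functions return None, inspecting the outcome requires fetching the experiment again afterwards. The sketch below captures that resume-then-reload pattern with injected callables, so it can be run without a Phoenix server. The helper resume_then_reload and the fake_* stand-ins are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, Mapping


def resume_then_reload(
    experiment_id: str,
    resume: Callable[[str], None],
    reload: Callable[[str], Mapping[str, Any]],
) -> Mapping[str, Any]:
    """Run a resume step that returns None, then fetch the updated record."""
    resume(experiment_id)
    return reload(experiment_id)


# In-memory stand-in for server-side state, used only for this demonstration.
_server: Dict[str, Dict[str, Any]] = {
    "exp-abc123": {"experiment_id": "exp-abc123", "task_runs": []}
}


def fake_resume(experiment_id: str) -> None:
    # Simulates the server gaining a task run while the experiment resumes.
    _server[experiment_id]["task_runs"].append({"id": "run-1", "output": "ok"})


def fake_reload(experiment_id: str) -> Mapping[str, Any]:
    return _server[experiment_id]


refreshed = resume_then_reload("exp-abc123", fake_resume, fake_reload)
print(len(refreshed["task_runs"]))
```

With Phoenix, the two callables would wrap the documented functions, for example lambda eid: resume_experiment(experiment_id=eid, task=my_task) and lambda eid: get_experiment(experiment_id=eid).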

RanExperiment Structure

Field | Type | Description
experiment_id | str | Unique identifier for the experiment.
dataset_id | str | ID of the source dataset.
dataset_version_id | str | Pinned version ID of the dataset.
task_runs | list[ExperimentRun] | List of task execution results with output, error, and trace information.
evaluation_runs | list[ExperimentEvaluationRun] | List of evaluation results with score, label, explanation, and evaluator metadata.
experiment_metadata | Mapping[str, Any] | Arbitrary metadata associated with the experiment.
project_name | Optional[str] | Phoenix project name for trace organization.
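
To show how this structure might be consumed, the sketch below averages evaluation scores per evaluator. The helper mean_scores is hypothetical, and the layout of each evaluation run (a name key plus a result mapping holding score) is an assumption; check the ExperimentEvaluationRun type in your installed version before relying on these key names.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Mapping


def mean_scores(experiment: Mapping[str, Any]) -> Dict[str, float]:
    """Average the numeric scores of evaluation runs, grouped by evaluator name."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for evaluation in experiment["evaluation_runs"]:
        # Assumed layout: {"name": ..., "result": {"score": ...}}.
        result = evaluation.get("result") or {}
        score = result.get("score")
        if score is not None:
            scores[evaluation["name"]].append(float(score))
    return {name: mean(values) for name, values in scores.items()}


# Hand-written sample data; the last entry has no result and is skipped.
sample = {
    "evaluation_runs": [
        {"name": "accuracy", "result": {"score": 1.0}},
        {"name": "accuracy", "result": {"score": 0.0}},
        {"name": "conciseness", "result": {"score": 0.5}},
        {"name": "conciseness", "result": None},
    ]
}
averages = mean_scores(sample)
print(averages)
```

Evaluation runs without a score are ignored rather than counted as zero, so a failed evaluation does not drag the average down.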

Usage Examples

Retrieve and Inspect a Completed Experiment

from phoenix.client.experiments import get_experiment

# Load a previously completed experiment
experiment = get_experiment(experiment_id="exp-abc123")

print(f"Experiment: {experiment['experiment_id']}")
print(f"Dataset: {experiment['dataset_id']}")
print(f"Version: {experiment['dataset_version_id']}")
print(f"Task runs: {len(experiment['task_runs'])}")
print(f"Evaluation runs: {len(experiment['evaluation_runs'])}")

# Inspect individual task runs
for run in experiment["task_runs"]:
    if run.get("error"):
        print(f"  Run {run['id']}: ERROR - {run['error']}")
    else:
        print(f"  Run {run['id']}: output = {run.get('output')}")

Add New Evaluators to an Existing Experiment

from phoenix.client.experiments import get_experiment, evaluate_experiment

# Retrieve experiment
experiment = get_experiment(experiment_id="exp-abc123")

# Define new evaluators
def conciseness(output):
    """Score based on output brevity."""
    if not output:
        return 0.0
    length = len(str(output))
    return max(0.0, 1.0 - length / 1000.0)

def factual_overlap(output, expected):
    """Check if key terms from expected appear in output."""
    expected_terms = set(str(expected.get("answer", "")).lower().split())
    output_terms = set(str(output).lower().split())
    overlap = len(expected_terms & output_terms)
    total = len(expected_terms) if expected_terms else 1
    return overlap / total

# Apply new evaluators without re-running the task
evaluated = evaluate_experiment(
    experiment=experiment,
    evaluators={"conciseness": conciseness, "factual_overlap": factual_overlap},
)

print(f"Evaluation runs after: {len(evaluated['evaluation_runs'])}")

Dry Run Evaluation for Testing

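from phoenix.client.experiments import get_experiment, evaluate_experiment

experiment = get_experiment(experiment_id="exp-abc123")

# Example evaluator under test; substitute your own
def new_evaluator(output):
    """Return True when the task produced a non-empty output."""
    return bool(output)

# Test evaluators without persisting results
test_result = evaluate_experiment(
    experiment=experiment,
    evaluators=[new_evaluator],
    dry_run=True,
    print_summary=True,
)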

Resume an Interrupted Experiment

from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

# Re-run only missing or failed (example, repetition) pairs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    retries=5,
)

Resume with Evaluators

from phoenix.client.experiments import resume_experiment

# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Resume and also run evaluators on completed runs
resume_experiment(
    experiment_id="exp-abc123",
    task=my_task,
    evaluators={"accuracy": accuracy},
)

Resume Incomplete Evaluations

from phoenix.client.experiments import resume_evaluation

def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

# Run evaluators only on runs missing the "accuracy" evaluation
resume_evaluation(
    experiment_id="exp-abc123",
    evaluators={"accuracy": accuracy},
)
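
The selection step behind this call can be sketched as a lookup of which task runs have no evaluation under a given name. This is an illustration, not Phoenix's implementation: the helper runs_missing_evaluation is hypothetical, and the evaluation-run keys experiment_run_id and name are assumed.

```python
from typing import Any, List, Mapping


def runs_missing_evaluation(
    experiment: Mapping[str, Any], evaluator_name: str
) -> List[str]:
    """Return IDs of task runs that have no evaluation with the given name."""
    # Collect the task runs already covered by this evaluator.
    # The "experiment_run_id" and "name" keys are assumptions of this sketch.
    evaluated = {
        evaluation["experiment_run_id"]
        for evaluation in experiment["evaluation_runs"]
        if evaluation["name"] == evaluator_name
    }
    return [
        run["id"] for run in experiment["task_runs"] if run["id"] not in evaluated
    ]


# Hand-written sample data: run-2 has not been scored for "accuracy" yet.
sample = {
    "task_runs": [{"id": "run-1"}, {"id": "run-2"}],
    "evaluation_runs": [
        {"experiment_run_id": "run-1", "name": "accuracy"},
        {"experiment_run_id": "run-2", "name": "conciseness"},
    ],
}
missing = runs_missing_evaluation(sample, "accuracy")
print(missing)
```

Matching by name is why stable evaluator names matter: renaming an evaluator between calls would make every run look unevaluated.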

Complete Iterative Workflow

from phoenix.client import Client
from phoenix.client.experiments import (
    run_experiment,
    get_experiment,
    evaluate_experiment,
)

client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

# Step 1: Run initial experiment with basic evaluator
# generate_answer is a placeholder for your own model or pipeline call
def my_task(input):
    return generate_answer(input["question"])

def basic_check(output):
    return bool(output and len(str(output)) > 0)

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[basic_check],
    experiment_name="iterative-experiment",
)

# Step 2: Analyze results and identify need for more evaluators
experiment_id = experiment["experiment_id"]

# Step 3: Define and apply additional evaluators
def accuracy(output, expected):
    return 1.0 if output == expected.get("answer") else 0.0

def relevance(output, input):
    question_words = set(input["question"].lower().split())
    output_words = set(str(output).lower().split())
    return len(question_words & output_words) / max(len(question_words), 1)

# Re-load experiment (or use the one from run_experiment)
experiment = get_experiment(experiment_id=experiment_id)

# Apply new evaluators without re-running the task
final = evaluate_experiment(
    experiment=experiment,
    evaluators={"accuracy": accuracy, "relevance": relevance},
)

print(f"Final evaluation runs: {len(final['evaluation_runs'])}")
