Principle:Arize ai Phoenix Experiment Result Analysis

From Leeroopedia
Knowledge Sources
Domains: AI Observability, Experiment Analysis, Evaluation Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

Experiment result analysis is the practice of inspecting, comparing, and extending the outcomes of completed experiments to derive actionable insights about system quality and to iteratively refine evaluation criteria.

Description

After an experiment has been executed, the result analysis phase transforms raw run data and evaluation scores into understanding. This phase encompasses several activities:

  • Programmatic inspection: Accessing the RanExperiment object to examine individual task runs, their outputs, any errors encountered, and the evaluation scores assigned by each evaluator.
  • Visual analysis: Viewing experiment results in the Phoenix UI, which provides interactive dashboards for comparing experiments, filtering runs, and exploring evaluation distributions.
  • Iterative evaluation: Running additional evaluators against a completed experiment without re-executing the task, enabling incremental refinement of evaluation criteria.
  • Experiment retrieval: Loading previously completed experiments by ID for retrospective analysis or to add new evaluations.
  • Experiment resumption: Continuing an interrupted experiment by re-running only the missing or failed runs, avoiding redundant computation.
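
As a rough illustration of programmatic inspection, the sketch below summarizes task runs that have been exported as plain dictionaries. It does not use the Phoenix client: the `summarize_runs` helper and the sample data are hypothetical, and the only assumption taken from this page is that each run carries `id`, `example_id`, `output`, and `error` fields.

```python
from typing import Any, Dict, List

# Hypothetical exported task runs; the values are made up for illustration.
task_runs: List[Dict[str, Any]] = [
    {"id": "run-1", "example_id": "ex-1", "output": "Paris", "error": None},
    {"id": "run-2", "example_id": "ex-2", "output": None, "error": "TimeoutError"},
    {"id": "run-3", "example_id": "ex-3", "output": "Berlin", "error": None},
]


def summarize_runs(runs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count successes and collect the ids of runs that ended with an error."""
    failed = [run for run in runs if run["error"] is not None]
    return {
        "total": len(runs),
        "succeeded": len(runs) - len(failed),
        "failed_ids": [run["id"] for run in failed],
    }


summary = summarize_runs(task_runs)
print(summary)
```

A summary like this is usually the first step of failure analysis: the ids in `failed_ids` point to the runs worth opening individually.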

The result analysis principle recognizes that evaluation is not a one-shot process but an iterative cycle. The typical workflow is:

  1. Run an initial experiment with a basic set of evaluators.
  2. Analyze the results to identify patterns, failures, and areas needing additional evaluation.
  3. Define new evaluators based on the insights gained.
  4. Apply the new evaluators to the existing experiment using evaluate_experiment.
  5. Repeat until the evaluation criteria sufficiently capture the quality dimensions of interest.

This iterative approach is particularly valuable because:

  • Task execution is expensive: LLM calls, API requests, and pipeline executions consume time and resources. By decoupling task execution from evaluation, the framework avoids redundant computation when only the evaluation criteria change.
  • Evaluation criteria evolve: As understanding of failure modes deepens, new evaluators are needed. The ability to retroactively apply evaluators to existing experiments preserves the complete evaluation history.
  • Comparison requires consistency: When comparing experiments, it is essential that the same evaluators are applied to all experiments. Retroactive evaluation enables this even when evaluators are defined after some experiments have already run.
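
The consistency requirement can be checked mechanically before comparing experiments. The following is a minimal sketch, assuming the evaluator names already applied to each experiment are known; the `applied` mapping and the `missing_evaluators` helper are hypothetical and not part of any library.

```python
from typing import Dict, Set

# Hypothetical record of which evaluator names each experiment already has.
applied: Dict[str, Set[str]] = {
    "exp-a": {"exact_match"},
    "exp-b": {"exact_match", "conciseness"},
}


def missing_evaluators(applied: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each experiment, list evaluators used elsewhere but not applied to it."""
    all_names = set().union(*applied.values())
    return {exp_id: all_names - names for exp_id, names in applied.items()}


gaps = missing_evaluators(applied)
print(gaps)
```

Any non-empty entry in `gaps` marks an experiment that needs a retroactive evaluation pass before its scores are comparable with the others.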

Usage

Experiment result analysis should be applied in the following scenarios:

  • Post-experiment review: When examining the outcomes of a completed experiment to understand system performance across the evaluation dataset.
  • Evaluator development: When iteratively building evaluation criteria by applying new evaluators to existing experiment results and inspecting the scores.
  • Cross-experiment comparison: When comparing the results of multiple experiments to identify the best-performing system configuration.
  • Failure analysis: When investigating specific task runs that produced unexpected outputs or low evaluation scores.
  • Recovery from interruption: When an experiment was interrupted (due to network failure, rate limiting, or timeout) and needs to be resumed without re-running successful task executions.
  • Audit and compliance: When retrieving historical experiment records to verify that evaluation was conducted properly.

Theoretical Basis

Experiment result analysis implements the observation-analysis-iteration cycle from empirical evaluation methodology.

The result data model is structured as:

RanExperiment = {
    experiment_id: str,
    dataset_id: str,
    dataset_version_id: str,
    task_runs: List[ExperimentRun],
    evaluation_runs: List[ExperimentEvaluationRun],
    experiment_metadata: Dict[str, Any],
    project_name: Optional[str]
}

ExperimentRun = {
    id: str,
    example_id: str,
    repetition_number: int,
    output: Optional[JSONSerializable],
    error: Optional[str],
    trace_id: Optional[str],
    start_time: datetime,
    end_time: datetime
}

ExperimentEvaluationRun = {
    id: str,
    experiment_run_id: str,
    name: str,
    annotator_kind: str,           # "CODE" or "LLM"
    result: Optional[EvaluationResult],
    error: Optional[str],
    trace_id: Optional[str],
    start_time: datetime,
    end_time: datetime,
    metadata: Dict[str, Any]
}
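
To show how these records combine during analysis, the sketch below averages scores per evaluator. It uses a trimmed-down, hypothetical `EvalRun` dataclass in place of ExperimentEvaluationRun and assumes that each evaluation result exposes a numeric score; both are illustration choices, not the library's definitions.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class EvalRun:
    """Trimmed-down, hypothetical stand-in for ExperimentEvaluationRun."""
    experiment_run_id: str
    name: str
    score: Optional[float]  # assumed numeric score; None if the evaluator failed
    error: Optional[str] = None


def mean_score_by_evaluator(evaluation_runs: List[EvalRun]) -> Dict[str, float]:
    """Average the scores of each evaluator, skipping runs without a score."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for ev in evaluation_runs:
        if ev.score is not None:
            scores[ev.name].append(ev.score)
    return {name: mean(values) for name, values in scores.items()}


evaluation_runs = [
    EvalRun("run-1", "exact_match", 1.0),
    EvalRun("run-2", "exact_match", 0.0),
    EvalRun("run-3", "exact_match", None, error="evaluator crashed"),
    EvalRun("run-1", "conciseness", 0.5),
]
averages = mean_score_by_evaluator(evaluation_runs)
print(averages)
```

Because every evaluation run carries `experiment_run_id`, the same grouping can be keyed by task run instead of evaluator name to see all scores for a single output.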

The analysis workflow supports three primary operations:

1. Retrieval: Loading a completed experiment for analysis.

get_experiment(experiment_id) -> RanExperiment

This operation fetches the complete experiment record including all task runs and evaluation runs. The returned RanExperiment can be inspected programmatically or passed to evaluate_experiment for additional evaluations.

2. Retroactive evaluation: Applying new evaluators to an existing experiment.

evaluate_experiment(experiment, evaluators) -> RanExperiment

This operation iterates over the task runs in the experiment, applies each evaluator to the run's output (along with the corresponding example data), and records the evaluation results. The returned RanExperiment includes both the original and new evaluation runs.
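
The idea behind retroactive evaluation can be sketched in plain Python. This is not the Phoenix implementation: the `Experiment` dataclass, the `evaluate_offline` function, and the evaluator signature `(output, example) -> score` are hypothetical, and skipping failed task runs is a simplification of this sketch.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, List

Evaluator = Callable[[Any, Dict[str, Any]], float]  # (output, example) -> score


@dataclass(frozen=True)
class Experiment:
    """Hypothetical in-memory stand-in for a completed experiment."""
    task_runs: List[Dict[str, Any]]
    examples: Dict[str, Dict[str, Any]]
    evaluation_runs: List[Dict[str, Any]] = field(default_factory=list)


def evaluate_offline(experiment: Experiment, evaluators: Dict[str, Evaluator]) -> Experiment:
    """Score stored outputs with new evaluators; the task itself is never re-run."""
    new_runs = []
    for run in experiment.task_runs:
        if run["error"] is not None:
            continue  # sketch assumption: nothing to score for a failed task run
        example = experiment.examples[run["example_id"]]
        for name, evaluator in evaluators.items():
            new_runs.append({
                "experiment_run_id": run["id"],
                "name": name,
                "score": evaluator(run["output"], example),
            })
    # Return a new object holding both the original and the new evaluation runs.
    return replace(experiment, evaluation_runs=experiment.evaluation_runs + new_runs)


def exact_match(output: Any, example: Dict[str, Any]) -> float:
    return float(output == example["expected"])


experiment = Experiment(
    task_runs=[
        {"id": "run-1", "example_id": "ex-1", "output": "Paris", "error": None},
        {"id": "run-2", "example_id": "ex-2", "output": "Bonn", "error": None},
    ],
    examples={"ex-1": {"expected": "Paris"}, "ex-2": {"expected": "Berlin"}},
    evaluation_runs=[{"experiment_run_id": "run-1", "name": "non_empty", "score": 1.0}],
)
updated = evaluate_offline(experiment, {"exact_match": exact_match})
print([(e["experiment_run_id"], e["name"], e["score"]) for e in updated.evaluation_runs])
```

The stored outputs are the only input to the evaluators, which is why adding a new evaluator costs no additional task executions.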

3. Resumption: Continuing an interrupted experiment or evaluation.

resume_experiment(experiment_id, task) -> None
resume_evaluation(experiment_id, evaluators) -> None

Resumption identifies which (example, repetition) pairs have not been completed (either missing or failed) and processes only those pairs. Runs are fetched page by page, so a large experiment does not have to be held in memory all at once.
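
The selection of incomplete pairs can be sketched as follows. The helpers `iter_run_pages` and `incomplete_pairs` are hypothetical, the paginated listing is simulated with an in-memory list, and repetition numbers are assumed to start at 1; none of this is taken from the Phoenix source.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Run = Dict[str, object]
Pair = Tuple[str, int]  # (example_id, repetition_number)


def iter_run_pages(runs: List[Run], page_size: int) -> Iterator[List[Run]]:
    """Hypothetical stand-in for a paginated server listing of task runs."""
    for start in range(0, len(runs), page_size):
        yield runs[start:start + page_size]


def incomplete_pairs(
    example_ids: List[str], repetitions: int, pages: Iterable[List[Run]]
) -> List[Pair]:
    """Return the pairs with no successful run yet (missing or failed)."""
    completed: Set[Pair] = set()
    for page in pages:  # only one page of runs is examined at a time
        for run in page:
            if run["error"] is None:
                completed.add((run["example_id"], run["repetition_number"]))
    expected = product(example_ids, range(1, repetitions + 1))
    return [pair for pair in expected if pair not in completed]


stored_runs: List[Run] = [
    {"example_id": "ex-1", "repetition_number": 1, "error": None},
    {"example_id": "ex-1", "repetition_number": 2, "error": "RateLimitError"},
    {"example_id": "ex-2", "repetition_number": 1, "error": None},
]
todo = incomplete_pairs(["ex-1", "ex-2"], 2, iter_run_pages(stored_runs, page_size=2))
print(todo)
```

Only the compact set of completed pairs is kept across pages, which is what keeps memory use low even when the run records themselves are large.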

The separation of task execution and evaluation follows the separation of concerns principle: the task captures what the system does, while evaluators capture how well it does it. This separation enables the iterative refinement cycle that is essential for developing robust evaluation suites.
