Principle:Arize ai Phoenix Experiment Result Analysis
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Experiment Analysis, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Experiment result analysis is the practice of inspecting, comparing, and extending the outcomes of completed experiments to derive actionable insights about system quality and to iteratively refine evaluation criteria.
Description
After an experiment has been executed, the result analysis phase transforms raw run data and evaluation scores into understanding. This phase encompasses several activities:
- Programmatic inspection: Accessing the RanExperiment object to examine individual task runs, their outputs, any errors encountered, and the evaluation scores assigned by each evaluator.
- Visual analysis: Viewing experiment results in the Phoenix UI, which provides interactive dashboards for comparing experiments, filtering runs, and exploring evaluation distributions.
- Iterative evaluation: Running additional evaluators against a completed experiment without re-executing the task, enabling incremental refinement of evaluation criteria.
- Experiment retrieval: Loading previously completed experiments by ID for retrospective analysis or to add new evaluations.
- Experiment resumption: Continuing an interrupted experiment by re-running only the missing or failed runs, avoiding redundant computation.
The result analysis principle recognizes that evaluation is not a one-shot process but an iterative cycle. The typical workflow is:
1. Run an initial experiment with a basic set of evaluators.
2. Analyze the results to identify patterns, failures, and areas needing additional evaluation.
3. Define new evaluators based on the insights gained.
4. Apply the new evaluators to the existing experiment using evaluate_experiment.
5. Repeat until the evaluation criteria sufficiently capture the quality dimensions of interest.
This iterative approach is particularly valuable because:
- Task execution is expensive: LLM calls, API requests, and pipeline executions consume time and resources. By decoupling task execution from evaluation, the framework avoids redundant computation when only the evaluation criteria change.
- Evaluation criteria evolve: As understanding of failure modes deepens, new evaluators are needed. Applying them retroactively adds their scores to an existing experiment alongside the earlier evaluation runs, so the evaluation history stays complete.
- Comparison requires consistency: When comparing experiments, it is essential that the same evaluators are applied to all experiments. Retroactive evaluation enables this even when evaluators are defined after some experiments have already run.
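The consistency requirement can be illustrated with a minimal sketch that compares experiments only on the evaluators they all share. It works on plain dictionaries shaped like the result data model on this page; the `score` key inside `result` and the helper names `scores_by_evaluator` and `compare_on_shared_evaluators` are assumptions made for illustration, not part of the Phoenix API.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def scores_by_evaluator(experiment: Dict[str, Any]) -> Dict[str, List[float]]:
    """Group numeric scores by evaluator name, ignoring failed evaluations."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for ev in experiment["evaluation_runs"]:
        result = ev.get("result")
        if ev.get("error") is None and result and result.get("score") is not None:
            grouped[ev["name"]].append(result["score"])
    return dict(grouped)


def compare_on_shared_evaluators(
    experiments: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, float]]:
    """Return mean scores per experiment, restricted to evaluators present in all."""
    if not experiments:
        return {}
    per_exp = {label: scores_by_evaluator(exp) for label, exp in experiments.items()}
    shared = set.intersection(*(set(scores) for scores in per_exp.values()))
    return {
        name: {label: mean(per_exp[label][name]) for label in per_exp}
        for name in sorted(shared)
    }


# Hypothetical data: "conciseness" was only applied to the baseline.
baseline = {"evaluation_runs": [
    {"name": "exact_match", "result": {"score": 1.0}, "error": None},
    {"name": "exact_match", "result": {"score": 0.0}, "error": None},
    {"name": "conciseness", "result": {"score": 0.5}, "error": None},
]}
candidate = {"evaluation_runs": [
    {"name": "exact_match", "result": {"score": 1.0}, "error": None},
    {"name": "exact_match", "result": {"score": 1.0}, "error": None},
]}

comparison = compare_on_shared_evaluators({"baseline": baseline, "candidate": candidate})
print(comparison)
```

In this sketch `conciseness` is dropped from the comparison because the candidate was never scored with it. Applying that evaluator retroactively to the candidate would bring it back into the shared set.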
Usage
Experiment result analysis should be applied in the following scenarios:
- Post-experiment review: When examining the outcomes of a completed experiment to understand system performance across the evaluation dataset.
- Evaluator development: When iteratively building evaluation criteria by applying new evaluators to existing experiment results and inspecting the scores.
- Cross-experiment comparison: When comparing the results of multiple experiments to identify the best-performing system configuration.
- Failure analysis: When investigating specific task runs that produced unexpected outputs or low evaluation scores.
- Recovery from interruption: When an experiment was interrupted (due to network failure, rate limiting, or timeout) and needs to be resumed without re-running successful task executions.
- Audit and compliance: When retrieving historical experiment records to verify that evaluation was conducted properly.
Theoretical Basis
Experiment result analysis follows an observation-analysis-iteration cycle from empirical evaluation methodology: observe the recorded runs, analyze the scores and failures, and iterate on the evaluators.
The result data model is structured as:
RanExperiment = {
    experiment_id: str,
    dataset_id: str,
    dataset_version_id: str,
    task_runs: List[ExperimentRun],
    evaluation_runs: List[ExperimentEvaluationRun],
    experiment_metadata: Dict[str, Any],
    project_name: Optional[str]
}

ExperimentRun = {
    id: str,
    example_id: str,
    repetition_number: int,
    output: Optional[JSONSerializable],
    error: Optional[str],
    trace_id: Optional[str],
    start_time: datetime,
    end_time: datetime
}

ExperimentEvaluationRun = {
    id: str,
    experiment_run_id: str,
    name: str,
    annotator_kind: str,  # "CODE" or "LLM"
    result: Optional[EvaluationResult],
    error: Optional[str],
    trace_id: Optional[str],
    start_time: datetime,
    end_time: datetime,
    metadata: Dict[str, Any]
}
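As a rough illustration of programmatic inspection, the sketch below walks a dictionary shaped like the data model above to list failed task runs and average the scores per evaluator. The sample data is hypothetical, and the `score` key inside `result` is an assumption about the shape of EvaluationResult, which this page does not define.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data shaped like the RanExperiment model above.
# The "score" key inside "result" is an assumption about EvaluationResult.
experiment = {
    "experiment_id": "exp-1",
    "task_runs": [
        {"id": "run-1", "example_id": "ex-1", "repetition_number": 1,
         "output": "Paris", "error": None},
        {"id": "run-2", "example_id": "ex-2", "repetition_number": 1,
         "output": None, "error": "TimeoutError"},
    ],
    "evaluation_runs": [
        {"experiment_run_id": "run-1", "name": "exact_match",
         "result": {"score": 1.0}, "error": None},
        {"experiment_run_id": "run-1", "name": "conciseness",
         "result": {"score": 0.5}, "error": None},
    ],
}


def failed_runs(exp):
    """Return the task runs that recorded an error."""
    return [run for run in exp["task_runs"] if run.get("error")]


def mean_scores(exp):
    """Average numeric scores per evaluator name, skipping failed evaluations."""
    scores = defaultdict(list)
    for ev in exp["evaluation_runs"]:
        result = ev.get("result")
        if ev.get("error") is None and result and result.get("score") is not None:
            scores[ev["name"]].append(result["score"])
    return {name: mean(values) for name, values in scores.items()}


print([run["id"] for run in failed_runs(experiment)])
print(mean_scores(experiment))
```

Keeping the failure list separate from the score summary makes it easier to tell whether a low average reflects poor outputs or simply runs that never produced an output.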
The analysis workflow supports three primary operations:
1. Retrieval: Loading a completed experiment for analysis.
get_experiment(experiment_id) -> RanExperiment
This operation fetches the complete experiment record including all task runs and evaluation runs. The returned RanExperiment can be inspected programmatically or passed to evaluate_experiment for additional evaluations.
2. Retroactive evaluation: Applying new evaluators to an existing experiment.
evaluate_experiment(experiment, evaluators) -> RanExperiment
This operation iterates over the task runs in the experiment, applies each evaluator to the run's output (along with the corresponding example data), and records the evaluation results. The returned RanExperiment includes both the original and new evaluation runs.
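The behavior described above can be sketched in a few lines. This is not the Phoenix implementation: `apply_evaluators`, the `(output, example) -> score` evaluator signature, the `examples` lookup, and the decision to skip task runs that recorded an error are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# Hypothetical evaluator signature: (output, example) -> numeric score.
Evaluator = Callable[[Any, Dict[str, Any]], float]


def apply_evaluators(
    experiment: Dict[str, Any],
    examples: Dict[str, Dict[str, Any]],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, Any]:
    """Return a copy of the experiment with new evaluation runs appended."""
    new_runs: List[Dict[str, Any]] = []
    for run in experiment["task_runs"]:
        if run.get("error"):
            continue  # assumption in this sketch: skip runs without usable output
        example = examples[run["example_id"]]
        for name, evaluator in evaluators.items():
            start = datetime.now(timezone.utc)
            result, error = None, None
            try:
                result = {"score": evaluator(run["output"], example)}
            except Exception as exc:  # record the failure instead of aborting
                error = repr(exc)
            new_runs.append({
                "experiment_run_id": run["id"],
                "name": name,
                "annotator_kind": "CODE",
                "result": result,
                "error": error,
                "start_time": start,
                "end_time": datetime.now(timezone.utc),
            })
    # The task runs are reused untouched; only evaluation runs grow.
    return {**experiment, "evaluation_runs": experiment["evaluation_runs"] + new_runs}


def exact_match(output: Any, example: Dict[str, Any]) -> float:
    return float(output == example["expected"])


experiment = {
    "task_runs": [{"id": "run-1", "example_id": "ex-1", "output": "Paris", "error": None}],
    "evaluation_runs": [{"experiment_run_id": "run-1", "name": "conciseness",
                         "result": {"score": 0.5}, "error": None}],
}
examples = {"ex-1": {"expected": "Paris"}}

updated = apply_evaluators(experiment, examples, {"exact_match": exact_match})
print([ev["name"] for ev in updated["evaluation_runs"]])
```

The task function is never called here, which is the point of the operation: only the stored outputs are read, so adding an evaluator costs evaluation time but no task re-execution.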
3. Resumption: Continuing an interrupted experiment or evaluation.
resume_experiment(experiment_id, task) -> None
resume_evaluation(experiment_id, evaluators) -> None
Resumption identifies which (example, repetition) pairs have not been completed (either missing or failed) and processes only those pairs. This is implemented using pagination to minimize memory usage for large experiments.
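A minimal sketch of the pair-selection logic follows. The helper names `paginate` and `incomplete_pairs` are hypothetical, the in-memory `paginate` function merely stands in for whatever paginated query the server exposes, and repetition numbers are assumed to start at 1.

```python
from typing import Any, Dict, Iterable, Iterator, List, Set, Tuple

Pair = Tuple[str, int]  # (example_id, repetition_number)


def paginate(items: List[Any], page_size: int) -> Iterator[List[Any]]:
    """Yield fixed-size pages; a stand-in for a paginated server query."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]


def incomplete_pairs(
    example_ids: Iterable[str],
    repetitions: int,
    run_pages: Iterable[List[Dict[str, Any]]],
) -> List[Pair]:
    """Return the (example, repetition) pairs that have no successful run."""
    succeeded: Set[Pair] = set()
    for page in run_pages:  # only one page of runs is held at a time
        for run in page:
            if run.get("error") is None:
                succeeded.add((run["example_id"], run["repetition_number"]))
    return [
        (example_id, rep)
        for example_id in example_ids
        for rep in range(1, repetitions + 1)  # assumes repetitions start at 1
        if (example_id, rep) not in succeeded
    ]


# Hypothetical recorded runs: one failed, and ("ex-2", 2) never started.
recorded = [
    {"example_id": "ex-1", "repetition_number": 1, "error": None},
    {"example_id": "ex-1", "repetition_number": 2, "error": "RateLimitError"},
    {"example_id": "ex-2", "repetition_number": 1, "error": None},
]

todo = incomplete_pairs(["ex-1", "ex-2"], repetitions=2, run_pages=paginate(recorded, 2))
print(todo)
```

Both the failed pair and the missing pair end up in `todo`, while the two successful runs are left alone. Only the set of successful keys is retained across pages, not the run records themselves.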
The separation of task execution and evaluation follows the separation of concerns principle: the task captures what the system does, while evaluators capture how well it does it. This separation enables the iterative refinement cycle that is essential for developing robust evaluation suites.