Principle:Arize ai Phoenix Evaluation Result Analysis

Knowledge Sources	Phoenix, Phoenix Evals Evaluators
Domains	LLM Evaluation, Result Analysis, Observability
Last Updated	2026-02-14 00:00 GMT

Overview

Evaluation result analysis is the practice of interpreting, aggregating, and acting upon the structured scores produced by an evaluation pipeline to derive actionable insights about LLM application quality.

Description

Running evaluators produces raw Score objects (or their JSON-serialized representations in augmented DataFrames). These results are only valuable if they can be systematically analyzed to answer questions such as:

What percentage of responses were classified as relevant?
What is the average quality score across the dataset?
Which specific examples failed evaluation, and why?
How do scores differ between model versions or prompt iterations?
Are there patterns in the types of failures (e.g., does hallucination correlate with long context)?

Evaluation result analysis provides the concepts and techniques for extracting these insights from the structured output of Phoenix evaluation pipelines.

Usage

Use evaluation result analysis when you need to:

Compute aggregate metrics (mean, median, distribution) across evaluation scores.
Filter and inspect individual evaluation failures and their explanations.
Compare evaluation results across different experimental conditions.
Feed evaluation results into monitoring dashboards or alerting systems.
Generate reports summarizing LLM application quality over time.

Theoretical Basis

Score Data Model

Every evaluation result is represented as a Score dataclass, which is frozen, so an instance cannot be modified once created:

Score
  +-- name:        Optional[str]         # Evaluator/metric name
  +-- score:       Optional[float|int]   # Numeric value (None for label-only)
  +-- label:       Optional[str]         # Categorical classification label
  +-- explanation: Optional[str]         # LLM-generated reasoning
  +-- metadata:    Dict[str, Any]        # Arbitrary metadata (model, trace_id, etc.)
  +-- kind:        "human"|"llm"|"code"  # Source of the evaluation
  +-- direction:   "maximize"|"minimize" # How to interpret the score

The direction field is critical for analysis: a score of 0.1 is good when direction is "minimize" (e.g., error rate) but bad when direction is "maximize" (e.g., accuracy).
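
To make the role of direction concrete, here is a minimal sketch. ScoreSketch is a hypothetical stand-in that mirrors the fields listed above; it is not the library's Score class, and the defaults for kind and direction, as well as the is_better helper, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Literal, Optional, Union


@dataclass(frozen=True)
class ScoreSketch:
    """Hypothetical stand-in mirroring the fields above; not the library class."""

    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: Literal["human", "llm", "code"] = "llm"  # default is an assumption
    direction: Literal["maximize", "minimize"] = "maximize"  # default is an assumption


def is_better(a: ScoreSketch, b: ScoreSketch) -> bool:
    """Return True if `a` is strictly better than `b` under their shared direction."""
    if a.direction != b.direction:
        raise ValueError("scores with different directions are not comparable")
    if a.score is None or b.score is None:
        raise ValueError("label-only scores have no numeric value to compare")
    if a.direction == "maximize":
        return a.score > b.score
    return a.score < b.score


# The same numbers rank in opposite order depending on direction.
error_low = ScoreSketch(name="error_rate", score=0.1, direction="minimize")
error_high = ScoreSketch(name="error_rate", score=0.4, direction="minimize")
acc_low = ScoreSketch(name="accuracy", score=0.1)
acc_high = ScoreSketch(name="accuracy", score=0.4)
```

Here is_better(error_low, error_high) holds while is_better(acc_low, acc_high) does not, even though the numeric values are identical.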

DataFrame Result Structure

After calling evaluate_dataframe(), the augmented DataFrame contains:

Column Pattern	Content	Analysis Use
`{evaluator.name}_execution_details`	JSON with `status`, `exceptions`, `execution_time_sec`	Identify failures, measure latency, compute success rates
`{score.name}_score`	JSON-serialized `Score.to_dict()`	Extract numeric scores, labels, explanations for aggregation
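
Because both column types hold JSON, most analyses start by decoding them. The sketch below uses only the standard library and works on a list of row dicts, the form that pandas' df.to_dict("records") produces. The "relevance" prefix, the sample rows, the "failed" status value, and the assumption that a failed row has no score are all illustrative; the field names follow the table above.

```python
import json

# Made-up rows shaped like the table above.
rows = [
    {
        "relevance_execution_details": json.dumps(
            {"status": "success", "exceptions": [], "execution_time_sec": 0.8}
        ),
        "relevance_score": json.dumps(
            {"name": "relevance", "score": 1.0, "label": "relevant"}
        ),
    },
    {
        "relevance_execution_details": json.dumps(
            {"status": "failed", "exceptions": ["RateLimitError"], "execution_time_sec": 2.1}
        ),
        "relevance_score": None,  # assumption: no score when the evaluator fails
    },
]


def parse_cell(cell):
    """Decode a JSON string cell; pass dicts through; treat anything else as missing."""
    if isinstance(cell, dict):
        return cell
    if isinstance(cell, str):
        return json.loads(cell)
    return None  # covers None and NaN-style missing values


def flatten(row, prefix):
    """Pull the fields most analyses need into one flat record."""
    details = parse_cell(row.get(f"{prefix}_execution_details")) or {}
    score = parse_cell(row.get(f"{prefix}_score")) or {}
    return {
        "status": details.get("status"),
        "latency_sec": details.get("execution_time_sec"),
        "score": score.get("score"),
        "label": score.get("label"),
    }


records = [flatten(r, "relevance") for r in rows]
```

Flattening once up front keeps later aggregation code free of JSON handling.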

Analysis Dimensions

Evaluation results can be analyzed along several dimensions:

Aggregate statistics: Compute means, medians, standard deviations, and percentiles of numeric scores to understand overall quality levels.
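
As a minimal sketch of aggregate statistics, the following uses Python's statistics module on made-up score values. None entries stand for label-only results and are dropped before aggregation.

```python
import statistics

# Made-up numeric scores; None marks a label-only result.
raw_scores = [0.9, 0.7, None, 1.0, 0.4, 0.8, 0.6, 0.95, 0.5]
numeric = [s for s in raw_scores if s is not None]

# Deciles with the inclusive method stay within the observed min and max.
deciles = statistics.quantiles(numeric, n=10, method="inclusive")

summary = {
    "n": len(numeric),
    "mean": statistics.mean(numeric),
    "median": statistics.median(numeric),
    "stdev": statistics.stdev(numeric),  # sample standard deviation
    "p10": deciles[0],
    "p90": deciles[-1],
}
```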

Label distribution: For classification evaluators, count the frequency of each label to understand the distribution of outcomes (e.g., 80% relevant, 15% somewhat relevant, 5% not relevant).
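
A label distribution can be computed with a simple counter. The labels below are made up to reproduce the 80/15/5 split from the example.

```python
from collections import Counter

# Made-up labels matching the example split above.
labels = ["relevant"] * 16 + ["somewhat relevant"] * 3 + ["not relevant"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each label as a fraction of all labeled rows.
shares = {label: count / total for label, count in counts.items()}
```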

Failure analysis: Filter rows where the status field in the {evaluator.name}_execution_details JSON is not "success" to identify and debug evaluation failures. Examine the exceptions field for root causes (rate limits, invalid inputs, model errors).
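
The failure filter might look like the sketch below. The cells are made up; the "failed" status value and the format of the exception strings are assumptions, since only the field names are given here.

```python
import json
from collections import Counter

# Made-up execution-details cells; status values other than "success" and the
# exception string format are assumptions for illustration.
cells = [
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.7}),
    json.dumps({"status": "failed", "exceptions": ["RateLimitError: 429"], "execution_time_sec": 3.2}),
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.9}),
    json.dumps({"status": "failed", "exceptions": ["ValueError: missing input"], "execution_time_sec": 0.1}),
    json.dumps({"status": "failed", "exceptions": ["RateLimitError: 429"], "execution_time_sec": 3.0}),
]

details = [json.loads(c) for c in cells]

# Keep the row index so each failure can be traced back to its example.
failures = [(i, d) for i, d in enumerate(details) if d["status"] != "success"]
success_rate = 1 - len(failures) / len(details)

# Group by exception type (the text before the first colon) to surface root causes.
causes = Counter(
    exc.split(":", 1)[0] for _, d in failures for exc in d["exceptions"]
)
```

Counting by exception type separates transient problems such as rate limits, which usually call for a retry, from input problems that need a data fix.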

Explanation mining: When include_explanation=True, the LLM provides reasoning for its classification. These explanations can be aggregated, clustered, or manually reviewed to understand why certain outputs fail evaluation criteria.
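
One lightweight way to mine explanations is to count recurring terms among failing rows. The sketch below is a crude keyword tally, not a clustering method; the explanations and the stopword list are made up.

```python
import re
from collections import Counter

# Made-up explanations attached to rows labeled "not relevant".
explanations = [
    "The answer discusses pricing but the question asked about refunds.",
    "Response ignores the refunds question entirely.",
    "Off-topic: talks about shipping, not refunds.",
]

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "but", "about", "not", "a", "an", "of", "to"}

term_counts = Counter(
    word
    for text in explanations
    for word in re.findall(r"[a-z]+", text.lower())
    if word not in STOPWORDS
)

top_terms = term_counts.most_common(3)
```

Frequent terms give a quick hint at which topic the failures share before any manual review.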

Cross-evaluator correlation: When multiple evaluators are applied to the same dataset, their scores can be correlated to discover relationships (e.g., hallucinated responses tend to also score low on relevance).
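
A correlation between two evaluators can be sketched with a hand-written Pearson coefficient. The paired scores are made up, and the evaluator names are only illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-example scores from two evaluators run on the same rows.
hallucination = [0.1, 0.8, 0.2, 0.9, 0.3]  # direction: minimize
relevance = [0.9, 0.4, 0.7, 0.2, 0.8]      # direction: maximize

r = pearson(hallucination, relevance)
```

A strongly negative r on data like this would indicate that examples with more hallucination tend to score lower on relevance. Rows where either evaluator failed should be dropped before pairing.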

Temporal tracking: By storing evaluation results over time (with timestamps and model version metadata), teams can track quality trends across deployments.
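
Assuming results are stored as records with a timestamp and a model version, a trend can be computed by grouping. The record layout and field names below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up stored results; the field names are hypothetical.
history = [
    {"timestamp": "2026-01-05", "model_version": "v1", "score": 0.6},
    {"timestamp": "2026-01-06", "model_version": "v1", "score": 0.7},
    {"timestamp": "2026-02-02", "model_version": "v2", "score": 0.8},
    {"timestamp": "2026-02-03", "model_version": "v2", "score": 0.9},
]

# Collect scores per model version.
by_version = defaultdict(list)
for rec in history:
    by_version[rec["model_version"]].append(rec["score"])

# Mean score per version, in sorted version order.
trend = {version: mean(scores) for version, scores in sorted(by_version.items())}
```

The same grouping works with a date bucket (for example, the month prefix of the timestamp) as the key instead of the model version.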

Interpreting Score Direction

Direction	Higher Score Means	Lower Score Means	Example Metrics
`"maximize"`	Better quality	Worse quality	Accuracy, relevance, fluency
`"minimize"`	Worse quality	Better quality	Error rate, hallucination frequency, toxicity

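When aggregating scores from multiple evaluators with different directions, either convert them to a common scale on which higher always means better, or analyze each direction separately. Averaging raw "maximize" and "minimize" scores together produces a number with no clear interpretation.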
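
One simple normalization, assuming every score already lies in [0, 1], is to flip "minimize" scores so that higher always means better. The helper name and sample results are hypothetical, and scores on other ranges would need rescaling first.

```python
def to_higher_is_better(score, direction):
    """Map a score in [0, 1] onto a common higher-is-better scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("this sketch assumes scores already lie in [0, 1]")
    if direction == "maximize":
        return score
    if direction == "minimize":
        return 1.0 - score
    raise ValueError(f"unknown direction: {direction!r}")


# Made-up results from two evaluators with opposite directions.
results = [
    {"name": "accuracy", "score": 0.9, "direction": "maximize"},
    {"name": "toxicity", "score": 0.1, "direction": "minimize"},
]

normalized = {
    r["name"]: to_higher_is_better(r["score"], r["direction"]) for r in results
}
```

After normalization both entries sit near 0.9, so they can be averaged or plotted on the same axis.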

Related Pages

Implemented By

Implementation:Arize_ai_Phoenix_Score_Dataclass
