Principle:Arize ai Phoenix Evaluation Result Analysis

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Result Analysis, Observability
Last Updated: 2026-02-14 00:00 GMT

Overview

Evaluation result analysis is the practice of interpreting, aggregating, and acting upon the structured scores produced by an evaluation pipeline to derive actionable insights about LLM application quality.

Description

Running evaluators produces raw Score objects (or their JSON-serialized representations in augmented DataFrames). These results are only valuable if they can be systematically analyzed to answer questions such as:

  • What percentage of responses were classified as relevant?
  • What is the average quality score across the dataset?
  • Which specific examples failed evaluation, and why?
  • How do scores differ between model versions or prompt iterations?
  • Are there patterns in the types of failures (e.g., hallucination correlates with long context)?

Evaluation result analysis provides the concepts and techniques for extracting these insights from the structured output of Phoenix evaluation pipelines.

Usage

Use evaluation result analysis when you need to:

  • Compute aggregate metrics (mean, median, distribution) across evaluation scores.
  • Filter and inspect individual evaluation failures and their explanations.
  • Compare evaluation results across different experimental conditions.
  • Feed evaluation results into monitoring dashboards or alerting systems.
  • Generate reports summarizing LLM application quality over time.

Theoretical Basis

Score Data Model

Every evaluation result is represented as a frozen Score dataclass, so an instance cannot be modified once created:

Score
  +-- name:        Optional[str]         # Evaluator/metric name
  +-- score:       Optional[float|int]   # Numeric value (None for label-only)
  +-- label:       Optional[str]         # Categorical classification label
  +-- explanation: Optional[str]         # LLM-generated reasoning
  +-- metadata:    Dict[str, Any]        # Arbitrary metadata (model, trace_id, etc.)
  +-- kind:        "human"|"llm"|"code"  # Source of the evaluation
  +-- direction:   "maximize"|"minimize" # How to interpret the score

The direction field is critical for analysis: a score of 0.1 is good when direction is "minimize" (e.g., error rate) but bad when direction is "maximize" (e.g., accuracy).
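The following is a minimal sketch of how direction changes a comparison. ScoreSketch and is_better are hypothetical names; the class only mirrors the fields listed above (with assumed defaults) and is not the Phoenix Score class itself.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class ScoreSketch:
    """Hypothetical stand-in mirroring the fields listed above (defaults are assumed)."""
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: str = "llm"            # "human" | "llm" | "code"
    direction: str = "maximize"  # "maximize" | "minimize"


def is_better(a: ScoreSketch, b: ScoreSketch) -> bool:
    """Return True if `a` is strictly better than `b` under their shared direction."""
    if a.direction != b.direction:
        raise ValueError("scores with different directions are not comparable")
    if a.score is None or b.score is None:
        raise ValueError("label-only scores have no numeric value to compare")
    if a.direction == "minimize":
        return a.score < b.score
    return a.score > b.score


# Made-up values: a lower error rate wins because direction is "minimize".
low_error = ScoreSketch(name="error_rate", score=0.1, direction="minimize")
high_error = ScoreSketch(name="error_rate", score=0.4, direction="minimize")
```

With these values, is_better(low_error, high_error) is true, while the same two numbers compared under "maximize" give the opposite answer.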

DataFrame Result Structure

After calling evaluate_dataframe(), the augmented DataFrame contains:

Column Pattern | Content | Analysis Use
{evaluator.name}_execution_details | JSON with status, exceptions, execution_time_sec | Identify failures, measure latency, compute success rates
{score.name}_score | JSON-serialized Score.to_dict() | Extract numeric scores, labels, explanations for aggregation
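Because both column types hold JSON strings, analysis usually starts by decoding them. The sketch below uses plain dictionaries as a stand-in for DataFrame rows; the evaluator name "relevance", the "error" status value, the sample payloads, and the assumption that a failed row has an empty score cell are all illustrative rather than taken from Phoenix.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical rows standing in for an augmented DataFrame. Column names follow
# the patterns above for an evaluator and a score both named "relevance".
rows: List[Dict[str, Optional[str]]] = [
    {
        "relevance_execution_details": json.dumps(
            {"status": "success", "exceptions": [], "execution_time_sec": 0.8}
        ),
        "relevance_score": json.dumps(
            {"name": "relevance", "score": 1.0, "label": "relevant",
             "kind": "llm", "direction": "maximize"}
        ),
    },
    {
        "relevance_execution_details": json.dumps(
            {"status": "error", "exceptions": ["TimeoutError()"],
             "execution_time_sec": 30.0}
        ),
        "relevance_score": None,  # assumed: no score is recorded for a failed row
    },
]


def parse_cell(cell: Optional[str]) -> Optional[Dict[str, Any]]:
    """Decode one JSON cell; missing cells stay None."""
    return json.loads(cell) if isinstance(cell, str) else None


scores = [parse_cell(row["relevance_score"]) for row in rows]
details = [parse_cell(row["relevance_execution_details"]) for row in rows]

# Keep only rows that actually carry a numeric score.
numeric = [s["score"] for s in scores if s and s.get("score") is not None]
```

On a real pandas DataFrame the same decoding can be applied per column, for example with Series.apply and a helper like parse_cell.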

Analysis Dimensions

Evaluation results can be analyzed along several dimensions:

Aggregate statistics: Compute means, medians, standard deviations, and percentiles of numeric scores to understand overall quality levels.
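As a small illustration, the standard library's statistics module covers these aggregates once the numeric scores are extracted. The score values below are made up, and None stands for a label-only or failed row.

```python
import statistics

# Made-up scores; None marks rows with no numeric value.
scores = [0.9, 0.7, 0.4, 1.0, 0.8, None, 0.6]
numeric = [s for s in scores if s is not None]

summary = {
    "count": len(numeric),
    "mean": statistics.mean(numeric),
    "median": statistics.median(numeric),
    "stdev": statistics.stdev(numeric),               # sample standard deviation
    "quartiles": statistics.quantiles(numeric, n=4),  # three cut points: Q1, Q2, Q3
}
```

Dropping missing values before aggregating keeps failed evaluations from silently counting as zeros.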

Label distribution: For classification evaluators, count the frequency of each label to understand the distribution of outcomes (e.g., 80% relevant, 15% somewhat relevant, 5% not relevant).
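A label distribution like the one in the example can be computed with collections.Counter. The label list below is made up to reproduce the 80/15/5 split.

```python
from collections import Counter

# Made-up labels: 16 relevant, 3 somewhat relevant, 1 not relevant.
labels = ["relevant"] * 16 + ["somewhat relevant"] * 3 + ["not relevant"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Fraction of rows per label, most frequent first.
distribution = {label: count / total for label, count in counts.most_common()}
```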

Failure analysis: Filter rows where execution_details["status"] != "success" to identify and debug evaluation failures. Examine exceptions for root causes (rate limits, invalid inputs, model errors).
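The filter described above can be sketched as follows. The "success" status string and the field names come from this page; the "error" value and the exception strings are made up, and the actual values may differ between Phoenix versions.

```python
import json
from collections import Counter

# Made-up execution_details cells, one JSON string per row.
details_cells = [
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.7}),
    json.dumps({"status": "error", "exceptions": ["RateLimitError('429')"],
                "execution_time_sec": 2.1}),
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.9}),
    json.dumps({"status": "error", "exceptions": ["ValueError('empty input')"],
                "execution_time_sec": 0.1}),
]
details = [json.loads(cell) for cell in details_cells]

# Keep the row index so each failure can be traced back to its input.
failures = [(i, d) for i, d in enumerate(details) if d["status"] != "success"]
success_rate = 1 - len(failures) / len(details)

# Count recorded exceptions to see which root causes dominate.
exception_counts = Counter(exc for _, d in failures for exc in d["exceptions"])
```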

Explanation mining: When include_explanation=True, the LLM provides reasoning for its classification. These explanations can be aggregated, clustered, or manually reviewed to understand why certain outputs fail evaluation criteria.
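One crude way to aggregate explanations is to count recurring terms among failing rows. The sketch below is only illustrative: the results list is made up, top_terms is a hypothetical helper, and dropping words of three letters or fewer is a rough substitute for a real stopword list.

```python
import re
from collections import Counter
from typing import List, Tuple

# Made-up evaluation results with LLM-style explanations.
results = [
    {"label": "not relevant", "explanation": "The answer ignores the refund question."},
    {"label": "not relevant", "explanation": "Response covers shipping, not the refund policy."},
    {"label": "relevant", "explanation": "Directly answers the question asked."},
    {"label": "not relevant", "explanation": "No mention of refund terms."},
]


def top_terms(explanations: List[str], k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent words longer than three letters."""
    words = (w for text in explanations for w in re.findall(r"[a-z']+", text.lower()))
    return Counter(w for w in words if len(w) > 3).most_common(k)


failing = [r["explanation"] for r in results
           if r["label"] == "not relevant" and r["explanation"]]
most_common_term = top_terms(failing, k=1)
```

Term counts are a starting point for manual review; clustering or embedding-based grouping would be a natural next step for larger datasets.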

Cross-evaluator correlation: When multiple evaluators are applied to the same dataset, their scores can be correlated to discover relationships (e.g., hallucinated responses tend to also score low on relevance).
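A Pearson correlation between two evaluators' scores on the same rows is one way to quantify such a relationship. The pearson helper and the score values below are illustrative; the formula is the standard sample correlation coefficient.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-row scores from two evaluators applied to the same five examples.
hallucination = [0.9, 0.1, 0.8, 0.2, 0.7]  # direction: minimize
relevance = [0.2, 0.9, 0.3, 0.8, 0.4]      # direction: maximize

r = pearson(hallucination, relevance)
```

A strongly negative r here matches the example in the text: rows with high hallucination scores tend to have low relevance scores.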

Temporal tracking: By storing evaluation results over time (with timestamps and model version metadata), teams can track quality trends across deployments.
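A minimal sketch of version-level tracking, assuming each stored result carries a timestamp and a model version; the record layout, field names, and values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up stored results; ISO 8601 timestamps in one format sort chronologically as strings.
records = [
    {"timestamp": "2026-02-01T09:00:00Z", "model_version": "v2", "score": 0.9},
    {"timestamp": "2026-01-05T09:00:00Z", "model_version": "v1", "score": 0.6},
    {"timestamp": "2026-01-06T09:00:00Z", "model_version": "v1", "score": 0.7},
    {"timestamp": "2026-02-02T09:00:00Z", "model_version": "v2", "score": 0.8},
]

# Group scores by version, visiting records in chronological order so the
# resulting dict lists versions in the order they first appeared.
by_version = defaultdict(list)
for rec in sorted(records, key=lambda r: r["timestamp"]):
    by_version[rec["model_version"]].append(rec["score"])

trend = {version: mean(scores) for version, scores in by_version.items()}
```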

Interpreting Score Direction

Direction | Higher Score Means | Lower Score Means | Example Metrics
"maximize" | Better quality | Worse quality | Accuracy, relevance, fluency
"minimize" | Worse quality | Better quality | Error rate, hallucination frequency, toxicity

When aggregating scores from multiple evaluators with different directions, normalize scores to a common scale or separate the analysis by direction.
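One possible normalization, assuming every score has known lower and upper bounds, is to rescale to [0, 1] and flip "minimize" scores so that higher always means better. The to_maximize helper and its lo/hi parameters are hypothetical.

```python
def to_maximize(score: float, direction: str, lo: float = 0.0, hi: float = 1.0) -> float:
    """Rescale `score` from [lo, hi] to [0, 1] so that higher is always better."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if not lo <= score <= hi:
        raise ValueError("score lies outside the assumed bounds")
    scaled = (score - lo) / (hi - lo)
    if direction == "maximize":
        return scaled
    if direction == "minimize":
        return 1.0 - scaled  # flip so a low error rate becomes a high quality value
    raise ValueError(f"unknown direction: {direction!r}")


# Made-up examples: an error rate of 0.1 and an accuracy of 0.8.
error_rate_quality = to_maximize(0.1, "minimize")
accuracy_quality = to_maximize(0.8, "maximize")
```

This only works when the bounds are meaningful; for unbounded metrics, analyzing each direction separately is the safer option.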
