Principle:Arize ai Phoenix Evaluation Result Analysis
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Result Analysis, Observability |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation result analysis is the practice of interpreting, aggregating, and acting upon the structured scores produced by an evaluation pipeline to derive actionable insights about LLM application quality.
Description
Running evaluators produces raw Score objects (or their JSON-serialized representations in augmented DataFrames). These results are only valuable if they can be systematically analyzed to answer questions such as:
- What percentage of responses were classified as relevant?
- What is the average quality score across the dataset?
- Which specific examples failed evaluation, and why?
- How do scores differ between model versions or prompt iterations?
- Are there patterns in the types of failures (e.g., does hallucination correlate with long context)?
Evaluation result analysis provides the concepts and techniques for extracting these insights from the structured output of Phoenix evaluation pipelines.
Usage
Use evaluation result analysis when you need to:
- Compute aggregate metrics (mean, median, distribution) across evaluation scores.
- Filter and inspect individual evaluation failures and their explanations.
- Compare evaluation results across different experimental conditions.
- Feed evaluation results into monitoring dashboards or alerting systems.
- Generate reports summarizing LLM application quality over time.
Theoretical Basis
Score Data Model
Every evaluation result is represented as a Score dataclass (frozen and immutable once created):
Score
+-- name: Optional[str] # Evaluator/metric name
+-- score: Optional[float|int] # Numeric value (None for label-only)
+-- label: Optional[str] # Categorical classification label
+-- explanation: Optional[str] # LLM-generated reasoning
+-- metadata: Dict[str, Any] # Arbitrary metadata (model, trace_id, etc.)
+-- kind: "human"|"llm"|"code" # Source of the evaluation
+-- direction: "maximize"|"minimize" # How to interpret the score
The direction field is critical for analysis: a score of 0.1 is good when direction is "minimize" (e.g., error rate) but bad when direction is "maximize" (e.g., accuracy).
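The effect of direction on interpretation can be illustrated with a minimal sketch. ScoreSketch below is a hypothetical stand-in that mirrors the fields listed above; it is not the Phoenix Score class, and its default values and the is_improvement helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class ScoreSketch:
    """Hypothetical stand-in mirroring the Score fields (not the Phoenix class)."""

    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: str = "llm"            # default chosen for illustration only
    direction: str = "maximize"  # default chosen for illustration only


def is_improvement(old: ScoreSketch, new: ScoreSketch) -> Optional[bool]:
    """Return True if `new` is better than `old`, honoring direction.

    Returns None when the two cannot be compared (a numeric value is
    missing or the directions differ).
    """
    if old.score is None or new.score is None or old.direction != new.direction:
        return None
    if new.direction == "minimize":
        return new.score < old.score
    return new.score > old.score


error_before = ScoreSketch(name="error_rate", score=0.3, direction="minimize")
error_after = ScoreSketch(name="error_rate", score=0.1, direction="minimize")
print(is_improvement(error_before, error_after))  # lower is better for "minimize"
```

Returning None for incomparable pairs keeps label-only scores from being silently treated as regressions.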
DataFrame Result Structure
After calling evaluate_dataframe(), the augmented DataFrame contains:
| Column Pattern | Content | Analysis Use |
|---|---|---|
| {evaluator.name}_execution_details | JSON with status, exceptions, execution_time_sec | Identify failures, measure latency, compute success rates |
| {score.name}_score | JSON-serialized Score.to_dict() | Extract numeric scores, labels, explanations for aggregation |
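Because both column families hold JSON, a typical first step is to decode each cell before analysis. The sketch below uses plain dictionaries standing in for DataFrame rows (for example, as returned by pandas df.to_dict("records")). The column names relevance_execution_details and relevance_score, the sample values, and the parse_cell helper are hypothetical; they only follow the patterns in the table.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical rows standing in for the augmented DataFrame.
rows: List[Dict[str, Any]] = [
    {
        "input": "What is Phoenix?",
        "relevance_execution_details": json.dumps(
            {"status": "success", "exceptions": [], "execution_time_sec": 0.8}
        ),
        "relevance_score": json.dumps(
            {
                "name": "relevance",
                "score": 1.0,
                "label": "relevant",
                "explanation": "Directly answers the question.",
            }
        ),
    },
    {
        "input": "Summarize the doc.",
        "relevance_execution_details": json.dumps(
            {"status": "failed", "exceptions": ["RateLimitError"], "execution_time_sec": 0.1}
        ),
        "relevance_score": None,  # no score was produced for this row
    },
]


def parse_cell(cell: Any) -> Optional[Dict[str, Any]]:
    """Decode a JSON cell; pass dicts through; return None for missing values."""
    if cell is None:
        return None
    if isinstance(cell, dict):
        return cell
    if isinstance(cell, str):
        return json.loads(cell)
    return None  # anything else (e.g. a float NaN) is treated as missing


parsed = [
    {
        "details": parse_cell(row["relevance_execution_details"]),
        "score": parse_cell(row["relevance_score"]),
    }
    for row in rows
]
print(parsed[0]["score"]["label"], parsed[1]["details"]["status"])
```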
Analysis Dimensions
Evaluation results can be analyzed along several dimensions:
Aggregate statistics: Compute means, medians, standard deviations, and percentiles of numeric scores to understand overall quality levels.
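As a minimal sketch of aggregate statistics, the snippet below summarizes a list of numeric scores with the Python standard library. The sample values and the summarize helper are hypothetical, and label-only results (score of None) are dropped before aggregation.

```python
import statistics
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """Summarize numeric scores, ignoring entries without a numeric value."""
    values = [s for s in scores if s is not None]
    if len(values) < 2:
        raise ValueError("need at least two numeric scores to summarize")
    quartiles = statistics.quantiles(values, n=4)  # [Q1, Q2, Q3]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "p25": quartiles[0],
        "p75": quartiles[2],
    }


# Hypothetical scores; None marks a label-only result.
summary = summarize([1.0, 0.5, None, 0.75, 0.25, 1.0])
print(summary)
```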
Label distribution: For classification evaluators, count the frequency of each label to understand the distribution of outcomes (e.g., 80% relevant, 15% somewhat relevant, 5% not relevant).
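A label distribution like the one in the example can be computed with a simple counter. The labels below are made-up data chosen to reproduce the 80/15/5 split, and label_distribution is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Optional


def label_distribution(labels: List[Optional[str]]) -> Dict[str, float]:
    """Return the fraction of rows carrying each label, skipping missing labels."""
    present = [label for label in labels if label is not None]
    if not present:
        return {}
    counts = Counter(present)
    return {label: count / len(present) for label, count in counts.items()}


# Hypothetical labels: 16 relevant, 3 somewhat relevant, 1 not relevant.
labels = ["relevant"] * 16 + ["somewhat relevant"] * 3 + ["not relevant"]
distribution = label_distribution(labels)
for label, fraction in distribution.items():
    print(f"{label}: {fraction:.0%}")
```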
Failure analysis: Filter rows where execution_details["status"] != "success" to identify and debug evaluation failures. Examine exceptions for root causes (rate limits, invalid inputs, model errors).
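The filter described above might look like the following sketch. It assumes each execution-details cell is a JSON string, that exceptions is a list of strings, and that the status literal is "success" as in the text; the sample cells and the failure_report helper are hypothetical.

```python
import json
from collections import Counter
from typing import Counter as CounterType, List, Tuple

# Hypothetical execution-details cells, one per evaluated row.
details_column = [
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.7}),
    json.dumps({"status": "failed", "exceptions": ["RateLimitError"], "execution_time_sec": 0.1}),
    json.dumps({"status": "success", "exceptions": [], "execution_time_sec": 0.9}),
    json.dumps(
        {"status": "failed", "exceptions": ["RateLimitError", "ValueError"], "execution_time_sec": 0.2}
    ),
]


def failure_report(cells: List[str]) -> Tuple[List[int], CounterType[str], float]:
    """Return failed row indices, exception counts, and the success rate."""
    parsed = [json.loads(cell) for cell in cells]
    failed = [i for i, d in enumerate(parsed) if d.get("status") != "success"]
    causes = Counter(exc for i in failed for exc in parsed[i].get("exceptions", []))
    success_rate = 1 - len(failed) / len(parsed) if parsed else 0.0
    return failed, causes, success_rate


failed_rows, causes, success_rate = failure_report(details_column)
print(failed_rows, causes.most_common(1), success_rate)
```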
Explanation mining: When include_explanation=True, the LLM provides reasoning for its classification. These explanations can be aggregated, clustered, or manually reviewed to understand why certain outputs fail evaluation criteria.
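One lightweight way to aggregate explanations is to count which terms recur across failing rows. This is a rough sketch, not a clustering method: the explanations, the stopword list, and the top_terms helper are all hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "that", "is", "not", "in", "from", "does", "of", "to"}


def top_terms(explanations: List[str], k: int = 3) -> List[Tuple[str, int]]:
    """Count how many explanations mention each non-stopword term."""
    counts: Counter = Counter()
    for text in explanations:
        # Use a set so a term counts once per explanation.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(tokens - STOPWORDS)
    return counts.most_common(k)


# Hypothetical explanations attached to rows labeled as hallucinated.
failing_explanations = [
    "The answer cites a date that is not in the context.",
    "The response invents a source missing from the context.",
    "The context does not mention the pricing claim.",
]
terms = top_terms(failing_explanations)
print(terms[0])
```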
Cross-evaluator correlation: When multiple evaluators are applied to the same dataset, their scores can be correlated to discover relationships (e.g., hallucinated responses tend to also score low on relevance).
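A simple way to quantify such a relationship is the Pearson correlation between two evaluators' numeric scores on the same rows. The sketch below implements it directly; the score lists are invented, and the hallucination evaluator is assumed to emit 1 for a hallucinated response and 0 otherwise.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-row scores from two evaluators on the same dataset.
hallucination = [1, 0, 0, 1, 0, 1]              # 1 = hallucinated (assumed encoding)
relevance = [0.2, 0.9, 0.8, 0.3, 0.7, 0.1]      # higher = more relevant
r = pearson(hallucination, relevance)
print(round(r, 2))  # strongly negative for this sample
```

A strongly negative coefficient here would support the example in the text, but correlation on a small sample is only a hint about where to look, not evidence of a causal link.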
Temporal tracking: By storing evaluation results over time (with timestamps and model version metadata), teams can track quality trends across deployments.
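A minimal sketch of trend tracking groups stored results by model version in chronological order. The record layout (run_date, model_version, score), the sample values, and the mean_by_version helper are assumptions for illustration; they are not a Phoenix storage schema.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Any, Dict, List

# Hypothetical stored results, deliberately out of chronological order.
records: List[Dict[str, Any]] = [
    {"run_date": date(2026, 1, 19), "model_version": "v2", "score": 1.0},
    {"run_date": date(2026, 1, 5), "model_version": "v1", "score": 0.5},
    {"run_date": date(2026, 1, 12), "model_version": "v2", "score": 0.75},
    {"run_date": date(2026, 1, 6), "model_version": "v1", "score": 0.75},
]


def mean_by_version(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average score per model version, ordered by each version's first run date."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["run_date"]):
        grouped[row["model_version"]].append(row["score"])
    return {version: mean(scores) for version, scores in grouped.items()}


trend = mean_by_version(records)
for version, avg in trend.items():
    print(version, avg)
```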
Interpreting Score Direction
| Direction | Higher Score Means | Lower Score Means | Example Metrics |
|---|---|---|---|
| "maximize" | Better quality | Worse quality | Accuracy, relevance, fluency |
| "minimize" | Worse quality | Better quality | Error rate, hallucination frequency, toxicity |
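When aggregating scores from multiple evaluators with different directions, either convert them to a common scale and orientation (for example, invert "minimize" scores so that higher always means better) or analyze each direction separately. Averaging raw scores with mixed directions produces a number with no clear interpretation.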
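One possible normalization, sketched below, rescales each score to the range 0 to 1 and flips "minimize" scores so that higher always means better. It assumes every metric has known lower and upper bounds; the to_higher_is_better helper and its lo and hi parameters are hypothetical.

```python
def to_higher_is_better(
    score: float, direction: str, lo: float = 0.0, hi: float = 1.0
) -> float:
    """Rescale `score` from [lo, hi] to [0, 1], flipping "minimize" metrics."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (score - lo) / (hi - lo)
    if direction == "maximize":
        return scaled
    if direction == "minimize":
        return 1.0 - scaled
    raise ValueError(f"unknown direction: {direction!r}")


# Hypothetical scores from evaluators with different directions and ranges.
accuracy = to_higher_is_better(0.8, "maximize")
error_rate = to_higher_is_better(0.1, "minimize")
toxicity = to_higher_is_better(2, "minimize", lo=0, hi=10)  # assumed 0-10 scale
print(accuracy, error_rate, toxicity)
```

After this conversion the values can be averaged or compared directly, at the cost of hiding the original units, so it is worth keeping the raw scores alongside the normalized ones.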