Implementation:Arize ai Phoenix Score Dataclass

Knowledge Sources	Phoenix
Domains	LLM Evaluation, Result Analysis, Observability
Last Updated	2026-02-14 00:00 GMT

Overview

A concrete tool, provided by arize-phoenix-evals, for representing, serializing, and analyzing individual evaluation results, together with patterns for working with evaluation results stored in augmented DataFrames.

Description

The Score dataclass is the atomic unit of evaluation output in Phoenix. Every evaluator produces one or more Score instances per input record. Each instance encodes the evaluation result (numeric score, categorical label, free-text explanation), metadata about the evaluation source, and the optimization direction. After batch evaluation via evaluate_dataframe(), scores are serialized to JSON and stored in DataFrame columns alongside execution details. This page documents the Score API and the patterns for extracting, aggregating, and analyzing these results.

Usage

Use the Score dataclass and DataFrame analysis patterns when you need to:

Inspect individual evaluation results programmatically.
Serialize scores to JSON for storage, transmission, or dashboard rendering.
Extract scores from augmented DataFrames for aggregate statistics.
Filter evaluation results by status, label, score threshold, or metadata.
Pretty-print scores for debugging during development.

Code Reference

Source Location

Repository: Phoenix
File: packages/phoenix-evals/src/phoenix/evals/evaluators.py (lines 133-245)

Signature

@dataclass(frozen=True, init=False)
class Score:
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: Optional[KindType] = None        # "human", "llm", or "code"
    direction: DirectionType = "maximize"   # "maximize" or "minimize"

    def __init__(
        self,
        *,
        name: Optional[str] = None,
        score: Optional[Union[float, int]] = None,
        label: Optional[str] = None,
        explanation: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None,
        direction: DirectionType = "maximize",
        kind: Optional[KindType] = None,
    ) -> None

Import

from phoenix.evals import Score

I/O Contract

Inputs (Constructor)

Name	Type	Required	Description
name	`Optional[str]`	No	Identifier for the score, typically matching the evaluator name. Used as the column name prefix in augmented DataFrames.
score	`Optional[Union[float, int]]`	No	Numeric evaluation value. `None` for label-only classifications.
label	`Optional[str]`	No	Categorical classification label (e.g., `"relevant"`, `"positive"`).
explanation	`Optional[str]`	No	Free-text reasoning from the LLM or evaluator logic explaining the score.
metadata	`Optional[Dict[str, Any]]`	No (default `None`, stored as `{}`)	Arbitrary metadata dictionary. Common entries include `"model"`, `"trace_id"`, `"confidence"`.
kind	`Optional[KindType]`	No	Source type of the evaluation: `"human"`, `"llm"`, or `"code"`.
direction	`DirectionType`	No (default `"maximize"`)	Score optimization direction: `"maximize"` (higher is better) or `"minimize"` (lower is better).

Outputs

Name	Type	Description
Score instance	`Score`	A frozen (immutable) dataclass instance containing all evaluation result data.
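
Because the instance is frozen, a field cannot be reassigned after construction. The sketch below illustrates that behavior with a hypothetical stand-in class, `ScoreSketch`, which carries only a subset of the documented fields and is not the real `Score`. It assumes the standard `dataclasses` semantics apply: assignment raises `FrozenInstanceError`, and `dataclasses.replace` builds a modified copy.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Union


# Hypothetical stand-in with a subset of the documented fields; not the real Score.
@dataclass(frozen=True)
class ScoreSketch:
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    explanation: Optional[str] = None


original = ScoreSketch(name="accuracy", score=0.85)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    original.score = 0.9  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace returns a new instance with the selected fields changed,
# leaving the original untouched.
revised = replace(original, score=0.9, explanation="Re-graded after review.")
```

Deriving a new instance rather than mutating in place keeps previously stored results stable, which matters when the same score is referenced from several places.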

Key Methods

Method	Signature	Description
to_dict	`to_dict() -> Dict[str, Any]`	Converts the Score to a dictionary, excluding fields with `None` values. Used for JSON serialization in DataFrame columns.
pretty_print	`pretty_print(indent: int = 2) -> None`	Prints the Score as formatted JSON to stdout. Useful for debugging during development.
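
The `None`-dropping behavior described for `to_dict` can be approximated as follows. This is a minimal sketch using a hypothetical stand-in class and hypothetical helper names (`ScoreSketch`, `to_dict_sketch`, `to_json_sketch`), not the library's implementation. In particular, whether the real method keeps an empty `metadata` dictionary is not confirmed here; this sketch keeps it because `{}` is not `None`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, Union


# Hypothetical stand-in mirroring the documented fields; not the real Score.
@dataclass(frozen=True)
class ScoreSketch:
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: Optional[str] = None
    direction: str = "maximize"


def to_dict_sketch(score: ScoreSketch) -> Dict[str, Any]:
    """Convert to a dict, dropping every field whose value is None."""
    return {key: value for key, value in asdict(score).items() if value is not None}


def to_json_sketch(score: ScoreSketch, indent: int = 2) -> str:
    """Render the None-free dict as indented JSON, as a pretty-printer might."""
    return json.dumps(to_dict_sketch(score), indent=indent)


# A label-only score: "score" and "explanation" are None and so are omitted.
label_only = ScoreSketch(name="sentiment", label="positive", kind="llm")
label_only_dict = to_dict_sketch(label_only)
print(to_json_sketch(label_only))
```

Omitting `None` fields keeps the serialized JSON compact, but consumers must then read keys with `.get()` rather than indexing, as the DataFrame examples on this page do.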

Usage Examples

Creating Score Objects

from phoenix.evals import Score

# Numeric score
accuracy_score = Score(
    name="accuracy",
    score=0.85,
    kind="llm",
    direction="maximize",
)

# Label-only classification
sentiment_score = Score(
    name="sentiment",
    label="positive",
    explanation="The text expresses enthusiasm and satisfaction.",
    metadata={"model": "gpt-4o"},
    kind="llm",
    direction="maximize",
)

# Boolean-style evaluation
has_citation = Score(
    name="has_citation",
    score=1.0,
    label="true",
    explanation="Found 3 citations in the text.",
    kind="code",
    direction="maximize",
)

Serializing Scores

from phoenix.evals import Score

score = Score(
    name="relevance",
    score=0.9,
    label="highly_relevant",
    explanation="The answer directly addresses the question.",
    metadata={"model": "gpt-4o", "confidence": 0.95},
    kind="llm",
    direction="maximize",
)

# Convert to dictionary (None values excluded)
score_dict = score.to_dict()
print(score_dict)
# {
#   "name": "relevance",
#   "score": 0.9,
#   "label": "highly_relevant",
#   "explanation": "The answer directly addresses the question.",
#   "metadata": {"model": "gpt-4o", "confidence": 0.95},
#   "kind": "llm",
#   "direction": "maximize"
# }

# Pretty print for debugging
score.pretty_print()

Extracting Scores from Augmented DataFrames

import json
import pandas as pd

# Assuming results_df is the output of evaluate_dataframe()

# Parse score column into dictionaries
scores = results_df["relevance_score"].apply(json.loads)

# Extract numeric scores
numeric_scores = scores.apply(lambda s: s.get("score"))
print(f"Mean relevance score: {numeric_scores.mean():.3f}")
print(f"Median relevance score: {numeric_scores.median():.3f}")

# Extract labels
labels = scores.apply(lambda s: s.get("label"))
print(f"Label distribution:\n{labels.value_counts()}")
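
The extraction above assumes every cell holds a JSON string with a numeric score. The sketch below shows one defensive variant using plain Python lists in place of a DataFrame column. The sample cells, the `parse_score` helper, and the 0.5 threshold are all hypothetical. It also assumes that a cell may be empty (for example, when an evaluator did not produce a result for that row), which should be checked against real output.

```python
import json
from statistics import mean
from typing import Any, Dict, List, Optional

# Hypothetical column values: JSON strings for most rows, None for a row with
# no result, and one label-only score without a "score" key.
raw_column: List[Optional[str]] = [
    '{"name": "relevance", "score": 0.9, "label": "relevant"}',
    None,
    '{"name": "relevance", "label": "irrelevant"}',
    '{"name": "relevance", "score": 0.4, "label": "irrelevant"}',
]


def parse_score(cell: Optional[str]) -> Dict[str, Any]:
    """Return the parsed score dict, or an empty dict when the cell is not a string."""
    return json.loads(cell) if isinstance(cell, str) else {}


parsed = [parse_score(cell) for cell in raw_column]

# Aggregate only over rows that actually carry a numeric score.
numeric = [p["score"] for p in parsed if p.get("score") is not None]
mean_score = mean(numeric)

# Threshold filter: positions of rows scoring at or above 0.5.
passing_rows = [
    i for i, p in enumerate(parsed)
    if p.get("score") is not None and p["score"] >= 0.5
]

# Label filter: positions of rows labeled "irrelevant".
irrelevant_rows = [i for i, p in enumerate(parsed) if p.get("label") == "irrelevant"]
```

The same `parse_score` helper can be passed to `Series.apply` in place of `json.loads` when working with a DataFrame column.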

Filtering by Execution Status

import json
import pandas as pd

# Parse execution details once. The "success" status string follows this page's
# example; confirm it against the values your installed version emits.
details = results_df["relevance_execution_details"].apply(json.loads)
successful = details.apply(lambda d: d["status"] == "success")

print(f"Success rate: {successful.mean():.1%}")

# Filter to failed rows and report their exceptions, reusing the parsed details
failed_df = results_df[~successful]
for idx, detail in details[~successful].items():
    print(f"Row {idx} failed: {detail['exceptions']}")
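
When many rows fail, a tally by exception type is often more useful than a per-row printout. The following sketch works on plain JSON strings rather than a DataFrame. The `"status"` and `"exceptions"` keys follow the example above, while the sample values and the `"Type: message"` exception format are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical execution-details cells; the values are illustrative only.
raw_details = [
    '{"status": "success", "exceptions": []}',
    '{"status": "failed", "exceptions": ["TimeoutError: request timed out"]}',
    '{"status": "failed", "exceptions": ["TimeoutError: request timed out", "RateLimitError: 429"]}',
]

details = [json.loads(cell) for cell in raw_details]

# How many rows ended in each status.
status_counts = Counter(d["status"] for d in details)

# How often each exception type occurred, taking the text before the first
# colon as the type name (an assumption about the message format).
exception_counts = Counter(
    exc.split(":", 1)[0]
    for d in details
    for exc in d.get("exceptions", [])
)

print(status_counts.most_common())
print(exception_counts.most_common())
```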

Aggregating Across Multiple Evaluators

import json
import pandas as pd

# Parse all score columns
evaluator_names = ["relevance", "hallucination", "toxicity"]

for name in evaluator_names:
    col = f"{name}_score"
    if col in results_df.columns:
        scores = results_df[col].apply(json.loads)
        numeric = scores.apply(lambda s: s.get("score"))
        labels = scores.apply(lambda s: s.get("label"))
        direction = scores.iloc[0].get("direction", "maximize")

        print(f"\n=== {name} (direction: {direction}) ===")
        print(f"  Mean score: {numeric.mean():.3f}")
        print(f"  Std dev:    {numeric.std():.3f}")
        print(f"  Labels:     {labels.value_counts().to_dict()}")

Normalizing Scores with pd.json_normalize

import json
import pandas as pd

# Flatten all score data into a structured DataFrame
score_records = results_df["relevance_score"].apply(json.loads)
scores_flat = pd.json_normalize(score_records.tolist())
scores_flat.index = results_df.index  # keep rows aligned with results_df

# Columns mirror the score keys (name, score, label, explanation, kind, direction);
# nested metadata is flattened into dotted columns such as metadata.model
print(scores_flat.describe())
print(scores_flat["label"].value_counts())

Comparing Model Versions

import json
import pandas as pd

# Assuming results_model_a and results_model_b are evaluate_dataframe() outputs
# from two different model runs
def extract_mean_score(df, evaluator_name):
    scores = df[f"{evaluator_name}_score"].apply(json.loads)
    return scores.apply(lambda s: s.get("score")).mean()

model_a_relevance = extract_mean_score(results_model_a, "relevance")
model_b_relevance = extract_mean_score(results_model_b, "relevance")

print(f"Model A mean relevance: {model_a_relevance:.3f}")
print(f"Model B mean relevance: {model_b_relevance:.3f}")
print(f"Delta: {model_b_relevance - model_a_relevance:+.3f}")
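
A raw delta reads as an improvement only when the score's direction is `"maximize"`. For a `"minimize"` score such as toxicity, a positive delta is a regression. The helper below is a hypothetical sketch (the `improvement` function is not part of Phoenix) that normalizes the sign so that a positive value always means the candidate is better. The numbers are made up for illustration.

```python
def improvement(baseline: float, candidate: float, direction: str = "maximize") -> float:
    """Signed change from baseline to candidate; positive always means better."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"Unknown direction: {direction!r}")
    delta = candidate - baseline
    # For "minimize" scores a decrease is the improvement, so flip the sign.
    return delta if direction == "maximize" else -delta


# Higher relevance is better: the candidate improved.
relevance_gain = improvement(0.72, 0.80, direction="maximize")

# Lower toxicity is better: the candidate regressed.
toxicity_gain = improvement(0.10, 0.25, direction="minimize")

print(f"Relevance improvement: {relevance_gain:+.3f}")
print(f"Toxicity improvement:  {toxicity_gain:+.3f}")
```

The direction can be read from the serialized score itself, as in the multi-evaluator aggregation example, so the comparison does not need a separate lookup table.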

Related Pages

Implements Principle

Principle:Arize_ai_Phoenix_Evaluation_Result_Analysis
