

Implementation:Arize ai Phoenix Score Dataclass

From Leeroopedia
Knowledge Sources
Domains LLM Evaluation, Result Analysis, Observability
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool from arize-phoenix-evals for representing, serializing, and analyzing individual evaluation results, together with patterns for working with evaluation results stored in augmented DataFrames.

Description

The Score dataclass is the atomic unit of evaluation output in Phoenix. Every evaluator produces one or more Score instances per input record. Each instance encodes the evaluation result (numeric score, categorical label, free-text explanation), metadata about the evaluation source, and the optimization direction. After batch evaluation via evaluate_dataframe(), scores are serialized to JSON and stored in DataFrame columns alongside execution details. This page documents the Score API and the patterns for extracting, aggregating, and analyzing these results.

Usage

Use the Score dataclass and DataFrame analysis patterns when you need to:

  • Inspect individual evaluation results programmatically.
  • Serialize scores to JSON for storage, transmission, or dashboard rendering.
  • Extract scores from augmented DataFrames for aggregate statistics.
  • Filter evaluation results by status, label, score threshold, or metadata.
  • Pretty-print scores for debugging during development.
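Filtering by label, score threshold, or metadata can be sketched with boolean masks over the parsed score column. The DataFrame below is a synthetic stand-in built for illustration; the column name "relevance_score" follows the convention used elsewhere on this page, and the payload keys are assumed to match the Score fields. Real output from evaluate_dataframe() may differ in detail.

```python
import json

import pandas as pd

# Synthetic stand-in for an augmented DataFrame (assumption: each cell in
# "relevance_score" holds a JSON-serialized score dictionary).
results_df = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3"],
        "relevance_score": [
            json.dumps({"name": "relevance", "score": 0.9, "label": "relevant",
                        "metadata": {"model": "gpt-4o"}}),
            json.dumps({"name": "relevance", "score": 0.2, "label": "irrelevant",
                        "metadata": {"model": "gpt-4o"}}),
            json.dumps({"name": "relevance", "score": 0.7, "label": "relevant",
                        "metadata": {"model": "gpt-4o-mini"}}),
        ],
    }
)

parsed = results_df["relevance_score"].apply(json.loads)

# One boolean mask per criterion; .get() tolerates keys that were omitted
is_relevant = parsed.apply(lambda s: s.get("label") == "relevant")
above_threshold = parsed.apply(lambda s: (s.get("score") or 0.0) >= 0.8)
from_gpt4o = parsed.apply(lambda s: s.get("metadata", {}).get("model") == "gpt-4o")

# Masks combine with & and | like any other pandas boolean Series
strong_hits = results_df[is_relevant & above_threshold]
gpt4o_rows = results_df[from_gpt4o]

print(strong_hits["question"].tolist())
print(gpt4o_rows["question"].tolist())
```

Keeping each criterion as its own mask makes it easy to combine or negate conditions without re-parsing the JSON.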

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-evals/src/phoenix/evals/evaluators.py (lines 133-245)

Signature

@dataclass(frozen=True, init=False)
class Score:
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: Optional[KindType] = None        # "human", "llm", or "code"
    direction: DirectionType = "maximize"   # "maximize" or "minimize"

    def __init__(
        self,
        *,
        name: Optional[str] = None,
        score: Optional[Union[float, int]] = None,
        label: Optional[str] = None,
        explanation: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None,
        direction: DirectionType = "maximize",
        kind: Optional[KindType] = None,
    ) -> None

Import

from phoenix.evals import Score

I/O Contract

Inputs (Constructor)

Name | Type | Required | Description
name | Optional[str] | No | Identifier for the score, typically matching the evaluator name. Used as the column name prefix in augmented DataFrames.
score | Optional[Union[float, int]] | No | Numeric evaluation value. None for label-only classifications.
label | Optional[str] | No | Categorical classification label (e.g., "relevant", "positive").
explanation | Optional[str] | No | Free-text reasoning from the LLM or evaluator logic explaining the score.
metadata | Optional[Dict[str, Any]] | No (default {}) | Arbitrary metadata dictionary. Common entries include "model", "trace_id", "confidence".
kind | Optional[KindType] | No | Source type of the evaluation: "human", "llm", or "code".
direction | DirectionType | No (default "maximize") | Score optimization direction: "maximize" (higher is better) or "minimize" (lower is better).

Outputs

Name | Type | Description
Score instance | Score | A frozen (immutable) dataclass instance containing all evaluation result data.
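What "frozen" means in practice can be shown with a small stand-in class. DemoScore below is hypothetical and is not the Phoenix class; it only mirrors the @dataclass(frozen=True, init=False) pattern from the signature above, using standard-library behavior. How the real Score populates its fields internally is not shown on this page.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True, init=False)
class DemoScore:
    """Hypothetical stand-in for illustration, not phoenix.evals.Score."""

    name: Optional[str] = None
    score: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __init__(self, *, name=None, score=None, metadata=None):
        # A frozen dataclass blocks ordinary attribute assignment, so a
        # hand-written __init__ has to go through object.__setattr__.
        object.__setattr__(self, "name", name)
        object.__setattr__(self, "score", score)
        object.__setattr__(self, "metadata", dict(metadata or {}))


original = DemoScore(name="accuracy", score=0.85)

try:
    original.score = 0.9  # assignment on a frozen instance
    mutated = True
except FrozenInstanceError:
    mutated = False

# To "change" a value, build a new instance from the old one's fields
updated = DemoScore(name=original.name, score=0.9, metadata=original.metadata)

print(mutated, original.score, updated.score)
```

Immutability means a score that has been handed to downstream code cannot be altered afterwards; corrections are expressed as new instances.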

Key Methods

Method | Signature | Description
to_dict | to_dict() -> Dict[str, Any] | Converts the Score to a dictionary, excluding fields with None values. Used for JSON serialization in DataFrame columns.
pretty_print | pretty_print(indent: int = 2) -> None | Prints the Score as formatted JSON to stdout. Useful for debugging during development.
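The None-exclusion behavior described for to_dict has a practical consequence: keys may be absent after a JSON round trip, so consumers should read them defensively. The sketch below uses a hypothetical stand-in class (DemoScore, not the Phoenix implementation) whose to_dict and pretty_print are written to match the descriptions in the table above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DemoScore:
    """Hypothetical stand-in mirroring the documented method behavior."""

    name: Optional[str] = None
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Drop None-valued fields so the serialized form stays compact
        return {k: v for k, v in asdict(self).items() if v is not None}

    def pretty_print(self, indent: int = 2) -> None:
        print(json.dumps(self.to_dict(), indent=indent))


label_only = DemoScore(name="sentiment", label="positive")

# Round trip through JSON, as would happen for a DataFrame cell
payload = json.dumps(label_only.to_dict())
restored = json.loads(payload)

# "score" was None, so the key is missing rather than null
score_value = restored.get("score")

label_only.pretty_print()
```

This is why the DataFrame examples on this page read fields with s.get("score") rather than s["score"]: indexing would raise KeyError for label-only results.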

Usage Examples

Creating Score Objects

from phoenix.evals import Score

# Numeric score
accuracy_score = Score(
    name="accuracy",
    score=0.85,
    kind="llm",
    direction="maximize",
)

# Label-only classification
sentiment_score = Score(
    name="sentiment",
    label="positive",
    explanation="The text expresses enthusiasm and satisfaction.",
    metadata={"model": "gpt-4o"},
    kind="llm",
    direction="maximize",
)

# Boolean-style evaluation
has_citation = Score(
    name="has_citation",
    score=1.0,
    label="true",
    explanation="Found 3 citations in the text.",
    kind="code",
    direction="maximize",
)

Serializing Scores

from phoenix.evals import Score

score = Score(
    name="relevance",
    score=0.9,
    label="highly_relevant",
    explanation="The answer directly addresses the question.",
    metadata={"model": "gpt-4o", "confidence": 0.95},
    kind="llm",
    direction="maximize",
)

# Convert to dictionary (None values excluded)
score_dict = score.to_dict()
print(score_dict)
# {
#   "name": "relevance",
#   "score": 0.9,
#   "label": "highly_relevant",
#   "explanation": "The answer directly addresses the question.",
#   "metadata": {"model": "gpt-4o", "confidence": 0.95},
#   "kind": "llm",
#   "direction": "maximize"
# }

# Pretty print for debugging
score.pretty_print()

Extracting Scores from Augmented DataFrames

import json
import pandas as pd

# Assuming results_df is the output of evaluate_dataframe()

# Parse score column into dictionaries
scores = results_df["relevance_score"].apply(json.loads)

# Extract numeric scores
numeric_scores = scores.apply(lambda s: s.get("score"))
print(f"Mean relevance score: {numeric_scores.mean():.3f}")
print(f"Median relevance score: {numeric_scores.median():.3f}")

# Extract labels
labels = scores.apply(lambda s: s.get("label"))
print(f"Label distribution:\n{labels.value_counts()}")
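The extraction above assumes every cell contains valid JSON. If some rows hold a null or a malformed value (an assumption worth checking against your own results, for example for rows whose evaluation failed), json.loads raises on the first bad cell. A defensive variant might look like the following; parse_score_cell is a hypothetical helper introduced here and is not part of the Phoenix API.

```python
import json
from typing import Any, Dict, Optional

import pandas as pd


def parse_score_cell(cell: Any) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: parse one score cell, returning None if unusable."""
    if isinstance(cell, dict):
        return cell  # already parsed
    if not isinstance(cell, str):
        return None  # None, NaN, or another non-JSON value
    try:
        parsed_cell = json.loads(cell)
    except json.JSONDecodeError:
        return None
    return parsed_cell if isinstance(parsed_cell, dict) else None


# Synthetic column: one healthy row, one empty row, one corrupt row
column = pd.Series(['{"score": 0.8, "label": "relevant"}', None, "not json"])

parsed = column.apply(parse_score_cell)

# Coerce to float so missing scores become NaN and are skipped by mean()
numeric_scores = pd.to_numeric(
    parsed.apply(lambda s: s.get("score") if isinstance(s, dict) else None),
    errors="coerce",
)

coverage = parsed.notna().mean()  # share of rows with a usable score payload
print(f"Usable rows: {coverage:.1%}, mean score: {numeric_scores.mean():.3f}")
```

Reporting coverage next to the mean makes it visible when an average is based on only a fraction of the rows.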

Filtering by Execution Status

import json
import pandas as pd

# Parse execution details once and reuse the parsed values below
details = results_df["relevance_execution_details"].apply(json.loads)

# Confirm the exact status strings against your installed version
successful = details.apply(lambda d: d["status"] == "success")

print(f"Success rate: {successful.mean():.1%}")

# Filter to failed rows
failed_df = results_df[~successful]

# Report each failure without re-parsing the JSON
for idx, detail in details[~successful].items():
    print(f"Row {idx} failed: {detail['exceptions']}")

Aggregating Across Multiple Evaluators

import json
import pandas as pd

# Parse all score columns
evaluator_names = ["relevance", "hallucination", "toxicity"]

for name in evaluator_names:
    col = f"{name}_score"
    if col in results_df.columns:
        scores = results_df[col].apply(json.loads)
        numeric = scores.apply(lambda s: s.get("score"))
        labels = scores.apply(lambda s: s.get("label"))
        direction = scores.iloc[0].get("direction", "maximize")

        print(f"\n=== {name} (direction: {direction}) ===")
        print(f"  Mean score: {numeric.mean():.3f}")
        print(f"  Std dev:    {numeric.std():.3f}")
        print(f"  Labels:     {labels.value_counts().to_dict()}")
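For reports or further analysis, the same loop can collect its statistics into a summary DataFrame instead of printing them. This is a sketch over synthetic data; the column names follow the "<name>_score" convention used above, and the payloads are assumed to carry "score" and "direction" keys.

```python
import json

import pandas as pd

# Synthetic augmented DataFrame with two of the three evaluators present
results_df = pd.DataFrame(
    {
        "relevance_score": [
            json.dumps({"score": 1.0, "label": "relevant", "direction": "maximize"}),
            json.dumps({"score": 0.0, "label": "irrelevant", "direction": "maximize"}),
        ],
        "toxicity_score": [
            json.dumps({"score": 0.1, "label": "clean", "direction": "minimize"}),
            json.dumps({"score": 0.3, "label": "clean", "direction": "minimize"}),
        ],
    }
)

rows = []
for name in ["relevance", "hallucination", "toxicity"]:
    col = f"{name}_score"
    if col not in results_df.columns:
        continue  # this evaluator was not run
    parsed = results_df[col].apply(json.loads)
    # Coerce so that missing scores become NaN and are ignored by the stats
    numeric = pd.to_numeric(parsed.apply(lambda s: s.get("score")), errors="coerce")
    rows.append(
        {
            "evaluator": name,
            "direction": parsed.iloc[0].get("direction", "maximize"),
            "n": int(numeric.count()),
            "mean": numeric.mean(),
            "std": numeric.std(),
        }
    )

summary = pd.DataFrame(rows).set_index("evaluator")
print(summary)
```

Carrying the direction column alongside the mean helps readers of the table avoid misreading a low toxicity score as a poor result.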

Normalizing Scores with pd.json_normalize

import json
import pandas as pd

# Flatten all score data into a structured DataFrame
score_records = results_df["relevance_score"].apply(json.loads)
scores_flat = pd.json_normalize(score_records)

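# scores_flat typically has columns such as name, score, label, explanation,
# kind, direction, plus one "metadata.<key>" column per metadata entry
# (e.g. metadata.model); keys absent from every record produce no column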
print(scores_flat.describe())
print(scores_flat["label"].value_counts())

Comparing Model Versions

import json
import pandas as pd

# Assuming two result DataFrames from different model runs
def extract_mean_score(df, evaluator_name):
    scores = df[f"{evaluator_name}_score"].apply(json.loads)
    return scores.apply(lambda s: s.get("score")).mean()

model_a_relevance = extract_mean_score(results_model_a, "relevance")
model_b_relevance = extract_mean_score(results_model_b, "relevance")

print(f"Model A mean relevance: {model_a_relevance:.3f}")
print(f"Model B mean relevance: {model_b_relevance:.3f}")
print(f"Delta: {model_b_relevance - model_a_relevance:+.3f}")
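The raw delta above reads correctly only for "maximize" scores. For a "minimize" score such as toxicity, a positive delta is a regression. One way to make comparisons direction-aware is a small helper like the one below; signed_improvement is a hypothetical name introduced here, and the numbers are made-up inputs for illustration.

```python
def signed_improvement(
    baseline: float, candidate: float, direction: str = "maximize"
) -> float:
    """Hypothetical helper: a positive result means the candidate is better."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown direction: {direction!r}")
    delta = candidate - baseline
    # Flip the sign when lower values are the goal
    return delta if direction == "maximize" else -delta


# Illustrative values only
relevance_gain = signed_improvement(0.72, 0.80, "maximize")
toxicity_gain = signed_improvement(0.10, 0.25, "minimize")

print(f"Relevance improvement: {relevance_gain:+.3f}")
print(f"Toxicity improvement:  {toxicity_gain:+.3f}")
```

The direction value can be read from the serialized score itself (the "direction" key), so the helper does not need a separate lookup table per evaluator.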

Related Pages

Implements Principle
