Overview
Concrete tool for representing, serializing, and analyzing individual evaluation results provided by arize-phoenix-evals, along with patterns for working with evaluation results stored in augmented DataFrames.
Description
The Score dataclass is the atomic unit of evaluation output in Phoenix. Every evaluator produces one or more Score instances per input record. Each instance encodes the evaluation result (numeric score, categorical label, free-text explanation), metadata about the evaluation source, and the optimization direction. After batch evaluation via evaluate_dataframe(), scores are serialized to JSON and stored in DataFrame columns alongside execution details. This page documents the Score API and the patterns for extracting, aggregating, and analyzing these results.
Usage
Use the Score dataclass and DataFrame analysis patterns when you need to:
- Inspect individual evaluation results programmatically.
- Serialize scores to JSON for storage, transmission, or dashboard rendering.
- Extract scores from augmented DataFrames for aggregate statistics.
- Filter evaluation results by status, label, score threshold, or metadata.
- Pretty-print scores for debugging during development.
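The filtering use case above can be sketched in plain Python. This is a minimal illustration, assuming each stored cell is a JSON string shaped like the output of Score.to_dict(); the helper name filter_scores and the sample cells are hypothetical and not part of Phoenix.

```python
import json

# Hypothetical sample cells, shaped like serialized Score.to_dict() output.
cells = [
    json.dumps({"name": "relevance", "score": 0.9, "label": "relevant",
                "metadata": {"model": "gpt-4o"}}),
    json.dumps({"name": "relevance", "score": 0.2, "label": "irrelevant",
                "metadata": {"model": "gpt-4o"}}),
    json.dumps({"name": "relevance", "label": "relevant",
                "metadata": {"model": "other-model"}}),
]


def filter_scores(raw_cells, *, label=None, min_score=None, model=None):
    """Return parsed score dicts that satisfy every filter that is not None."""
    kept = []
    for cell in raw_cells:
        record = json.loads(cell)
        if label is not None and record.get("label") != label:
            continue
        score = record.get("score")
        # A label-only result has no score, so it never passes a threshold filter.
        if min_score is not None and (score is None or score < min_score):
            continue
        if model is not None and record.get("metadata", {}).get("model") != model:
            continue
        kept.append(record)
    return kept


relevant = filter_scores(cells, label="relevant")
high = filter_scores(cells, min_score=0.5)
print(len(relevant), len(high))
```

Reading optional fields with .get() matters here because fields that were None at serialization time are absent from the stored dictionary.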
Code Reference
Source Location
- Repository: Phoenix
- File: packages/phoenix-evals/src/phoenix/evals/evaluators.py (lines 133-245)
Signature
@dataclass(frozen=True, init=False)
class Score:
    name: Optional[str] = None
    score: Optional[Union[float, int]] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    kind: Optional[KindType] = None  # "human", "llm", or "code"
    direction: DirectionType = "maximize"  # "maximize" or "minimize"

    def __init__(
        self,
        *,
        name: Optional[str] = None,
        score: Optional[Union[float, int]] = None,
        label: Optional[str] = None,
        explanation: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None,
        direction: DirectionType = "maximize",
        kind: Optional[KindType] = None,
    ) -> None
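The combination of frozen=True, init=False, and a keyword-only constructor in the signature can be illustrated with a small stand-in class. The sketch below is not the Phoenix source: MiniScore is a hypothetical name, it carries only three of the fields, and the use of object.__setattr__ is one common way to write a custom constructor for a frozen dataclass, assumed here for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True, init=False)
class MiniScore:
    """Hypothetical stand-in for illustration; not the Phoenix implementation."""

    name: Optional[str] = None
    score: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __init__(self, *, name=None, score=None, metadata=None):
        # A frozen dataclass blocks normal attribute assignment, so a custom
        # __init__ has to go through object.__setattr__.
        object.__setattr__(self, "name", name)
        object.__setattr__(self, "score", score)
        # Copy the argument (or start empty) so instances never share a dict.
        object.__setattr__(
            self, "metadata", dict(metadata) if metadata is not None else {}
        )


s = MiniScore(name="accuracy", score=0.85)

try:
    s.score = 1.0  # assignment on a frozen instance is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

try:
    MiniScore("accuracy")  # the bare * rejects positional arguments
    positional_ok = True
except TypeError:
    positional_ok = False

print(mutated, positional_ok)
```

Keyword-only arguments keep call sites readable when several optional fields share the same type, and immutability means a result cannot be altered after the evaluator returns it.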
Import
from phoenix.evals import Score
I/O Contract
Inputs (Constructor)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| name | Optional[str] | No | Identifier for the score, typically matching the evaluator name. Used as the column name prefix in augmented DataFrames. |
| score | Optional[Union[float, int]] | No | Numeric evaluation value. None for label-only classifications. |
| label | Optional[str] | No | Categorical classification label (e.g., "relevant", "positive"). |
| explanation | Optional[str] | No | Free-text reasoning from the LLM or evaluator logic explaining the score. |
| metadata | Optional[Dict[str, Any]] | No (default {}) | Arbitrary metadata dictionary. Common entries include "model", "trace_id", "confidence". |
| kind | Optional[KindType] | No | Source type of the evaluation: "human", "llm", or "code". |
| direction | DirectionType | No (default "maximize") | Score optimization direction: "maximize" (higher is better) or "minimize" (lower is better). |
Outputs
| Name | Type | Description |
|------|------|-------------|
| Score instance | Score | A frozen (immutable) dataclass instance containing all evaluation result data. |
Key Methods
| Method | Signature | Description |
|--------|-----------|-------------|
| to_dict | to_dict() -> Dict[str, Any] | Converts the Score to a dictionary, excluding fields with None values. Used for JSON serialization in DataFrame columns. |
| pretty_print | pretty_print(indent: int = 2) -> None | Prints the Score as formatted JSON to stdout. Useful for debugging during development. |
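The None-exclusion behavior described for to_dict, and the indented JSON output described for pretty_print, can be approximated with the standard library. This is a sketch under stated assumptions: drop_none is a hypothetical helper, the sample record is invented, and the snippet does not call Phoenix at all.

```python
import json
from typing import Any, Dict


def drop_none(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the entries whose value is not None."""
    return {key: value for key, value in fields.items() if value is not None}


# Hypothetical label-only result: no numeric score and no explanation.
raw = {
    "name": "sentiment",
    "score": None,
    "label": "positive",
    "explanation": None,
    "metadata": {},
    "kind": "llm",
    "direction": "maximize",
}

compact = drop_none(raw)

# Indented JSON, comparable in spirit to pretty_print(indent=2).
serialized = json.dumps(compact, indent=2)
print(serialized)

# After a round trip the dropped fields are simply absent, so read them with .get().
restored = json.loads(serialized)
print(restored.get("score"))
```

In this sketch an empty metadata dictionary is retained because it is not None; the description above only states that None values are excluded.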
Usage Examples
Creating Score Objects
from phoenix.evals import Score

# Numeric score
accuracy_score = Score(
    name="accuracy",
    score=0.85,
    kind="llm",
    direction="maximize",
)

# Label-only classification
sentiment_score = Score(
    name="sentiment",
    label="positive",
    explanation="The text expresses enthusiasm and satisfaction.",
    metadata={"model": "gpt-4o"},
    kind="llm",
    direction="maximize",
)

# Boolean-style evaluation
has_citation = Score(
    name="has_citation",
    score=1.0,
    label="true",
    explanation="Found 3 citations in the text.",
    kind="code",
    direction="maximize",
)
Serializing Scores
from phoenix.evals import Score

score = Score(
    name="relevance",
    score=0.9,
    label="highly_relevant",
    explanation="The answer directly addresses the question.",
    metadata={"model": "gpt-4o", "confidence": 0.95},
    kind="llm",
    direction="maximize",
)

# Convert to dictionary (None values excluded)
score_dict = score.to_dict()
print(score_dict)
# {
#     "name": "relevance",
#     "score": 0.9,
#     "label": "highly_relevant",
#     "explanation": "The answer directly addresses the question.",
#     "metadata": {"model": "gpt-4o", "confidence": 0.95},
#     "kind": "llm",
#     "direction": "maximize"
# }

# Pretty print for debugging
score.pretty_print()
Extracting Scores from Augmented DataFrames
import json
import pandas as pd
# Assuming results_df is the output of evaluate_dataframe()
# Parse score column into dictionaries
scores = results_df["relevance_score"].apply(json.loads)
# Extract numeric scores
numeric_scores = scores.apply(lambda s: s.get("score"))
print(f"Mean relevance score: {numeric_scores.mean():.3f}")
print(f"Median relevance score: {numeric_scores.median():.3f}")
# Extract labels
labels = scores.apply(lambda s: s.get("label"))
print(f"Label distribution:\n{labels.value_counts()}")
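The same extraction can be written in plain Python, which makes the handling of missing values explicit. The sketch below assumes each cell is a JSON string shaped like Score.to_dict() output and that some rows may hold no value at all; the sample cells are hypothetical.

```python
import json
import statistics
from collections import Counter

# Hypothetical cell values from a "<name>_score" column.
cells = [
    '{"name": "relevance", "score": 0.9, "label": "relevant"}',
    '{"name": "relevance", "score": 0.4, "label": "irrelevant"}',
    '{"name": "relevance", "label": "relevant"}',  # label-only result
    None,  # assumed: a row where no score was stored
]

# Skip empty cells before parsing; json.loads(None) would raise TypeError.
parsed = [json.loads(cell) for cell in cells if cell is not None]

# Label-only results have no "score" key, so they are left out of the mean.
numeric = [p["score"] for p in parsed if p.get("score") is not None]
mean_score = statistics.mean(numeric) if numeric else None

label_counts = Counter(p["label"] for p in parsed if "label" in p)

# Fraction of rows that contributed a numeric score.
coverage = len(numeric) / len(cells)

print(mean_score, dict(label_counts), coverage)
```

Reporting coverage next to the mean helps avoid reading too much into an average that was computed from only a fraction of the rows.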
Filtering by Execution Status
import json
import pandas as pd

# Check execution details for failures
details = results_df["relevance_execution_details"].apply(json.loads)
successful = details.apply(lambda d: d["status"] == "success")
print(f"Success rate: {successful.mean():.1%}")

# Filter to failed rows
failed_df = results_df[~successful]
for idx, row in failed_df.iterrows():
    detail = json.loads(row["relevance_execution_details"])
    print(f"Row {idx} failed: {detail['exceptions']}")
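When many rows fail, grouping the failures by error type is often more useful than printing each one. The following sketch uses only the "status" and "exceptions" keys shown above; the sample values, the "failed" status string, and the assumption that "exceptions" holds a list of strings of the form "ErrorType: message" are all hypothetical.

```python
import json
from collections import Counter

# Hypothetical "<name>_execution_details" cells.
detail_cells = [
    '{"status": "success", "exceptions": []}',
    '{"status": "failed", "exceptions": ["TimeoutError: request timed out"]}',
    '{"status": "failed", "exceptions": ["TimeoutError: request timed out", '
    '"ValueError: bad output"]}',
    '{"status": "success", "exceptions": []}',
]

details = [json.loads(cell) for cell in detail_cells]
status_counts = Counter(d["status"] for d in details)

# For each unsuccessful row, tally the type of the last recorded exception.
last_errors = Counter(
    d["exceptions"][-1].split(":", 1)[0]
    for d in details
    if d["status"] != "success" and d.get("exceptions")
)

success_rate = status_counts["success"] / len(details)
print(f"Success rate: {success_rate:.1%}")
print(dict(last_errors))
```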
Aggregating Across Multiple Evaluators
import json
import pandas as pd

# Parse all score columns
evaluator_names = ["relevance", "hallucination", "toxicity"]
for name in evaluator_names:
    col = f"{name}_score"
    if col in results_df.columns:
        scores = results_df[col].apply(json.loads)
        numeric = scores.apply(lambda s: s.get("score"))
        labels = scores.apply(lambda s: s.get("label"))
        direction = scores.iloc[0].get("direction", "maximize")
        print(f"\n=== {name} (direction: {direction}) ===")
        print(f"  Mean score: {numeric.mean():.3f}")
        print(f"  Std dev: {numeric.std():.3f}")
        print(f"  Labels: {labels.value_counts().to_dict()}")
Normalizing Scores with pd.json_normalize
import json
import pandas as pd
# Flatten all score data into a structured DataFrame
score_records = results_df["relevance_score"].apply(json.loads)
scores_flat = pd.json_normalize(score_records)
# scores_flat now has columns: name, score, label, explanation, metadata.model, kind, direction
print(scores_flat.describe())
print(scores_flat["label"].value_counts())
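The dotted column name metadata.model comes from flattening the nested metadata dictionary. The sketch below shows the idea with a small recursive helper; flatten is a hypothetical function written for illustration and is not how pandas implements json_normalize.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts, carrying the parent key as a prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


row = flatten({
    "name": "relevance",
    "score": 0.9,
    "metadata": {"model": "gpt-4o", "confidence": 0.95},
})
print(sorted(row))
# ['metadata.confidence', 'metadata.model', 'name', 'score']
```

Because metadata is arbitrary, different rows can produce different sets of flattened columns, so it is worth checking which metadata columns exist before relying on them.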
Comparing Model Versions
import json
import pandas as pd

# Assuming two result DataFrames from different model runs
def extract_mean_score(df, evaluator_name):
    scores = df[f"{evaluator_name}_score"].apply(json.loads)
    return scores.apply(lambda s: s.get("score")).mean()

model_a_relevance = extract_mean_score(results_model_a, "relevance")
model_b_relevance = extract_mean_score(results_model_b, "relevance")
print(f"Model A mean relevance: {model_a_relevance:.3f}")
print(f"Model B mean relevance: {model_b_relevance:.3f}")
print(f"Delta: {model_b_relevance - model_a_relevance:+.3f}")
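A positive delta is only an improvement when the score's direction is "maximize"; for a "minimize" score, such as a toxicity rate, lower is better. The helper below is a minimal sketch of a direction-aware comparison. The name is_improvement and the sample numbers are hypothetical, while the two direction values come from the Score definition above.

```python
def is_improvement(baseline: float, candidate: float, direction: str = "maximize") -> bool:
    """Return True when the candidate beats the baseline for this direction."""
    if direction == "maximize":
        return candidate > baseline
    if direction == "minimize":
        return candidate < baseline
    raise ValueError(f"unknown direction: {direction!r}")


# Hypothetical mean scores for two model runs.
print(is_improvement(0.72, 0.81, "maximize"))  # True: a higher relevance is better
print(is_improvement(0.10, 0.15, "minimize"))  # False: a higher toxicity is worse
```

The direction can be read from any parsed score dictionary with .get("direction", "maximize"), as in the aggregation example earlier on this page.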