Implementation:Arize ai Phoenix Annotation DataFrame Analysis
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Data Analysis, Quality Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete patterns for analyzing span annotation DataFrames using pandas operations in combination with the arize-phoenix-client package to derive quality insights from annotation data.
Description
This implementation documents the composable analysis patterns that operate on annotation DataFrames retrieved via client.spans.get_span_annotations_dataframe(). These are user-space patterns (not Phoenix library methods) that leverage standard pandas operations to transform raw annotation data into actionable insights. The patterns include score aggregation, quality filtering, cross-referencing with span data, dataset export, and inter-annotator agreement analysis.
The input to all patterns is a pandas DataFrame with the schema produced by get_span_annotations_dataframe(): indexed by span_id with columns annotation_name, annotator_kind, label, score, explanation, metadata, created_at, and updated_at.
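The schema above can be mimicked with a small synthetic DataFrame, which is handy for trying the patterns without a running Phoenix server. This is a minimal sketch: the helper name `make_sample_annotations` and every value in it are hypothetical, and the column dtypes of a real result may differ.

```python
import pandas as pd


def make_sample_annotations() -> pd.DataFrame:
    """Build a stand-in for the annotations DataFrame (hypothetical data)."""
    now = pd.Timestamp("2026-01-01T00:00:00Z")
    rows = [
        # span_id, annotation_name, annotator_kind, label, score, explanation
        ("span_001", "relevance", "HUMAN", "relevant", 0.9, "on topic"),
        ("span_001", "relevance", "LLM", "relevant", 0.8, "mostly on topic"),
        ("span_002", "relevance", "HUMAN", "irrelevant", 0.2, "off topic"),
        ("span_002", "relevance", "LLM", "relevant", 0.6, "partially on topic"),
        ("span_003", "correctness", "LLM", "correct", 1.0, "matches reference"),
    ]
    df = pd.DataFrame(
        rows,
        columns=[
            "span_id", "annotation_name", "annotator_kind",
            "label", "score", "explanation",
        ],
    )
    # Remaining schema columns, filled with placeholder values.
    df["metadata"] = [{} for _ in range(len(df))]
    df["created_at"] = now
    df["updated_at"] = now
    # A span can carry several annotations, so the span_id index is not unique.
    return df.set_index("span_id")


sample_df = make_sample_annotations()
print(sample_df[["annotation_name", "annotator_kind", "score"]])
```

Because the index repeats for spans with more than one annotation, operations that assume a unique index (for example `reindex`) need a deduplication or aggregation step first.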
Usage
Use these analysis patterns when:
- Computing aggregate quality metrics after running an evaluation pipeline.
- Filtering for low-quality spans that need human review or system remediation.
- Joining annotation scores with span attributes to find correlations between quality and system parameters.
- Comparing human and LLM annotations to validate automated evaluation.
- Exporting curated span subsets as Phoenix datasets for fine-tuning or testing.
Code Reference
Source Location
- Repository: User code (no specific Phoenix source file; these patterns compose Phoenix client APIs with pandas)
- Dependencies: phoenix.client, pandas
Import
import pandas as pd
from phoenix.client import Client
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| annotations_df | pd.DataFrame | Yes | The DataFrame returned by client.spans.get_span_annotations_dataframe(). Indexed by span_id with columns: annotation_name, annotator_kind, label, score, explanation, metadata, created_at, updated_at. |
| spans_df | pd.DataFrame | No | Optional spans DataFrame from client.spans.get_spans_dataframe() for cross-referencing. Used in join operations. |
| threshold | float | No | A score threshold used for quality filtering. Typically between 0.0 and 1.0. |
Outputs
| Name | Type | Description |
|---|---|---|
| score_summary | pd.Series or pd.DataFrame | Aggregated score statistics (mean, median, std) grouped by annotation name or annotator kind. |
| low_quality_spans | pd.DataFrame | Filtered subset of annotations where scores fall below the specified threshold. |
| enriched_df | pd.DataFrame | Joined DataFrame combining annotation data with span attributes. |
| agreement_metrics | float or pd.DataFrame | Inter-annotator agreement scores (correlation coefficients, Cohen's kappa, etc.). |
Usage Examples
Pattern 1: Aggregate Scores by Annotation Name
import pandas as pd
from phoenix.client import Client
client = Client()
# Retrieve annotations
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
)
# Compute mean score per annotation dimension
mean_scores = annotations_df.groupby("annotation_name")["score"].mean()
print(mean_scores)
# annotation_name
# correctness 0.85
# relevance 0.72
# toxicity 0.05
# Full descriptive statistics per annotation
score_stats = annotations_df.groupby("annotation_name")["score"].describe()
print(score_stats)
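The outputs table mentions mean, median, and std grouped by annotation name or annotator kind. A minimal sketch of that aggregation follows, using hypothetical in-memory data in place of a client call; only the column names are taken from the schema described above.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by the client.
annotations_df = pd.DataFrame(
    {
        "annotation_name": ["relevance", "relevance", "relevance", "correctness"],
        "annotator_kind": ["HUMAN", "LLM", "LLM", "LLM"],
        "score": [0.9, 0.8, 0.6, 1.0],
    },
    index=pd.Index(
        ["span_001", "span_001", "span_002", "span_003"], name="span_id"
    ),
)

# One row per (annotation_name, annotator_kind) pair, one column per statistic.
score_summary = (
    annotations_df.groupby(["annotation_name", "annotator_kind"])["score"]
    .agg(["count", "mean", "median", "std"])
)
print(score_summary)
```

Including `count` alongside the other statistics makes it easier to spot groups with too few annotations to trust; a group with a single score has an undefined (NaN) standard deviation.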
Pattern 2: Filter Low-Quality Spans
import pandas as pd
from phoenix.client import Client
client = Client()
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004"],
    project_identifier="my-project",
    include_annotation_names=["quality"],
)
# Find spans with quality scores below 0.5
threshold = 0.5
low_quality = annotations_df[annotations_df["score"] < threshold]
print(f"Found {len(low_quality)} low-quality annotations")
print(f"Affected span IDs: {low_quality.index.unique().tolist()}")
# Get the label distribution for low-quality spans
label_dist = low_quality["label"].value_counts()
print(label_dist)
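The filter above can be wrapped in a small reusable function. The sketch below assumes that some annotations carry a label but no score; the helper name `filter_low_quality` and the sample data are hypothetical.

```python
from typing import Optional

import pandas as pd


def filter_low_quality(
    annotations_df: pd.DataFrame,
    threshold: float,
    annotation_name: Optional[str] = None,
) -> pd.DataFrame:
    """Return annotation rows whose score is strictly below `threshold`."""
    df = annotations_df
    if annotation_name is not None:
        df = df[df["annotation_name"] == annotation_name]
    # NaN < threshold evaluates to False, so unscored rows would be excluded
    # anyway; dropping them explicitly makes that intent visible.
    scored = df.dropna(subset=["score"])
    return scored[scored["score"] < threshold]


# Hypothetical data: span_002 has a label-only annotation (no score).
annotations_df = pd.DataFrame(
    {
        "annotation_name": ["quality", "quality", "quality", "toxicity"],
        "score": [0.3, float("nan"), 0.8, 0.1],
    },
    index=pd.Index(
        ["span_001", "span_002", "span_003", "span_004"], name="span_id"
    ),
)

low_quality = filter_low_quality(annotations_df, 0.5, annotation_name="quality")
print(low_quality.index.unique().tolist())
```

Whether a low score means low quality depends on the annotation: for a dimension such as toxicity, a low score is the desirable outcome, so the direction of the comparison should be chosen per annotation name.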
Pattern 3: Cross-Reference with Span Data
import pandas as pd
from phoenix.client import Client
client = Client()
# Get spans and their annotations
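spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=1000,
)
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df,
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)
# Join annotations with span attributes on span ID.
# The span ID may be exposed as a "context.span_id" column or already be the
# index, depending on how the spans DataFrame was produced.
if "context.span_id" in spans_df.columns:
    spans_indexed = spans_df.set_index("context.span_id")
else:
    spans_indexed = spans_df
# The selected column names are illustrative; adjust them to the columns
# actually present in your spans DataFrame.
enriched = annotations_df.join(spans_indexed[["name", "latency_ms", "status_code"]])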
# Analyze: do slower spans have lower relevance scores?
print(enriched[["score", "latency_ms"]].corr())
# Group by span name to find which operations have the lowest quality
quality_by_operation = enriched.groupby("name")["score"].mean().sort_values()
print(quality_by_operation)
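The join step can be exercised offline. The sketch below uses hypothetical data and assumes the spans DataFrame has `start_time` and `end_time` columns from which a `latency_ms` column can be derived when none is present; it also shows how to count annotations whose span was not found.

```python
import pandas as pd

start = pd.Timestamp("2026-01-01 00:00:00")

# Hypothetical spans DataFrame, indexed by span ID.
spans_df = pd.DataFrame(
    {
        "name": ["retrieve", "generate", "retrieve"],
        "start_time": [start, start, start],
        "end_time": [
            start + pd.Timedelta(milliseconds=250),
            start + pd.Timedelta(milliseconds=2000),
            start + pd.Timedelta(milliseconds=500),
        ],
    },
    index=pd.Index(["span_001", "span_002", "span_003"], name="context.span_id"),
)
# Derive latency in milliseconds from the two timestamps.
spans_df["latency_ms"] = (
    spans_df["end_time"] - spans_df["start_time"]
).dt.total_seconds() * 1000.0

# Hypothetical annotations; span_004 has no matching span row.
annotations_df = pd.DataFrame(
    {"annotation_name": ["relevance"] * 3, "score": [0.9, 0.4, 0.7]},
    index=pd.Index(["span_001", "span_002", "span_004"], name="span_id"),
)

# A left join keeps every annotation row, even without a matching span.
enriched = annotations_df.join(spans_df[["name", "latency_ms"]], how="left")
unmatched = int(enriched["name"].isna().sum())
print(enriched)
print(f"Annotations without a matching span: {unmatched}")
```

Unmatched rows typically appear when the spans query was limited (for example by `limit=1000`) and some annotated spans fell outside the returned window.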
Pattern 4: Export Filtered Spans to a Dataset
import pandas as pd
from phoenix.client import Client
client = Client()
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004", "span_005"],
    project_identifier="my-project",
    include_annotation_names=["correctness"],
)
# Select high-quality spans for a golden dataset
high_quality_span_ids = annotations_df[
    annotations_df["score"] >= 0.9
].index.unique().tolist()
# Retrieve span data and keep only the selected spans.
# The dataset upload step itself is not shown in this pattern.
spans = client.spans.get_spans(
    project_identifier="my-project",
)
golden_spans = [s for s in spans if s["context"]["span_id"] in high_quality_span_ids]
print(f"Selected {len(golden_spans)} high-quality spans for the golden dataset")
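The selection above keeps a span if any of its annotations reaches the threshold. When a span has several annotations (for example one HUMAN and one LLM), a stricter variant keeps it only if every annotation passes. The following is a sketch of that variant on hypothetical data; the span dictionaries only mirror the `s["context"]["span_id"]` access used above and are not a full span payload.

```python
import pandas as pd

# Hypothetical annotations: span_002 has one passing and one failing score.
annotations_df = pd.DataFrame(
    {
        "annotation_name": ["correctness"] * 4,
        "annotator_kind": ["HUMAN", "LLM", "HUMAN", "LLM"],
        "score": [0.95, 0.92, 0.95, 0.40],
    },
    index=pd.Index(
        ["span_001", "span_001", "span_002", "span_002"], name="span_id"
    ),
)

# Lowest score per span: a span qualifies only if its worst score passes.
min_scores = annotations_df.groupby(level="span_id")["score"].min()
golden_ids = set(min_scores[min_scores >= 0.9].index)

# Hypothetical span records shaped like the ones indexed in the pattern above.
spans = [
    {"context": {"span_id": "span_001"}, "name": "generate"},
    {"context": {"span_id": "span_002"}, "name": "generate"},
]
# Set membership keeps the filter fast when there are many spans.
golden_spans = [s for s in spans if s["context"]["span_id"] in golden_ids]
print(f"Selected {len(golden_spans)} span(s): {sorted(golden_ids)}")
```

Which rule is appropriate depends on how much the annotators are trusted: the "any" rule maximizes dataset size, while the "all" rule favors precision.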
Pattern 5: Compare Annotator Agreement
import pandas as pd
from phoenix.client import Client
client = Client()
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)
# Pivot to compare scores from different annotator kinds
pivot = annotations_df.pivot_table(
    values="score",
    index=annotations_df.index,  # span_id
    columns="annotator_kind",
    aggfunc="mean",
)
# Compute correlation between HUMAN and LLM scores
if "HUMAN" in pivot.columns and "LLM" in pivot.columns:
    valid = pivot[["HUMAN", "LLM"]].dropna()
    correlation = valid["HUMAN"].corr(valid["LLM"])
    print(f"Human-LLM score correlation: {correlation:.3f}")
    # Compute mean absolute difference
    mad = (valid["HUMAN"] - valid["LLM"]).abs().mean()
    print(f"Mean absolute difference: {mad:.3f}")
# Label agreement analysis
label_pivot = annotations_df.pivot_table(
    values="label",
    index=annotations_df.index,
    columns="annotator_kind",
    aggfunc="first",
)
if "HUMAN" in label_pivot.columns and "LLM" in label_pivot.columns:
    valid_labels = label_pivot[["HUMAN", "LLM"]].dropna()
    agreement_rate = (valid_labels["HUMAN"] == valid_labels["LLM"]).mean()
    print(f"Label agreement rate: {agreement_rate:.1%}")
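The raw agreement rate does not account for agreement that would occur by chance. Cohen's kappa, listed among the agreement metrics in the outputs table, corrects for this as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected from each annotator's label frequencies. The following is a minimal pandas-only sketch; the helper name `cohens_kappa` and the sample labels are hypothetical, and the inputs are assumed to be two label Series aligned on span_id (such as the HUMAN and LLM columns of a label pivot).

```python
import pandas as pd


def cohens_kappa(labels_a: pd.Series, labels_b: pd.Series) -> float:
    """Cohen's kappa for two label Series aligned on the same index."""
    # Keep only items that both annotators labeled.
    paired = pd.concat([labels_a, labels_b], axis=1, keys=["a", "b"]).dropna()
    if paired.empty:
        return float("nan")
    # Observed agreement: fraction of items with identical labels.
    observed = float((paired["a"] == paired["b"]).mean())
    # Expected agreement: sum over categories of the product of the
    # two annotators' marginal label frequencies.
    categories = sorted(set(paired["a"]) | set(paired["b"]))
    p_a = paired["a"].value_counts(normalize=True).reindex(categories, fill_value=0.0)
    p_b = paired["b"].value_counts(normalize=True).reindex(categories, fill_value=0.0)
    expected = float((p_a * p_b).sum())
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four spans.
span_ids = pd.Index(["span_001", "span_002", "span_003", "span_004"], name="span_id")
human = pd.Series(["relevant", "relevant", "irrelevant", "irrelevant"], index=span_ids)
llm = pd.Series(["relevant", "irrelevant", "irrelevant", "irrelevant"], index=span_ids)

kappa = cohens_kappa(human, llm)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this sample the observed agreement is 0.75 and the expected agreement is 0.5, giving a kappa of 0.5. If scikit-learn is available, `sklearn.metrics.cohen_kappa_score` computes the same quantity and can serve as a cross-check.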