Implementation:Arize ai Phoenix Annotation DataFrame Analysis

Knowledge Sources	Phoenix pandas DataFrame Operations
Domains	AI Observability, Data Analysis, Quality Metrics
Last Updated	2026-02-14 00:00 GMT

Overview

Concrete pandas patterns for analyzing span annotation DataFrames retrieved with the arize-phoenix-client package, used to derive quality insights from annotation data.

Description

This implementation documents the composable analysis patterns that operate on annotation DataFrames retrieved via client.spans.get_span_annotations_dataframe(). These are user-space patterns (not Phoenix library methods) that leverage standard pandas operations to transform raw annotation data into actionable insights. The patterns include score aggregation, quality filtering, cross-referencing with span data, dataset export, and inter-annotator agreement analysis.

The input to all patterns is a pandas DataFrame with the schema produced by get_span_annotations_dataframe(): indexed by span_id with columns annotation_name, annotator_kind, label, score, explanation, metadata, created_at, and updated_at.
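For experimenting with the patterns below without a running Phoenix server, a stand-in DataFrame with the same shape can be built by hand. This is a minimal sketch: the helper name make_sample_annotations and every value in it are hypothetical, and the real client may return additional columns or different dtypes.

```python
import pandas as pd


def make_sample_annotations() -> pd.DataFrame:
    """Build a small hand-made DataFrame mimicking the documented schema.

    Hypothetical helper for local experimentation; all values are invented.
    """
    now = pd.Timestamp("2026-01-01T00:00:00Z")
    rows = [
        # span_id, annotation_name, annotator_kind, label, score, explanation
        ("span_001", "relevance", "HUMAN", "relevant", 0.9, "On topic"),
        ("span_001", "relevance", "LLM", "relevant", 0.8, "Mostly on topic"),
        ("span_002", "relevance", "HUMAN", "irrelevant", 0.2, "Off topic"),
        ("span_002", "relevance", "LLM", "relevant", 0.6, "Partially related"),
        ("span_003", "correctness", "LLM", "correct", 1.0, "Matches reference"),
    ]
    df = pd.DataFrame(
        rows,
        columns=[
            "span_id",
            "annotation_name",
            "annotator_kind",
            "label",
            "score",
            "explanation",
        ],
    )
    df["metadata"] = [{} for _ in range(len(df))]
    df["created_at"] = now
    df["updated_at"] = now
    # The documented schema is indexed by span_id.
    return df.set_index("span_id")


sample_df = make_sample_annotations()
print(sample_df[["annotation_name", "annotator_kind", "score"]])
```

Because span_id is the index and a span can carry several annotations, the index is not unique; the patterns that follow rely on grouping or pivoting rather than on one row per span.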

Usage

Use these analysis patterns when:

Computing aggregate quality metrics after running an evaluation pipeline.
Filtering for low-quality spans that need human review or system remediation.
Joining annotation scores with span attributes to find correlations between quality and system parameters.
Comparing human and LLM annotations to validate automated evaluation.
Exporting curated span subsets as Phoenix datasets for fine-tuning or testing.

Code Reference

Source Location

Repository: User code (no specific Phoenix source file; these patterns compose Phoenix client APIs with pandas)
Dependencies: phoenix.client, pandas

Import

import pandas as pd
from phoenix.client import Client

I/O Contract

Inputs

Name	Type	Required	Description
annotations_df	`pd.DataFrame`	Yes	The DataFrame returned by `client.spans.get_span_annotations_dataframe()`. Indexed by `span_id` with columns: `annotation_name`, `annotator_kind`, `label`, `score`, `explanation`, `metadata`, `created_at`, `updated_at`.
spans_df	`pd.DataFrame`	No	Optional spans DataFrame from `client.spans.get_spans_dataframe()` for cross-referencing. Used in join operations.
threshold	`float`	No	A score threshold used for quality filtering. Typically between 0.0 and 1.0.

Outputs

Name	Type	Description
score_summary	`pd.Series` or `pd.DataFrame`	Aggregated score statistics (mean, median, std) grouped by annotation name or annotator kind.
low_quality_spans	`pd.DataFrame`	Filtered subset of annotations where scores fall below the specified threshold.
enriched_df	`pd.DataFrame`	Joined DataFrame combining annotation data with span attributes.
agreement_metrics	`float` or `pd.DataFrame`	Inter-annotator agreement scores (correlation coefficients, Cohen's kappa, etc.).

Usage Examples

Pattern 1: Aggregate Scores by Annotation Name

import pandas as pd
from phoenix.client import Client

client = Client()

# Retrieve annotations
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
)

# Compute mean score per annotation dimension
mean_scores = annotations_df.groupby("annotation_name")["score"].mean()
print(mean_scores)
# Illustrative output (values depend on your data):
# annotation_name
# correctness    0.85
# relevance      0.72
# toxicity       0.05

# Full descriptive statistics per annotation
score_stats = annotations_df.groupby("annotation_name")["score"].describe()
print(score_stats)
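The Outputs table also mentions grouping by annotator kind and reporting mean, median, and std. A minimal sketch of that variant follows, run on a hand-made stand-in DataFrame (the IDs and scores are invented) so it works without a Phoenix server.

```python
import pandas as pd

# Hand-made stand-in for annotations_df; values are invented for illustration.
annotations_df = pd.DataFrame(
    {
        "annotation_name": ["relevance", "relevance", "relevance", "relevance"],
        "annotator_kind": ["HUMAN", "HUMAN", "LLM", "LLM"],
        "score": [1.0, 0.5, 0.75, 0.25],
    },
    index=pd.Index(["s1", "s2", "s1", "s2"], name="span_id"),
)

# One row per (annotation_name, annotator_kind) pair, one column per statistic.
summary = annotations_df.groupby(["annotation_name", "annotator_kind"])["score"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```

Including count alongside the other statistics makes it easier to spot groups whose mean rests on only a handful of annotations.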

Pattern 2: Filter Low-Quality Spans

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004"],
    project_identifier="my-project",
    include_annotation_names=["quality"],
)

# Find spans with quality scores below 0.5
threshold = 0.5
low_quality = annotations_df[annotations_df["score"] < threshold]
print(f"Found {len(low_quality)} low-quality annotations")
print(f"Affected span IDs: {low_quality.index.unique().tolist()}")

# Get the label distribution for low-quality spans
label_dist = low_quality["label"].value_counts()
print(label_dist)
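Pattern 2 filters at the annotation-row level, so a span scored by several annotators can be flagged on the strength of a single low score. One possible alternative, sketched below under the assumption that averaging across annotators is acceptable for your use case, collapses scores to one value per span before applying the threshold. The helper name low_quality_span_ids and the sample data are hypothetical.

```python
import pandas as pd


def low_quality_span_ids(annotations_df: pd.DataFrame, threshold: float) -> list[str]:
    """Return span IDs whose mean score is below the threshold.

    Hypothetical helper: averages across annotators so each span is judged once.
    Missing scores are skipped by mean(); spans with no score at all are excluded.
    """
    per_span = annotations_df.groupby(level=0)["score"].mean()
    return per_span[per_span < threshold].index.tolist()


# Hand-made stand-in data; values are invented for illustration.
annotations_df = pd.DataFrame(
    {
        "annotator_kind": ["HUMAN", "LLM", "HUMAN", "LLM", "LLM"],
        "score": [0.4, 0.8, 0.2, 0.3, None],
    },
    index=pd.Index(["s1", "s1", "s2", "s2", "s3"], name="span_id"),
)

flagged = low_quality_span_ids(annotations_df, threshold=0.5)
print(flagged)
```

In this sample a row-level filter would also flag s1 because of its 0.4 score, while the per-span mean (0.6) keeps it above the threshold. Which behavior is preferable depends on whether a single dissenting annotator should trigger review.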

Pattern 3: Cross-Reference with Span Data

import pandas as pd
from phoenix.client import Client

client = Client()

# Get spans and their annotations
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=1000,
)

annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df,
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)

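# Join annotations with span attributes on span ID.
# spans_df may carry "context.span_id" as a column, as its index, or both;
# setting the index from the column when it is present covers each case.
if "context.span_id" in spans_df.columns:
    spans_indexed = spans_df.set_index("context.span_id")
else:
    spans_indexed = spans_df
# The selected columns are illustrative; substitute the ones your spans DataFrame exposes.
enriched = annotations_df.join(spans_indexed[["name", "latency_ms", "status_code"]])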

# Analyze: do slower spans have lower relevance scores?
print(enriched[["score", "latency_ms"]].corr())

# Group by span name to find which operations have the lowest quality
quality_by_operation = enriched.groupby("name")["score"].mean().sort_values()
print(quality_by_operation)
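A plain join silently leaves NaN in the span columns when an annotation has no matching span, for example because the spans query hit its limit. The sketch below shows one way to surface that, using pandas merge with its validate and indicator options on hand-made stand-in data (IDs, column names, and values are invented).

```python
import pandas as pd

# Hand-made stand-ins; IDs, column names, and values are invented for illustration.
annotations_df = pd.DataFrame(
    {"annotation_name": ["relevance"] * 4, "score": [0.9, 0.7, 0.3, 0.5]},
    index=pd.Index(["s1", "s1", "s2", "s9"], name="span_id"),
)
spans_indexed = pd.DataFrame(
    {"name": ["retrieve", "generate", "rerank"]},
    index=pd.Index(["s1", "s2", "s3"], name="span_id"),
)

enriched = annotations_df.merge(
    spans_indexed,
    left_index=True,
    right_index=True,
    how="left",              # keep every annotation row
    validate="many_to_one",  # raise if a span ID is duplicated on the spans side
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Annotation rows that found no span data.
unmatched_span_ids = (
    enriched.loc[enriched["_merge"] == "left_only"].index.unique().tolist()
)
print(f"Annotations without span data: {unmatched_span_ids}")
```

Checking the unmatched IDs before computing correlations or group means avoids drawing conclusions from a partially joined table.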

Pattern 4: Export Filtered Spans to a Dataset

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004", "span_005"],
    project_identifier="my-project",
    include_annotation_names=["correctness"],
)

# Select high-quality spans for a golden dataset.
# A set gives constant-time membership checks in the filter below.
high_quality_span_ids = set(
    annotations_df[annotations_df["score"] >= 0.9].index
)

# Retrieve the span data and keep only the high-quality spans.
# Uploading the selection as a Phoenix dataset is a separate step not shown here.
spans = client.spans.get_spans(
    project_identifier="my-project",
)
golden_spans = [s for s in spans if s["context"]["span_id"] in high_quality_span_ids]

print(f"Selected {len(golden_spans)} high-quality spans for the golden dataset")
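Pattern 4 stops at selecting spans. As a hedged sketch of the next step, the code below shapes a selection into an input/output table that could then be uploaded as a dataset. It runs on hand-made data; the column names attributes.input.value and attributes.output.value are assumed to match your spans DataFrame, and the commented-out upload call is an assumption about the client API that should be checked against the Phoenix client documentation before use.

```python
import pandas as pd

# Hand-made stand-ins; column names and values are assumptions for illustration.
spans_df = pd.DataFrame(
    {
        "attributes.input.value": ["q1", "q2", "q3"],
        "attributes.output.value": ["a1", "a2", "a3"],
    },
    index=pd.Index(["s1", "s2", "s3"], name="context.span_id"),
)
annotations_df = pd.DataFrame(
    {"annotation_name": ["correctness"] * 3, "score": [0.95, 0.4, 0.9]},
    index=pd.Index(["s1", "s2", "s3"], name="span_id"),
)

# Span IDs whose correctness score meets the bar.
high_quality_ids = annotations_df.loc[annotations_df["score"] >= 0.9].index.unique()

# Keep only those spans and rename the columns to dataset-friendly names.
golden_df = (
    spans_df.loc[spans_df.index.isin(high_quality_ids)]
    .rename(
        columns={
            "attributes.input.value": "input",
            "attributes.output.value": "output",
        }
    )
    .reset_index(drop=True)
)
print(golden_df)

# Assumed upload step (verify the method name and parameters for your client version):
# client.datasets.create_dataset(
#     name="golden-correctness",
#     dataframe=golden_df,
#     input_keys=["input"],
#     output_keys=["output"],
# )
```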

Pattern 5: Compare Annotator Agreement

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)

# Pivot to compare scores from different annotator kinds
pivot = annotations_df.pivot_table(
    values="score",
    index=annotations_df.index,  # span_id
    columns="annotator_kind",
    aggfunc="mean",
)

# Compute correlation between HUMAN and LLM scores
if "HUMAN" in pivot.columns and "LLM" in pivot.columns:
    valid = pivot[["HUMAN", "LLM"]].dropna()
    correlation = valid["HUMAN"].corr(valid["LLM"])
    print(f"Human-LLM score correlation: {correlation:.3f}")

    # Compute mean absolute difference
    mad = (valid["HUMAN"] - valid["LLM"]).abs().mean()
    print(f"Mean absolute difference: {mad:.3f}")

# Label agreement analysis
label_pivot = annotations_df.pivot_table(
    values="label",
    index=annotations_df.index,
    columns="annotator_kind",
    aggfunc="first",
)

if "HUMAN" in label_pivot.columns and "LLM" in label_pivot.columns:
    valid_labels = label_pivot[["HUMAN", "LLM"]].dropna()
    agreement_rate = (valid_labels["HUMAN"] == valid_labels["LLM"]).mean()
    print(f"Label agreement rate: {agreement_rate:.1%}")
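The raw agreement rate does not account for agreement that would occur by chance, which is why the Outputs table lists Cohen's kappa. The pattern above does not compute it, so here is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), on two aligned label Series. The function name cohens_kappa and the sample labels are hypothetical, and the handling of the degenerate single-label case is a convention chosen for this sketch.

```python
import pandas as pd


def cohens_kappa(a: pd.Series, b: pd.Series) -> float:
    """Cohen's kappa for two aligned label Series (same index, no missing values)."""
    # Observed agreement: fraction of items on which the two annotators agree.
    observed = (a == b).mean()
    # Expected chance agreement: sum over labels of the product of marginal frequencies.
    pa = a.value_counts(normalize=True)
    pb = b.value_counts(normalize=True)
    labels = set(pa.index) | set(pb.index)
    expected = sum(pa.get(label, 0.0) * pb.get(label, 0.0) for label in labels)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        # Returning 1.0 is a convention adopted for this sketch.
        return 1.0
    return float((observed - expected) / (1.0 - expected))


# Hand-made labels standing in for valid_labels["HUMAN"] and valid_labels["LLM"].
human = pd.Series(["yes", "yes", "no", "no"], index=["s1", "s2", "s3", "s4"])
llm = pd.Series(["yes", "no", "no", "no"], index=["s1", "s2", "s3", "s4"])

kappa = cohens_kappa(human, llm)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the annotators agree on 3 of 4 items (0.75), but chance agreement given their label frequencies is 0.5, so kappa is 0.5. If scikit-learn is available, sklearn.metrics.cohen_kappa_score offers an established implementation of the same statistic.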

Related Pages

Implements Principle

Principle:Arize_ai_Phoenix_Annotation_Analysis
