Implementation:Arize ai Phoenix Annotation DataFrame Analysis

From Leeroopedia
Revision as of 12:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Arize_ai_Phoenix_Annotation_DataFrame_Analysis.md)
Knowledge Sources
Domains: AI Observability, Data Analysis, Quality Metrics
Last Updated: 2026-02-14 00:00 GMT

Overview

Concrete patterns for analyzing span annotation DataFrames with pandas and the arize-phoenix-client package to derive quality insights from annotation data.

Description

This implementation documents the composable analysis patterns that operate on annotation DataFrames retrieved via client.spans.get_span_annotations_dataframe(). These are user-space patterns (not Phoenix library methods) that leverage standard pandas operations to transform raw annotation data into actionable insights. The patterns include score aggregation, quality filtering, cross-referencing with span data, dataset export, and inter-annotator agreement analysis.

The input to all patterns is a pandas DataFrame with the schema produced by get_span_annotations_dataframe(): indexed by span_id with columns annotation_name, annotator_kind, label, score, explanation, metadata, created_at, and updated_at.
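
To try the patterns without a running Phoenix server, a stand-in DataFrame with the schema described above can be built by hand. This is a minimal sketch with synthetic values; the helper name make_sample_annotations and every row in it are hypothetical, and the real DataFrame returned by the client may differ in dtypes or extra columns.

```python
import pandas as pd


def make_sample_annotations() -> pd.DataFrame:
    """Build a small synthetic stand-in for the annotations DataFrame."""
    now = pd.Timestamp("2026-01-01T00:00:00Z")
    rows = [
        # span_id, annotation_name, annotator_kind, label, score
        ("span_001", "relevance", "HUMAN", "relevant", 0.9),
        ("span_001", "relevance", "LLM", "relevant", 0.8),
        ("span_002", "relevance", "HUMAN", "irrelevant", 0.2),
        ("span_002", "relevance", "LLM", "relevant", 0.6),
        ("span_003", "correctness", "LLM", "correct", 1.0),
    ]
    df = pd.DataFrame(
        rows,
        columns=["span_id", "annotation_name", "annotator_kind", "label", "score"],
    )
    # Remaining columns from the documented schema, filled with placeholders.
    df["explanation"] = None
    df["metadata"] = [{} for _ in range(len(df))]
    df["created_at"] = now
    df["updated_at"] = now
    # The documented schema is indexed by span_id.
    return df.set_index("span_id")


sample_df = make_sample_annotations()
print(sample_df.columns.tolist())
```

Because the index is span_id and a span can carry several annotations, index values repeat; the patterns below rely on that layout.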

Usage

Use these analysis patterns when:

  • Computing aggregate quality metrics after running an evaluation pipeline.
  • Filtering for low-quality spans that need human review or system remediation.
  • Joining annotation scores with span attributes to find correlations between quality and system parameters.
  • Comparing human and LLM annotations to validate automated evaluation.
  • Exporting curated span subsets as Phoenix datasets for fine-tuning or testing.

Code Reference

Source Location

  • Repository: User code (no specific Phoenix source file; these patterns compose Phoenix client APIs with pandas)
  • Dependencies: phoenix.client, pandas

Import

import pandas as pd
from phoenix.client import Client

I/O Contract

Inputs

Name | Type | Required | Description
annotations_df | pd.DataFrame | Yes | The DataFrame returned by client.spans.get_span_annotations_dataframe(). Indexed by span_id with columns: annotation_name, annotator_kind, label, score, explanation, metadata, created_at, updated_at.
spans_df | pd.DataFrame | No | Optional spans DataFrame from client.spans.get_spans_dataframe() for cross-referencing. Used in join operations.
threshold | float | No | A score threshold used for quality filtering. Typically between 0.0 and 1.0.

Outputs

Name | Type | Description
score_summary | pd.Series or pd.DataFrame | Aggregated score statistics (mean, median, std) grouped by annotation name or annotator kind.
low_quality_spans | pd.DataFrame | Filtered subset of annotations where scores fall below the specified threshold.
enriched_df | pd.DataFrame | Joined DataFrame combining annotation data with span attributes.
agreement_metrics | float or pd.DataFrame | Inter-annotator agreement scores (correlation coefficients, Cohen's kappa, etc.).

Usage Examples

Pattern 1: Aggregate Scores by Annotation Name

import pandas as pd
from phoenix.client import Client

client = Client()

# Retrieve annotations
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
)

# Compute mean score per annotation dimension
mean_scores = annotations_df.groupby("annotation_name")["score"].mean()
print(mean_scores)
# annotation_name
# correctness    0.85
# relevance      0.72
# toxicity       0.05

# Full descriptive statistics per annotation
score_stats = annotations_df.groupby("annotation_name")["score"].describe()
print(score_stats)
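
The score_summary output listed in the I/O contract (mean, median, std by annotation name or annotator kind) can be sketched with a two-level groupby. The data below is synthetic and only mimics the documented columns; it is an illustration, not output from a real project.

```python
import pandas as pd

# Synthetic annotations that mimic the documented schema (subset of columns).
df = pd.DataFrame(
    {
        "annotation_name": ["relevance"] * 4 + ["correctness"] * 2,
        "annotator_kind": ["HUMAN", "LLM", "HUMAN", "LLM", "LLM", "LLM"],
        "score": [0.9, 0.8, 0.2, 0.6, 1.0, 0.5],
    },
    index=pd.Index(
        ["span_001", "span_001", "span_002", "span_002", "span_003", "span_004"],
        name="span_id",
    ),
)

# One row per (annotation_name, annotator_kind) pair.
# "count" is included so that statistics over tiny groups are easy to spot.
score_summary = (
    df.groupby(["annotation_name", "annotator_kind"])["score"]
    .agg(["count", "mean", "median", "std"])
)
print(score_summary)
```

Grouping by both keys keeps human and LLM scores separate, which avoids averaging two annotator populations that may use the scale differently.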

Pattern 2: Filter Low-Quality Spans

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004"],
    project_identifier="my-project",
    include_annotation_names=["quality"],
)

# Find spans with quality scores below 0.5
threshold = 0.5
low_quality = annotations_df[annotations_df["score"] < threshold]
print(f"Found {len(low_quality)} low-quality annotations")
print(f"Affected span IDs: {low_quality.index.unique().tolist()}")

# Get the label distribution for low-quality spans
label_dist = low_quality["label"].value_counts()
print(label_dist)
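
A "below threshold" filter only means "low quality" when a higher score is better. For a dimension such as toxicity, where a low score is the good outcome, the comparison has to be flipped. The helper below is a hypothetical sketch (flag_low_quality and its higher_is_better parameter are not part of Phoenix) that makes the direction explicit and filters one annotation dimension at a time.

```python
import pandas as pd


def flag_low_quality(
    df: pd.DataFrame,
    annotation_name: str,
    threshold: float,
    higher_is_better: bool = True,
) -> pd.DataFrame:
    """Return annotations of one dimension that fall on the bad side of the threshold."""
    subset = df[df["annotation_name"] == annotation_name]
    # Label-only annotations have no score; drop them before comparing.
    scored = subset.dropna(subset=["score"])
    if higher_is_better:
        mask = scored["score"] < threshold
    else:
        mask = scored["score"] > threshold
    # Use a positional mask so repeated span_id index values cannot cause alignment issues.
    return scored.loc[mask.to_numpy()]


# Synthetic data for illustration only.
df = pd.DataFrame(
    {
        "annotation_name": ["quality", "quality", "quality", "toxicity", "toxicity"],
        "score": [0.3, 0.8, None, 0.05, 0.7],
    },
    index=pd.Index(["s1", "s2", "s3", "s1", "s2"], name="span_id"),
)

low_quality = flag_low_quality(df, "quality", threshold=0.5)
too_toxic = flag_low_quality(df, "toxicity", threshold=0.5, higher_is_better=False)
print(low_quality.index.tolist(), too_toxic.index.tolist())
```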

Pattern 3: Cross-Reference with Span Data

import pandas as pd
from phoenix.client import Client

client = Client()

# Get spans and their annotations
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=1000,
)

annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df,
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)

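# Join annotations (indexed by span_id) with span attributes.
# If "context.span_id" is a column, move it to the index so the join aligns on span ID;
# if it is already the index, use spans_df unchanged.
if "context.span_id" in spans_df.columns:
    spans_indexed = spans_df.set_index("context.span_id")
else:
    spans_indexed = spans_df
# Assumes spans_df exposes "name", "latency_ms", and "status_code" columns.
enriched = annotations_df.join(spans_indexed[["name", "latency_ms", "status_code"]])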

# Analyze: do slower spans have lower relevance scores?
print(enriched[["score", "latency_ms"]].corr())

# Group by span name to find which operations have the lowest quality
quality_by_operation = enriched.groupby("name")["score"].mean().sort_values()
print(quality_by_operation)
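
The join can be exercised on synthetic data. The sketch below assumes that the spans DataFrame has start_time and end_time columns and derives latency_ms from them in case no latency column is present; those column names, like the data, are assumptions for illustration. It uses merge with validate="many_to_one" so that pandas raises if a span ID appears more than once on the spans side.

```python
import pandas as pd

# Synthetic annotations, indexed by span_id as in the documented schema.
annotations = pd.DataFrame(
    {"annotation_name": ["relevance"] * 3, "score": [0.9, 0.4, 0.7]},
    index=pd.Index(["a", "b", "c"], name="span_id"),
)

# Synthetic spans; start_time/end_time are assumed column names.
start = pd.Timestamp("2026-01-01T00:00:00Z")
spans = pd.DataFrame(
    {
        "context.span_id": ["a", "b", "c", "d"],
        "name": ["retrieve", "generate", "retrieve", "generate"],
        "start_time": [start] * 4,
        "end_time": [start + pd.Timedelta(milliseconds=ms) for ms in (120, 900, 200, 50)],
    }
)
# Derive latency in milliseconds from the two timestamps.
spans["latency_ms"] = (spans["end_time"] - spans["start_time"]).dt.total_seconds() * 1000.0

spans_indexed = spans.set_index("context.span_id")
enriched = annotations.merge(
    spans_indexed[["name", "latency_ms"]],
    left_index=True,
    right_index=True,
    how="left",               # keep every annotation, even if its span is missing
    validate="many_to_one",   # many annotations per span, one row per span
)

score_latency_corr = enriched["score"].corr(enriched["latency_ms"])
quality_by_operation = enriched.groupby("name")["score"].mean().sort_values()
print(quality_by_operation)
```

A left join keeps annotations whose span was not fetched (for example because of a limit on the spans query); their span columns are then NaN and drop out of the correlation.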

Pattern 4: Export Filtered Spans to a Dataset

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003", "span_004", "span_005"],
    project_identifier="my-project",
    include_annotation_names=["correctness"],
)

# Select high-quality spans for a golden dataset.
# A set gives constant-time membership checks in the filter below.
high_quality_span_ids = set(
    annotations_df[annotations_df["score"] >= 0.9].index.unique()
)

# Retrieve span data and keep only the high-quality spans.
# This pattern stops at selection; the filtered spans can then be exported as a Phoenix dataset.
spans = client.spans.get_spans(
    project_identifier="my-project",
)
golden_spans = [s for s in spans if s["context"]["span_id"] in high_quality_span_ids]

print(f"Selected {len(golden_spans)} high-quality spans for the golden dataset")

Pattern 5: Compare Annotator Agreement

import pandas as pd
from phoenix.client import Client

client = Client()

annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span_001", "span_002", "span_003"],
    project_identifier="my-project",
    include_annotation_names=["relevance"],
)

# Pivot to compare scores from different annotator kinds
pivot = annotations_df.pivot_table(
    values="score",
    index=annotations_df.index,  # span_id
    columns="annotator_kind",
    aggfunc="mean",
)

# Compute correlation between HUMAN and LLM scores
if "HUMAN" in pivot.columns and "LLM" in pivot.columns:
    valid = pivot[["HUMAN", "LLM"]].dropna()
    correlation = valid["HUMAN"].corr(valid["LLM"])
    print(f"Human-LLM score correlation: {correlation:.3f}")

    # Compute mean absolute difference
    mad = (valid["HUMAN"] - valid["LLM"]).abs().mean()
    print(f"Mean absolute difference: {mad:.3f}")

# Label agreement analysis
label_pivot = annotations_df.pivot_table(
    values="label",
    index=annotations_df.index,
    columns="annotator_kind",
    aggfunc="first",
)

if "HUMAN" in label_pivot.columns and "LLM" in label_pivot.columns:
    valid_labels = label_pivot[["HUMAN", "LLM"]].dropna()
    agreement_rate = (valid_labels["HUMAN"] == valid_labels["LLM"]).mean()
    print(f"Label agreement rate: {agreement_rate:.1%}")
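
The raw agreement rate does not correct for agreement that would occur by chance. Cohen's kappa, named in the I/O contract, does: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if both annotators labeled independently with their own label frequencies. The function below is a minimal pandas-only sketch (the name cohens_kappa is ours, not a Phoenix API), applied to synthetic labels shaped like the HUMAN and LLM columns of the label pivot.

```python
import pandas as pd


def cohens_kappa(a: pd.Series, b: pd.Series) -> float:
    """Cohen's kappa for two label Series aligned on the same index."""
    paired = pd.concat([a, b], axis=1, keys=["a", "b"]).dropna()
    if paired.empty:
        return float("nan")
    # Observed agreement: share of items with identical labels.
    p_o = (paired["a"] == paired["b"]).mean()
    # Expected chance agreement: sum over labels of the product of marginal frequencies.
    freq_a = paired["a"].value_counts(normalize=True)
    freq_b = paired["b"].value_counts(normalize=True)
    p_e = freq_a.mul(freq_b, fill_value=0.0).sum()
    if p_e == 1.0:
        # Both annotators used a single identical label; kappa is undefined.
        return float("nan")
    return float((p_o - p_e) / (1.0 - p_e))


# Synthetic labels for four spans (illustration only).
span_ids = pd.Index(["s1", "s2", "s3", "s4"], name="span_id")
human = pd.Series(["relevant", "relevant", "irrelevant", "irrelevant"], index=span_ids)
llm = pd.Series(["relevant", "relevant", "relevant", "irrelevant"], index=span_ids)

kappa = cohens_kappa(human, llm)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this example the annotators agree on 3 of 4 spans (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. With only a handful of spans, both kappa and the score correlation above are very noisy and should be read as rough indicators.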

Related Pages

Implements Principle
