Principle:Arize ai Phoenix Annotation Analysis
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Data Analysis, Quality Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Annotation analysis is the practice of applying statistical and aggregation operations to span annotation data to derive quality insights, identify problematic patterns, and inform decisions about AI system improvement.
Description
After span annotations have been collected -- whether from human reviewers, LLM judges, or code-based evaluators -- the raw annotation data must be transformed into actionable insights. Annotation analysis encompasses a set of data analysis patterns applied to annotation DataFrames:
- Score Aggregation: Computing mean, median, and distribution statistics for annotation scores grouped by annotation name, project, or time period.
- Label Frequency Analysis: Counting the distribution of categorical labels to understand the prevalence of quality issues (e.g., what fraction of spans are labeled "hallucinated").
- Quality Filtering: Identifying spans that fall below quality thresholds based on annotation scores, enabling targeted review and remediation.
- Cross-Referencing with Span Data: Joining annotation DataFrames with span DataFrames to correlate quality assessments with span attributes (e.g., model version, prompt template, latency).
- Inter-Annotator Agreement: Comparing annotations from different annotator kinds (HUMAN, LLM, CODE) or different identifiers to assess consistency and reliability of evaluation approaches.
- Dataset Export: Using analysis results to create curated datasets for fine-tuning, few-shot prompting, or regression testing.
These patterns are not specific to any single API call but rather represent composable analytical operations that build upon the annotation retrieval step.
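The label frequency pattern listed above can be sketched with pandas as follows. This is a minimal illustration, not Phoenix's own API: the column names (`span_id`, `name`, `label`) and the sample rows are assumptions, and a real annotation DataFrame may use different column names.

```python
import pandas as pd

# Hypothetical annotation rows; column names are assumed for illustration only.
annotations_df = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3", "s4", "s5"],
        "name": ["hallucination"] * 5,
        "label": ["factual", "hallucinated", "factual", "factual", "hallucinated"],
    }
)

# Share of annotations carrying each label, computed separately per annotation name.
label_share = annotations_df.groupby("name")["label"].value_counts(normalize=True)

# Fraction of "hallucination" annotations whose label is "hallucinated".
hallucinated_fraction = label_share.loc[("hallucination", "hallucinated")]
print(label_share)
```

Normalizing within each annotation name keeps the fractions comparable when the DataFrame mixes several annotation names with different row counts.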
Usage
Use annotation analysis when:
- Monitoring model quality across a production deployment by tracking annotation score trends over time.
- Identifying failure modes by filtering for low-scoring spans and examining their attributes.
- Validating evaluation pipelines by comparing LLM judge annotations against human ground truth.
- Building training datasets by selecting high-quality or low-quality spans based on annotation scores.
- Reporting to stakeholders with aggregated quality metrics and visualizations.
- Tuning evaluation criteria by analyzing the distribution and variance of scores.
Theoretical Basis
Annotation analysis applies standard descriptive statistics and data manipulation techniques to the structured output of annotation queries. The core analytical operations can be formalized as:
Score Aggregation
For a set of annotations A grouped by annotation name:
mean_score(name) = sum(a.score for a in A if a.name == name) / count(a for a in A if a.name == name)
This generalizes to any aggregation function (median, standard deviation, percentiles) and can be further grouped by time window, annotator kind, or span attributes.
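As a rough sketch of the aggregation above, a pandas `groupby` can compute several statistics per annotation name in one pass. The column names (`name`, `score`) mirror the notation in the formula and are assumptions about the DataFrame layout; the sample values are made up.

```python
import pandas as pd

# Hypothetical annotation rows; column names follow the a.name / a.score notation.
annotations_df = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
        "name": ["relevance", "relevance", "relevance",
                 "toxicity", "toxicity", "toxicity"],
        "score": [1.0, 0.5, 0.0, 0.0, 0.0, 0.3],
    }
)

# One row per annotation name, one column per aggregation function.
score_summary = annotations_df.groupby("name")["score"].agg(
    ["mean", "median", "std", "count"]
)
print(score_summary)
```

Grouping by additional columns, for example an annotator kind or a time bucket, only requires passing a list of column names to `groupby`.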
Quality Filtering
Given a threshold t, the set of low-quality spans is:
low_quality = {a.span_id for a in A if a.score < t}
These span IDs can then be used to retrieve the full span data for inspection.
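The set expression above maps directly onto a boolean mask in pandas. The sketch below assumes columns named `span_id` and `score`, assumes that higher scores mean better quality, and uses an arbitrary threshold of 0.5; none of these come from the source.

```python
import pandas as pd

# Hypothetical annotation rows; s3 has no score (for example, a label-only annotation).
annotations_df = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3", "s4"],
        "name": ["relevance"] * 4,
        "score": [0.9, 0.2, None, 0.4],
    }
)

threshold = 0.5  # assumed cutoff; choose one that fits the score scale in use

# Comparisons against a missing score evaluate to False, so unscored spans are excluded.
below_threshold = annotations_df[annotations_df["score"] < threshold]
low_quality = set(below_threshold["span_id"])
print(sorted(low_quality))
```

Because unscored rows are silently dropped by the comparison, it can be worth counting them separately so that missing scores are not mistaken for passing ones.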
Inter-Annotator Agreement
For annotations with the same name but different annotator kinds or identifiers, agreement can be measured using:
- Cohen's Kappa for categorical labels between two annotators.
- Pearson/Spearman Correlation for continuous scores between two annotators.
- Krippendorff's Alpha for multiple annotators with potentially missing ratings.
For continuous scores, for example:
agreement = correlation(scores_from_annotator_A, scores_from_annotator_B)
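Two of the agreement measures above can be sketched without extra dependencies: Pearson correlation via pandas, and Cohen's kappa computed by hand from its definition. The column names (`span_id`, `annotator_kind`, `score`, `label`) and the helper `cohens_kappa` are hypothetical, and the sample data is invented for illustration.

```python
import pandas as pd

# Hypothetical paired annotations: each span is rated once by a human and once by an LLM.
annotations_df = pd.DataFrame(
    {
        "span_id": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
        "annotator_kind": ["HUMAN", "LLM"] * 4,
        "score": [1.0, 0.9, 0.0, 0.2, 1.0, 0.8, 0.0, 0.1],
        "label": ["good", "good", "bad", "bad", "good", "bad", "bad", "bad"],
    }
)

# One row per span, one column per annotator kind; pivot requires that each
# (span_id, annotator_kind) pair occurs at most once.
scores = annotations_df.pivot(
    index="span_id", columns="annotator_kind", values="score"
).dropna()
labels = annotations_df.pivot(
    index="span_id", columns="annotator_kind", values="label"
).dropna()

# Pearson correlation for continuous scores (the default method of Series.corr).
pearson_r = scores["HUMAN"].corr(scores["LLM"])


def cohens_kappa(a: pd.Series, b: pd.Series) -> float:
    """Cohen's kappa for two aligned series of categorical labels."""
    observed = (a == b).mean()
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = a.value_counts(normalize=True)
    freq_b = b.value_counts(normalize=True)
    categories = set(a) | set(b)
    expected = sum(freq_a.get(c, 0.0) * freq_b.get(c, 0.0) for c in categories)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


kappa = cohens_kappa(labels["HUMAN"], labels["LLM"])
print(pearson_r, kappa)
```

Dropping spans that lack a rating from either annotator keeps the two series aligned; with many annotators and frequent gaps, a measure designed for missing ratings such as Krippendorff's alpha is the better fit.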
Join Pattern
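The cross-reference between annotations and spans follows a standard inner join on span_id. With pandas DataFrames, and assuming spans_df is indexed by span_id, the join type must be stated explicitly because DataFrame.join defaults to a left join:
enriched = annotations_df.join(spans_df, on="span_id", how="inner")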
This produces a combined view where each annotation row is augmented with the corresponding span attributes (model, latency, input/output tokens, etc.).
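A minimal sketch of this join pattern, using `DataFrame.merge` so that `span_id` can stay an ordinary column on both sides. The span attributes (`model`, `latency_ms`) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical annotations; s9 refers to a span that is absent from spans_df.
annotations_df = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s2", "s9"],
        "name": ["relevance", "relevance", "toxicity", "relevance"],
        "score": [0.9, 0.2, 0.4, 0.7],
    }
)

# Hypothetical span attributes, one row per span.
spans_df = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3"],
        "model": ["model-a", "model-b", "model-a"],
        "latency_ms": [120, 340, 95],
    }
)

# Inner join: keep only annotations whose span exists in spans_df.
# validate="many_to_one" raises if spans_df contains duplicate span_id values.
enriched = annotations_df.merge(
    spans_df, on="span_id", how="inner", validate="many_to_one"
)

# Correlate quality with a span attribute, here the mean score per model.
mean_by_model = enriched.groupby("model")["score"].mean()
print(mean_by_model)
```

An inner join silently drops annotations without a matching span (s9 here); switching to `how="left"` keeps them with missing span attributes, which helps when auditing for orphaned annotations.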