Principle:Arize ai Phoenix Span Annotation
| Knowledge Sources | |
|---|---|
| Domains | AI Observability, Quality Assessment, Span Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Span annotation is the practice of attaching structured quality assessments -- including labels, scores, and explanations -- to individual traced spans in an AI observability system.
Description
In AI observability, a span represents a discrete unit of work within a traced execution (e.g., an LLM call, a retrieval step, a tool invocation). Span annotation enriches these spans with human, automated, or LLM-generated quality judgments. Each annotation associates a span with a named assessment dimension (e.g., "relevance", "toxicity", "correctness") and captures one or more of:
- Label: A categorical classification (e.g., "positive", "irrelevant", "hallucinated").
- Score: A numerical rating on a continuous scale (e.g., 0.0 to 1.0).
- Explanation: A free-text rationale for the assessment.
- Metadata: Arbitrary key-value pairs for additional context.
Annotations are attributed to an annotator kind that indicates the source:
- "HUMAN": Created by a human reviewer through the UI or API.
- "LLM": Generated by an LLM-based evaluation pipeline.
- "CODE": Produced by a deterministic code-based evaluator.
A span can have multiple annotations with different names (e.g., both "relevance" and "toxicity"). The identifier field additionally allows multiple annotations with the same name on the same span, for example one per evaluation pass. Notes are a specialized form of span annotation that allows free-text commentary; their identifiers are auto-generated from timestamps.
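The fields described above can be summarized as a small record type. The following is a minimal sketch, not Phoenix's actual data model: the class name `SpanAnnotation`, the `ANNOTATOR_KINDS` constant, and the rule that at least one of label, score, or explanation must be present are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# The three annotator kinds named in the description.
ANNOTATOR_KINDS = {"HUMAN", "LLM", "CODE"}


@dataclass(frozen=True)
class SpanAnnotation:
    """Hypothetical record type; field names mirror the description above."""

    span_id: str
    name: str                      # assessment dimension, e.g. "relevance"
    annotator_kind: str            # one of ANNOTATOR_KINDS
    label: Optional[str] = None    # categorical classification
    score: Optional[float] = None  # numerical rating
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    identifier: str = ""           # distinguishes same-name annotations

    def __post_init__(self) -> None:
        if self.annotator_kind not in ANNOTATOR_KINDS:
            raise ValueError(f"unknown annotator kind: {self.annotator_kind!r}")
        # Assumption: an annotation must carry at least one result field.
        if self.label is None and self.score is None and self.explanation is None:
            raise ValueError("provide at least one of label, score, explanation")


relevance = SpanAnnotation(
    span_id="span-123",
    name="relevance",
    annotator_kind="LLM",
    label="irrelevant",
    score=0.1,
    explanation="Retrieved document is off-topic.",
)
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch: a changed assessment is modeled as a new record rather than an in-place edit.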
Usage
Use span annotation when:
- Evaluating LLM output quality by attaching sentiment, correctness, or helpfulness assessments to individual generation spans.
- Flagging problematic spans with labels like "hallucinated" or "toxic" for downstream filtering and analysis.
- Building human review workflows where annotators assess span quality through the API after reviewing outputs.
- Running automated evaluation pipelines where LLM judges or code-based heuristics score spans programmatically.
- Adding free-text notes to spans for ad-hoc observations that do not fit structured annotation schemas.
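As an illustration of the flagging use case, the sketch below selects spans whose annotations carry a problematic label. It operates on plain dictionaries rather than on any Phoenix API; the function name `flagged_span_ids` and the record layout are hypothetical.

```python
from typing import Iterable, List, Set


def flagged_span_ids(annotations: Iterable[dict], labels: Set[str]) -> List[str]:
    """Return span IDs with at least one annotation whose label is in `labels`.

    Order of first appearance is preserved and each span ID is listed once.
    """
    seen: Set[str] = set()
    flagged: List[str] = []
    for annotation in annotations:
        span_id = annotation["span_id"]
        if annotation.get("label") in labels and span_id not in seen:
            seen.add(span_id)
            flagged.append(span_id)
    return flagged


# Illustrative data only; these records are not taken from a real trace.
sample = [
    {"span_id": "span-1", "name": "correctness", "label": "hallucinated"},
    {"span_id": "span-2", "name": "correctness", "label": "correct"},
    {"span_id": "span-3", "name": "toxicity", "label": "toxic"},
    {"span_id": "span-1", "name": "toxicity", "label": "toxic"},
]
problem_spans = flagged_span_ids(sample, {"hallucinated", "toxic"})
```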
Theoretical Basis
Span annotation implements a form of data labeling applied to execution traces rather than static datasets. The annotation model follows a multi-dimensional assessment framework:
Annotation = (span_id, annotation_name, annotator_kind, result, identifier)
where:
result = {label: str?, score: float?, explanation: str?}
uniqueness_key = (span_id, annotation_name, identifier)
The uniqueness constraint on (span_id, annotation_name, identifier) means that:
- A new annotation with the same key updates the existing record in place (an upsert) instead of creating a duplicate.
- Using distinct identifiers allows multiple annotations with the same name on the same span (e.g., for multi-rater agreement studies).
- A null identifier is treated as equivalent to an empty string for deduplication purposes.
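The uniqueness rules above can be sketched with an in-memory dictionary keyed by (span_id, annotation_name, identifier). This is an illustrative model of the described behavior, not the storage layer Phoenix uses; `AnnotationStore` and `uniqueness_key` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]


def uniqueness_key(span_id: str, name: str, identifier: Optional[str]) -> Key:
    # A null identifier is normalized to an empty string before comparison.
    return (span_id, name, identifier or "")


class AnnotationStore:
    """Hypothetical in-memory store that upserts on the uniqueness key."""

    def __init__(self) -> None:
        self._records: Dict[Key, dict] = {}

    def upsert(self, span_id: str, name: str, result: dict,
               identifier: Optional[str] = None) -> None:
        # Writing to an existing key replaces the record; a new key adds one.
        self._records[uniqueness_key(span_id, name, identifier)] = dict(result)

    def get(self, span_id: str, name: str,
            identifier: Optional[str] = None) -> Optional[dict]:
        return self._records.get(uniqueness_key(span_id, name, identifier))

    def __len__(self) -> int:
        return len(self._records)


store = AnnotationStore()
store.upsert("span-1", "correctness", {"score": 0.4})                 # null identifier
store.upsert("span-1", "correctness", {"score": 0.9}, identifier="")  # same key: overwrites
store.upsert("span-1", "correctness", {"score": 1.0}, identifier="rater-b")  # distinct key
```

After these three calls the store holds two records: the second call replaced the first because a null identifier and an empty string map to the same key, while `"rater-b"` created a separate annotation with the same name on the same span.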
The annotator kind dimension enables downstream analysis to be stratified by source, allowing comparison of human vs. LLM vs. code-based evaluations on the same spans.
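A minimal sketch of such a stratified comparison follows. It groups scores for one annotation name by annotator kind and averages them; the function name `mean_score_by_kind` and the sample records are assumptions made for illustration, not output from a real system.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def mean_score_by_kind(annotations: Iterable[dict], name: str) -> Dict[str, float]:
    """Average the scores of annotations called `name`, per annotator kind."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for annotation in annotations:
        # Skip other dimensions and annotations that carry no score.
        if annotation["name"] == name and annotation.get("score") is not None:
            buckets[annotation["annotator_kind"]].append(annotation["score"])
    return {kind: mean(scores) for kind, scores in buckets.items()}


# Illustrative records: two spans, each scored by several sources.
records = [
    {"span_id": "span-1", "name": "correctness", "annotator_kind": "HUMAN", "score": 1.0},
    {"span_id": "span-1", "name": "correctness", "annotator_kind": "LLM", "score": 1.0},
    {"span_id": "span-2", "name": "correctness", "annotator_kind": "HUMAN", "score": 0.0},
    {"span_id": "span-2", "name": "correctness", "annotator_kind": "LLM", "score": 0.5},
    {"span_id": "span-2", "name": "correctness", "annotator_kind": "CODE", "score": 0.0},
    {"span_id": "span-2", "name": "toxicity", "annotator_kind": "LLM", "score": 0.9},
]
by_kind = mean_score_by_kind(records, "correctness")
```

A gap between the per-kind averages, as between the human and LLM scores in this sample, is the kind of signal that would prompt a closer look at where the sources disagree.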