Principle:Arize ai Phoenix Span Annotation

From Leeroopedia
Knowledge Sources
Domains: AI Observability, Quality Assessment, Span Evaluation
Last Updated: 2026-02-14 00:00 GMT

Overview

Span annotation is the practice of attaching structured quality assessments (labels, scores, and explanations) to individual traced spans in an AI observability system.

Description

In AI observability, a span represents a discrete unit of work within a traced execution (e.g., an LLM call, a retrieval step, a tool invocation). Span annotation enriches these spans with human, automated, or LLM-generated quality judgments. Each annotation associates a span with a named assessment dimension (e.g., "relevance", "toxicity", "correctness") and captures one or more of:

  • Label: A categorical classification (e.g., "positive", "irrelevant", "hallucinated").
  • Score: A numerical rating on a continuous scale (e.g., 0.0 to 1.0).
  • Explanation: A free-text rationale for the assessment.
  • Metadata: Arbitrary key-value pairs for additional context.

Annotations are attributed to an annotator kind that indicates the source:

  • "HUMAN": Created by a human reviewer through the UI or API.
  • "LLM": Generated by an LLM-based evaluation pipeline.
  • "CODE": Produced by a deterministic code-based evaluator.

A span can have multiple annotations with different names (e.g., both "relevance" and "toxicity"). The identifier field additionally allows multiple annotations with the same name on the same span, for example one per evaluation pass. Notes are a specialized form of span annotation that allows free-text commentary; their identifiers are auto-generated from a timestamp.
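The fields described above can be sketched as a small Python data structure. This is an illustrative model only, not Phoenix's actual schema: the class name SpanAnnotation, the field names, and the validation rules are assumptions derived from the description in this section.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# The three annotator kinds named in this section.
ANNOTATOR_KINDS = {"HUMAN", "LLM", "CODE"}


@dataclass(frozen=True)
class SpanAnnotation:
    """Hypothetical in-memory model of one span annotation."""

    span_id: str
    name: str                      # assessment dimension, e.g. "relevance"
    annotator_kind: str            # "HUMAN", "LLM", or "CODE"
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    identifier: str = ""           # distinguishes same-name annotations on one span

    def __post_init__(self) -> None:
        if self.annotator_kind not in ANNOTATOR_KINDS:
            raise ValueError(f"unknown annotator kind: {self.annotator_kind!r}")
        # Assumption: an annotation must carry at least one piece of content.
        has_content = (
            self.label is not None
            or self.score is not None
            or self.explanation is not None
            or bool(self.metadata)
        )
        if not has_content:
            raise ValueError("annotation needs a label, score, explanation, or metadata")


relevance = SpanAnnotation(
    span_id="span-1",              # made-up span ID for illustration
    name="relevance",
    annotator_kind="LLM",
    label="relevant",
    score=0.9,
)
```

Making the dataclass frozen keeps each annotation an immutable record, which suits a model where a correction is written as a new record rather than mutated in place.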

Usage

Use span annotation when:

  • Evaluating LLM output quality by attaching sentiment, correctness, or helpfulness assessments to individual generation spans.
  • Flagging problematic spans with labels like "hallucinated" or "toxic" for downstream filtering and analysis.
  • Building human review workflows where annotators assess span quality through the API after reviewing outputs.
  • Running automated evaluation pipelines where LLM judges or code-based heuristics score spans programmatically.
  • Adding free-text notes to spans for ad-hoc observations that do not fit structured annotation schemas.
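As a rough illustration of the automated case, a code-based heuristic can be an ordinary function that inspects a span's output and returns an annotation payload. The function name non_empty_evaluator, the annotation name "non_empty", and the payload keys are hypothetical; submitting the payload to an observability backend is out of scope for this sketch.

```python
from typing import Any, Dict


def non_empty_evaluator(span_id: str, output_text: str) -> Dict[str, Any]:
    """Deterministic check: did the span produce any non-whitespace output?"""
    is_empty = not output_text.strip()
    return {
        "span_id": span_id,
        "name": "non_empty",            # hypothetical assessment dimension
        "annotator_kind": "CODE",       # produced by code, not a human or LLM
        "label": "empty" if is_empty else "non_empty",
        "score": 0.0 if is_empty else 1.0,
        "explanation": (
            "Output contained only whitespace."
            if is_empty
            else "Output contained visible text."
        ),
    }


# Made-up span IDs and outputs for illustration.
good = non_empty_evaluator("span-1", "Paris is the capital of France.")
bad = non_empty_evaluator("span-2", "   \n")
```

Because the evaluator is deterministic, re-running it on the same span yields the same label and score, which is what distinguishes the "CODE" kind from LLM-based judgments.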

Theoretical Basis

Span annotation implements a form of data labeling applied to execution traces rather than static datasets. The annotation model follows a multi-dimensional assessment framework:

Annotation = (span_id, annotation_name, annotator_kind, result, identifier)

where:
  result = {label: str?, score: float?, explanation: str?}
  uniqueness_key = (span_id, annotation_name, identifier)

The uniqueness constraint on (span_id, annotation_name, identifier) means that:

  • Writing an annotation whose key matches an existing record is an upsert: the existing record is updated rather than duplicated.
  • Using distinct identifiers allows multiple annotations with the same name on the same span (e.g., for multi-rater agreement studies).
  • A null identifier is treated as equivalent to an empty string for deduplication purposes.
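The three rules above can be demonstrated with a dictionary keyed by the uniqueness key. This is a minimal in-memory sketch, assuming a hypothetical AnnotationStore class; it is not how Phoenix stores annotations, only an illustration of the upsert and null-identifier behavior described here.

```python
from typing import Any, Dict, Optional, Tuple

Key = Tuple[str, str, str]


def uniqueness_key(span_id: str, annotation_name: str, identifier: Optional[str]) -> Key:
    # A null identifier is normalized to "" so both deduplicate to the same key.
    return (span_id, annotation_name, identifier or "")


class AnnotationStore:
    """Hypothetical store illustrating upsert-by-key semantics."""

    def __init__(self) -> None:
        self._records: Dict[Key, Dict[str, Any]] = {}

    def upsert(
        self,
        span_id: str,
        annotation_name: str,
        result: Dict[str, Any],
        identifier: Optional[str] = None,
    ) -> bool:
        """Insert or overwrite a record; return True if a new record was created."""
        key = uniqueness_key(span_id, annotation_name, identifier)
        created = key not in self._records
        self._records[key] = dict(result)
        return created

    def get(
        self, span_id: str, annotation_name: str, identifier: Optional[str] = None
    ) -> Dict[str, Any]:
        return self._records[uniqueness_key(span_id, annotation_name, identifier)]

    def __len__(self) -> int:
        return len(self._records)


store = AnnotationStore()
store.upsert("span-1", "relevance", {"score": 0.2})                 # creates a record
store.upsert("span-1", "relevance", {"score": 0.8}, identifier="")  # same key: overwrites
store.upsert("span-1", "relevance", {"score": 0.5}, identifier="rater-2")  # new key
```

After these three calls the store holds two records for span-1 under the name "relevance": one with the empty identifier (score 0.8) and one for "rater-2" (score 0.5).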

The annotator kind dimension enables downstream analysis to be stratified by source, allowing comparison of human vs. LLM vs. code-based evaluations on the same spans.
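One simple form of such stratified analysis is to group scores by annotator kind and compare the averages. The sketch below uses plain dictionaries and made-up example data; the helper name mean_score_by_kind and the dictionary keys are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterable, List


def mean_score_by_kind(annotations: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Average the scores of annotations, grouped by annotator kind."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for ann in annotations:
        # Annotations without a score (label-only, notes) are skipped.
        if ann.get("score") is not None:
            scores[ann["annotator_kind"]].append(ann["score"])
    return {kind: mean(values) for kind, values in scores.items()}


# Made-up annotations on the same two spans from three sources.
example = [
    {"span_id": "span-1", "name": "correctness", "annotator_kind": "HUMAN", "score": 1.0},
    {"span_id": "span-2", "name": "correctness", "annotator_kind": "HUMAN", "score": 0.0},
    {"span_id": "span-1", "name": "correctness", "annotator_kind": "LLM", "score": 1.0},
    {"span_id": "span-2", "name": "correctness", "annotator_kind": "LLM", "score": 0.5},
    {"span_id": "span-1", "name": "correctness", "annotator_kind": "CODE", "label": "ok"},
]

by_kind = mean_score_by_kind(example)
```

A gap between the per-kind averages on the same spans is a cue to inspect individual disagreements, for example where an LLM judge scores higher than human reviewers.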

Related Pages

Implemented By
