Workflow: Arize AI Phoenix Span Annotation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | AI_Observability, Human_Feedback, LLM_Evaluation |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
End-to-end process for adding human, LLM, or code-based annotations to traced spans for quality assessment and feedback collection.
Description
This workflow covers the span annotation pipeline in Phoenix, which enables attaching structured feedback to individual traced spans. Annotations can be created by human reviewers, LLM-based evaluators, or automated code heuristics. Each annotation includes a name (category), an annotator kind, and a result containing optional label, score, and explanation fields. Annotations can be added individually or in bulk via the Phoenix client SDK, and they are persisted alongside the traces for filtering, aggregation, and analysis in the Phoenix UI.
Key capabilities:
- Three annotator kinds: HUMAN (manual review), LLM (automated evaluation), and CODE (heuristic rules)
- Structured results with optional label (categorical), score (numeric), and explanation fields
- Single annotation and batch annotation APIs
- DataFrame-based annotation queries for analysis
- Integration with the evaluation pipeline for automated annotation at scale
- Annotation visibility in the Phoenix UI alongside trace data
Usage
Execute this workflow when you need to attach quality assessments or feedback to traced LLM interactions. Common scenarios include: human review of LLM outputs for quality assurance, automated evaluation of production traces using LLM-based classifiers, code-based heuristic checks (e.g., output length, format validation), or building labeled datasets from production data for fine-tuning.
Execution Steps
Step 1: Initialize the Phoenix Client
Create a Client instance configured with the Phoenix server endpoint. The client provides access to the spans resource which contains the annotation methods. Ensure the Phoenix server is running and accessible.
Key considerations:
- Default endpoint is http://localhost:6006
- Pass api_key for authenticated deployments
- The client can also be configured via PHOENIX_BASE_URL environment variable
- Annotations are attached to spans identified by their span_id
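A minimal initialization sketch. The endpoint-resolution logic mirrors the defaults described above; the actual client construction is left commented since it requires the `arize-phoenix-client` package and a running server:

```python
import os

# Resolve the Phoenix endpoint: the PHOENIX_BASE_URL environment variable
# takes precedence, falling back to the default local server address.
base_url = os.environ.get("PHOENIX_BASE_URL", "http://localhost:6006")
api_key = os.environ.get("PHOENIX_API_KEY")  # only for authenticated deployments

# With the package installed and a Phoenix server reachable at base_url:
# from phoenix.client import Client
# client = Client(base_url=base_url, api_key=api_key)
```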
Step 2: Add Individual Annotations
Use client.spans.add_span_annotation() to add a single annotation to a specific span. Provide the span_id, annotation name (category), and result fields (label, score, explanation). Specify the annotator kind to indicate whether the annotation is from a human, LLM, or code.
Key considerations:
- The span_id must reference an existing span in Phoenix
- annotation_name categorizes the annotation (e.g., "sentiment", "quality", "relevance")
- annotator_kind must be one of: "HUMAN", "LLM", or "CODE"
- label is a categorical value (e.g., "positive", "negative", "correct", "incorrect")
- score is a numeric value (typically 0.0 to 1.0)
- explanation provides reasoning for the annotation
- metadata stores arbitrary additional context
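A sketch of a single-annotation call. The `span_id` value is a hypothetical placeholder and must reference a real span in Phoenix; the client call itself is commented so the snippet stands alone:

```python
# Single-annotation payload; the span_id below is a hypothetical placeholder.
annotation = {
    "span_id": "0f2a44d1c2b3e4f5",
    "annotation_name": "quality",   # categorizes the annotation
    "annotator_kind": "HUMAN",      # one of "HUMAN", "LLM", "CODE"
    "label": "correct",             # categorical result
    "score": 1.0,                   # numeric result, typically 0.0 to 1.0
    "explanation": "Answer matches the reference document.",
}

# With a configured client:
# client.spans.add_span_annotation(**annotation)
```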
Step 3: Batch Annotate Spans
Use client.spans.log_span_annotations() to add multiple annotations in a single API call. Construct a list of SpanAnnotationData objects, each specifying the target span, annotation name, annotator kind, and result. This is more efficient for large-scale annotation operations.
Key considerations:
- Each SpanAnnotationData object contains: name, span_id, annotator_kind, and result dict
- The result dict contains label, score, and optional explanation
- Batch operations reduce API round-trips compared to individual annotations
- Multiple annotations with different names can be attached to the same span
- This method integrates well with evaluation pipelines that produce results for many spans
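A batch-annotation sketch. The span IDs and evaluation results are hypothetical, and the entries are plain dicts mirroring the SpanAnnotationData fields listed above; depending on the client version, `log_span_annotations` may expect SpanAnnotationData objects instead:

```python
# Hypothetical evaluation results to be logged in one batch call.
eval_results = [
    ("0f2a44d1c2b3e4f5", "relevant", 1.0),
    ("1a2b3c4d5e6f7a8b", "irrelevant", 0.0),
]

# Each entry mirrors the SpanAnnotationData fields: name, span_id,
# annotator_kind, and a result dict with label/score.
batch = [
    {
        "name": "relevance",
        "span_id": span_id,
        "annotator_kind": "LLM",
        "result": {"label": label, "score": score},
    }
    for span_id, label, score in eval_results
]

# With a configured client:
# client.spans.log_span_annotations(span_annotations=batch)
```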
Step 4: Query Annotations
Retrieve annotations using client.spans.get_span_annotations_dataframe() to analyze annotation results across spans. The returned DataFrame contains all annotations matching the query criteria, enabling statistical analysis and comparison.
Key considerations:
- Filter annotations by span IDs or project identifier
- Results include annotation name, label, score, explanation, and metadata
- Use pandas operations to compute aggregate statistics (mean scores, label distributions)
- Compare annotations across different annotator kinds (human vs LLM agreement)
- Export annotation data for further analysis or model fine-tuning
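An analysis sketch using a mock of the DataFrame returned by `get_span_annotations_dataframe()`. The column names here are illustrative, not a guarantee of the client's exact schema:

```python
import pandas as pd

# Mock annotations DataFrame; column names are illustrative.
annotations = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3", "s4"],
        "annotation_name": ["quality", "quality", "relevance", "relevance"],
        "label": ["correct", "incorrect", "relevant", "relevant"],
        "score": [1.0, 0.0, 1.0, 1.0],
    }
)

# Aggregate statistics: mean score per annotation name, label distribution.
mean_scores = annotations.groupby("annotation_name")["score"].mean()
label_counts = annotations["label"].value_counts()
```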
Step 5: Analyze and Act on Annotations
Use annotation data to drive improvements to your LLM application. Identify patterns in low-scoring spans, compare human and automated evaluations, and use annotated data to create training datasets for fine-tuning or few-shot examples.
Key considerations:
- Annotations are visible in the Phoenix UI alongside trace details
- Filter traces by annotation scores to find problematic interactions
- Compare human annotations with LLM evaluations to calibrate automated metrics
- Export annotated spans as datasets for experimentation
- Use annotation trends over time to monitor application quality
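The human/LLM comparison above can be sketched as a simple agreement rate over spans annotated by both kinds. The per-span labels here are hypothetical:

```python
# Hypothetical per-span labels from two annotator kinds, used to estimate
# human/LLM agreement as a calibration check for automated metrics.
human = {"s1": "correct", "s2": "incorrect", "s3": "correct", "s4": "correct"}
llm = {"s1": "correct", "s2": "correct", "s3": "correct", "s4": "incorrect"}

# Fraction of jointly annotated spans where the labels match.
shared = human.keys() & llm.keys()
agreement = sum(human[s] == llm[s] for s in shared) / len(shared)
# Here the LLM evaluator disagrees on s2 and s4, giving agreement == 0.5
```

A persistently low agreement rate suggests the LLM evaluator's prompt or rubric needs recalibration against human judgments.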