Workflow: Arize AI Phoenix Span Annotation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | AI_Observability, Human_Feedback, LLM_Evaluation |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
End-to-end process for adding human, LLM, or code-based annotations to traced spans for quality assessment and feedback collection.
Description
This workflow covers the span annotation pipeline in Phoenix, which enables attaching structured feedback to individual traced spans. Annotations can be created by human reviewers, LLM-based evaluators, or automated code heuristics. Each annotation includes a name (category), an annotator kind, and a result containing optional label, score, and explanation fields. Annotations can be added individually or in bulk via the Phoenix client SDK, and they are persisted alongside the traces for filtering, aggregation, and analysis in the Phoenix UI.
Key capabilities:
- Three annotator kinds: HUMAN (manual review), LLM (automated evaluation), and CODE (heuristic rules)
- Structured results with optional label (categorical), score (numeric), and explanation fields
- Single annotation and batch annotation APIs
- DataFrame-based annotation queries for analysis
- Integration with the evaluation pipeline for automated annotation at scale
- Annotation visibility in the Phoenix UI alongside trace data
Usage
Execute this workflow when you need to attach quality assessments or feedback to traced LLM interactions. Common scenarios include: human review of LLM outputs for quality assurance, automated evaluation of production traces using LLM-based classifiers, code-based heuristic checks (e.g., output length, format validation), or building labeled datasets from production data for fine-tuning.
Execution Steps
Step 1: Initialize the Phoenix Client
Create a Client instance configured with the Phoenix server endpoint. The client provides access to the spans resource which contains the annotation methods. Ensure the Phoenix server is running and accessible.
Key considerations:
- Default endpoint is http://localhost:6006
- Pass api_key for authenticated deployments
- The client can also be configured via PHOENIX_BASE_URL environment variable
- Annotations are attached to spans identified by their span_id
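A minimal initialization sketch. The endpoint-resolution logic mirrors the defaults described above; the actual client construction is left commented since it requires the `arize-phoenix-client` package and a running server:

```python
import os

# Resolve the Phoenix endpoint: the PHOENIX_BASE_URL environment variable
# takes precedence, falling back to the default local server address.
base_url = os.environ.get("PHOENIX_BASE_URL", "http://localhost:6006")
api_key = os.environ.get("PHOENIX_API_KEY")  # only for authenticated deployments

# With the package installed and a Phoenix server reachable at base_url:
# from phoenix.client import Client
# client = Client(base_url=base_url, api_key=api_key)
```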
Step 2: Add Individual Annotations
Use client.spans.add_span_annotation() to add a single annotation to a specific span. Provide the span_id, annotation name (category), and result fields (label, score, explanation). Specify the annotator kind to indicate whether the annotation is from a human, LLM, or code.
Key considerations:
- The span_id must reference an existing span in Phoenix
- annotation_name categorizes the annotation (e.g., "sentiment", "quality", "relevance")
- annotator_kind must be one of: "HUMAN", "LLM", or "CODE"
- label is a categorical value (e.g., "positive", "negative", "correct", "incorrect")
- score is a numeric value (typically 0.0 to 1.0)
- explanation provides reasoning for the annotation
- metadata stores arbitrary additional context
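A sketch of a single-annotation call. The `span_id` value is a hypothetical placeholder and must reference a real span in Phoenix; the client call itself is commented so the snippet stands alone:

```python
# Single-annotation payload; the span_id below is a hypothetical placeholder.
annotation = {
    "span_id": "0f2a44d1c2b3e4f5",
    "annotation_name": "quality",   # categorizes the annotation
    "annotator_kind": "HUMAN",      # one of "HUMAN", "LLM", "CODE"
    "label": "correct",             # categorical result
    "score": 1.0,                   # numeric result, typically 0.0 to 1.0
    "explanation": "Answer matches the reference document.",
}

# With a configured client:
# client.spans.add_span_annotation(**annotation)
```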
Step 3: Batch Annotate Spans
Use client.spans.log_span_annotations() to add multiple annotations in a single API call. Construct a list of SpanAnnotationData objects, each specifying the target span, annotation name, annotator kind, and result. This is more efficient for large-scale annotation operations.
Key considerations:
- Each SpanAnnotationData object contains: name, span_id, annotator_kind, and result dict
- The result dict contains label, score, and optional explanation
- Batch operations reduce API round-trips compared to individual annotations
- Multiple annotations with different names can be attached to the same span
- This method integrates well with evaluation pipelines that produce results for many spans
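A batch-annotation sketch. The span IDs and evaluation results are hypothetical, and the entries are plain dicts mirroring the SpanAnnotationData fields listed above; depending on the client version, `log_span_annotations` may expect SpanAnnotationData objects instead:

```python
# Hypothetical evaluation results to be logged in one batch call.
eval_results = [
    ("0f2a44d1c2b3e4f5", "relevant", 1.0),
    ("1a2b3c4d5e6f7a8b", "irrelevant", 0.0),
]

# Each entry mirrors the SpanAnnotationData fields: name, span_id,
# annotator_kind, and a result dict with label/score.
batch = [
    {
        "name": "relevance",
        "span_id": span_id,
        "annotator_kind": "LLM",
        "result": {"label": label, "score": score},
    }
    for span_id, label, score in eval_results
]

# With a configured client:
# client.spans.log_span_annotations(span_annotations=batch)
```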
Step 4: Query Annotations
Retrieve annotations using client.spans.get_span_annotations_dataframe() to analyze annotation results across spans. The returned DataFrame contains all annotations matching the query criteria, enabling statistical analysis and comparison.
Key considerations:
- Filter annotations by span IDs or project identifier
- Results include annotation name, label, score, explanation, and metadata
- Use pandas operations to compute aggregate statistics (mean scores, label distributions)
- Compare annotations across different annotator kinds (human vs LLM agreement)
- Export annotation data for further analysis or model fine-tuning
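An analysis sketch using a mock of the DataFrame returned by `get_span_annotations_dataframe()`. The column names here are illustrative, not a guarantee of the client's exact schema:

```python
import pandas as pd

# Mock annotations DataFrame; column names are illustrative.
annotations = pd.DataFrame(
    {
        "span_id": ["s1", "s2", "s3", "s4"],
        "annotation_name": ["quality", "quality", "relevance", "relevance"],
        "label": ["correct", "incorrect", "relevant", "relevant"],
        "score": [1.0, 0.0, 1.0, 1.0],
    }
)

# Aggregate statistics: mean score per annotation name, label distribution.
mean_scores = annotations.groupby("annotation_name")["score"].mean()
label_counts = annotations["label"].value_counts()
```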
Step 5: Analyze and Act on Annotations
Use annotation data to drive improvements to your LLM application. Identify patterns in low-scoring spans, compare human and automated evaluations, and use annotated data to create training datasets for fine-tuning or few-shot examples.
Key considerations:
- Annotations are visible in the Phoenix UI alongside trace details
- Filter traces by annotation scores to find problematic interactions
- Compare human annotations with LLM evaluations to calibrate automated metrics
- Export annotated spans as datasets for experimentation
- Use annotation trends over time to monitor application quality
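The human/LLM comparison above can be sketched as a simple agreement rate over spans annotated by both kinds. The per-span labels here are hypothetical:

```python
# Hypothetical per-span labels from two annotator kinds, used to estimate
# human/LLM agreement as a calibration check for automated metrics.
human = {"s1": "correct", "s2": "incorrect", "s3": "correct", "s4": "correct"}
llm = {"s1": "correct", "s2": "correct", "s3": "correct", "s4": "incorrect"}

# Fraction of jointly annotated spans where the labels match.
shared = human.keys() & llm.keys()
agreement = sum(human[s] == llm[s] for s in shared) / len(shared)
# Here the LLM evaluator disagrees on s2 and s4, giving agreement == 0.5
```

A persistently low agreement rate suggests the LLM evaluator's prompt or rubric needs recalibration against human judgments.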