Workflow: Arize AI Phoenix LLM Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | AI_Observability, LLM_Evaluation, Quality_Assurance |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
End-to-end process for evaluating LLM application outputs using the Phoenix Evals framework with built-in and custom evaluators.
Description
This workflow covers the complete LLM evaluation pipeline using the arize-phoenix-evals package. It provides a unified framework for assessing LLM outputs through classification, scoring, and custom evaluation functions. The framework includes an LLM abstraction layer supporting multiple providers (OpenAI, Anthropic, Google, Bedrock, LiteLLM), pre-built evaluators for common metrics (hallucination, faithfulness, relevance, correctness), and tools for running evaluations against DataFrames at scale with concurrency control and rate limiting.
Key capabilities:
- Unified LLM abstraction across OpenAI, Anthropic, Google, Bedrock, LiteLLM, and LangChain
- Pre-built classification evaluators for hallucination, faithfulness, document relevance, correctness, and tool invocation
- Custom evaluator creation via the create_evaluator decorator
- DataFrame-based batch evaluation with async support
- Automatic tracing of evaluation runs via OpenTelemetry
- Adaptive rate limiting to avoid API throttling
Usage
Execute this workflow when you need to systematically assess the quality of LLM outputs. Common triggers include: evaluating a RAG pipeline for hallucination and relevance, benchmarking different models or prompts against a test dataset, running regression tests on LLM application changes, or computing metrics like faithfulness and correctness on production trace data exported from Phoenix.
Execution Steps
Step 1: Configure LLM Provider
Instantiate an LLM object by specifying the provider and model. The LLM abstraction automatically selects the correct adapter for the chosen provider and handles authentication, rate limiting, and response parsing.
Key considerations:
- Supported providers: OpenAI, Anthropic, Google, Vertex, Bedrock, LiteLLM, LangChain
- Authentication is typically handled via environment variables (e.g., OPENAI_API_KEY)
- The LLM object supports text generation, classification with structured output, and object generation with JSON schema
- Rate limiting is applied adaptively based on provider responses
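The configuration step above can be sketched as follows. This assumes the package exposes an `LLM` class taking `provider` and `model` arguments (the abstraction described in this workflow); the exact constructor signature and import path may differ in your installed version, so the import is guarded and the sketch degrades to `None` when the package is absent.

```python
# Sketch of Step 1: configuring a judge LLM via the provider abstraction.
# The `LLM(provider=..., model=...)` call is an assumption about the API
# shape; check the installed arize-phoenix-evals version.
import os

# Authentication is read from environment variables by the provider SDK.
os.environ.setdefault("OPENAI_API_KEY", "<your-key-here>")

try:
    from phoenix.evals import LLM  # assumed import path
    judge = LLM(provider="openai", model="gpt-4o-mini")
except Exception:
    judge = None  # arize-phoenix-evals not installed; sketch only
```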
Step 2: Select or Create Evaluators
Choose from built-in evaluators for common metrics or create custom evaluators. Built-in evaluators include HallucinationEvaluator, QACorrectnessEvaluator, RelevanceEvaluator, and FaithfulnessEvaluator. Custom evaluators can be created using the create_evaluator decorator or by subclassing ClassificationEvaluator.
Key considerations:
- Built-in evaluators use pre-configured prompt templates optimized for each metric
- ClassificationEvaluator maps LLM responses to labels and optional numeric scores
- The create_evaluator decorator turns any Python function into an evaluator
- Custom evaluators can return scores (float), labels (str), booleans, or Score objects
- Evaluators support input field binding via bind() to map DataFrame columns to evaluator inputs
Step 3: Prepare Evaluation Data
Organize the data to evaluate into a pandas DataFrame. Each row represents one evaluation example with columns for the relevant inputs (e.g., question, context, response, reference answer). Column names must match the evaluator's expected input fields or be mapped via binding.
Key considerations:
- DataFrame columns are mapped to evaluator input fields by name
- Use bind() to rename columns if they do not match expected field names
- For RAG evaluation, include context/retrieved documents alongside the response
- Data can be exported from Phoenix traces using the client API
Step 4: Execute Evaluation
Run the evaluators against the DataFrame using evaluate_dataframe() (synchronous) or async_evaluate_dataframe() (asynchronous). The framework handles parallelization, rate limiting, error handling, and progress reporting.
Key considerations:
- Async evaluation provides better throughput for large datasets
- Concurrency is automatically adjusted based on rate limit responses
- Failed evaluations return error information rather than crashing the entire run
- Evaluation runs are automatically traced via OpenTelemetry for debugging
- Results are returned as a DataFrame with one column per evaluator score
Step 5: Analyze and Store Results
Examine the evaluation results DataFrame, compute aggregate statistics, and optionally log the results back to Phoenix as span annotations. Results include scores, labels, and explanations for each evaluator on each example.
Key considerations:
- Each evaluator produces a Score object with name, score, label, explanation, and metadata
- Aggregate metrics (mean, distribution) can be computed from the results DataFrame
- Results can be logged to Phoenix as span annotations for visualization in the UI
- Evaluation results from multiple runs can be compared side-by-side in Phoenix experiments
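Aggregation over the results DataFrame can be sketched with pandas. The frame below is a mock in the shape described above (a score and label column per evaluator); the column names are illustrative.

```python
# Sketch of Step 5: mean score and label distribution from a mock
# evaluation-results frame.
import pandas as pd

results = pd.DataFrame({
    "hallucination_label": ["factual", "hallucinated", "factual"],
    "hallucination_score": [1.0, 0.0, 1.0],
})

mean_score = results["hallucination_score"].mean()
label_dist = results["hallucination_label"].value_counts(normalize=True)
print(f"mean hallucination score: {mean_score:.2f}")  # -> 0.67
```

From here, the same frame can be logged back to Phoenix as span annotations for inspection in the UI.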