Workflow: Arize AI Phoenix LLM Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | AI_Observability, LLM_Evaluation, Quality_Assurance |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
End-to-end process for evaluating LLM application outputs using the Phoenix Evals framework with built-in and custom evaluators.
Description
This workflow covers the complete LLM evaluation pipeline using the arize-phoenix-evals package. It provides a unified framework for assessing LLM outputs through classification, scoring, and custom evaluation functions. The framework includes an LLM abstraction layer supporting multiple providers (OpenAI, Anthropic, Google, Bedrock, LiteLLM), pre-built evaluators for common metrics (hallucination, faithfulness, relevance, correctness), and tools for running evaluations against DataFrames at scale with concurrency control and rate limiting.
Key capabilities:
- Unified LLM abstraction across OpenAI, Anthropic, Google, Bedrock, LiteLLM, and LangChain
- Pre-built classification evaluators for hallucination, faithfulness, document relevance, correctness, and tool invocation
- Custom evaluator creation via the create_evaluator decorator
- DataFrame-based batch evaluation with async support
- Automatic tracing of evaluation runs via OpenTelemetry
- Adaptive rate limiting to avoid API throttling
Usage
Execute this workflow when you need to systematically assess the quality of LLM outputs. Common triggers include: evaluating a RAG pipeline for hallucination and relevance, benchmarking different models or prompts against a test dataset, running regression tests on LLM application changes, or computing metrics like faithfulness and correctness on production trace data exported from Phoenix.
Execution Steps
Step 1: Configure LLM Provider
Instantiate an LLM object by specifying the provider and model. The LLM abstraction automatically selects the correct adapter for the chosen provider and handles authentication, rate limiting, and response parsing.
Key considerations:
- Supported providers: OpenAI, Anthropic, Google, Vertex, Bedrock, LiteLLM, LangChain
- Authentication is typically handled via environment variables (e.g., OPENAI_API_KEY)
- The LLM object supports text generation, classification with structured output, and object generation with JSON schema
- Rate limiting is applied adaptively based on provider responses
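The configuration step above can be sketched as follows. This assumes the package exposes an `LLM` class taking `provider` and `model` arguments (the abstraction described in this workflow); the exact constructor signature and import path may differ in your installed version, so the import is guarded and the sketch degrades to `None` when the package is absent.

```python
# Sketch of Step 1: configuring a judge LLM via the provider abstraction.
# The `LLM(provider=..., model=...)` call is an assumption about the API
# shape; check the installed arize-phoenix-evals version.
import os

# Authentication is read from environment variables by the provider SDK.
os.environ.setdefault("OPENAI_API_KEY", "<your-key-here>")

try:
    from phoenix.evals import LLM  # assumed import path
    judge = LLM(provider="openai", model="gpt-4o-mini")
except Exception:
    judge = None  # arize-phoenix-evals not installed; sketch only
```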
Step 2: Select or Create Evaluators
Choose from built-in evaluators for common metrics or create custom evaluators. Built-in evaluators include HallucinationEvaluator, QACorrectnessEvaluator, RelevanceEvaluator, and FaithfulnessEvaluator. Custom evaluators can be created using the create_evaluator decorator or by subclassing ClassificationEvaluator.
Key considerations:
- Built-in evaluators use pre-configured prompt templates optimized for each metric
- ClassificationEvaluator maps LLM responses to labels and optional numeric scores
- The create_evaluator decorator turns any Python function into an evaluator
- Custom evaluators can return scores (float), labels (str), booleans, or Score objects
- Evaluators support input field binding via bind() to map DataFrame columns to evaluator inputs
Step 3: Prepare Evaluation Data
Organize the data to evaluate into a pandas DataFrame. Each row represents one evaluation example with columns for the relevant inputs (e.g., question, context, response, reference answer). Column names must match the evaluator's expected input fields or be mapped via binding.
Key considerations:
- DataFrame columns are mapped to evaluator input fields by name
- Use bind() to rename columns if they do not match expected field names
- For RAG evaluation, include context/retrieved documents alongside the response
- Data can be exported from Phoenix traces using the client API
Step 4: Execute Evaluation
Run the evaluators against the DataFrame using evaluate_dataframe() (synchronous) or async_evaluate_dataframe() (asynchronous). The framework handles parallelization, rate limiting, error handling, and progress reporting.
Key considerations:
- Async evaluation provides better throughput for large datasets
- Concurrency is automatically adjusted based on rate limit responses
- Failed evaluations return error information rather than crashing the entire run
- Evaluation runs are automatically traced via OpenTelemetry for debugging
- Results are returned as a DataFrame with one column per evaluator score
Step 5: Analyze and Store Results
Examine the evaluation results DataFrame, compute aggregate statistics, and optionally log the results back to Phoenix as span annotations. Results include scores, labels, and explanations for each evaluator on each example.
Key considerations:
- Each evaluator produces a Score object with name, score, label, explanation, and metadata
- Aggregate metrics (mean, distribution) can be computed from the results DataFrame
- Results can be logged to Phoenix as span annotations for visualization in the UI
- Evaluation results from multiple runs can be compared side-by-side in Phoenix experiments
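Aggregation over the results DataFrame can be sketched with pandas. The frame below is a mock in the shape described above (a score and label column per evaluator); the column names are illustrative.

```python
# Sketch of Step 5: mean score and label distribution from a mock
# evaluation-results frame.
import pandas as pd

results = pd.DataFrame({
    "hallucination_label": ["factual", "hallucinated", "factual"],
    "hallucination_score": [1.0, 0.0, 1.0],
})

mean_score = results["hallucination_score"].mean()
label_dist = results["hallucination_label"].value_counts(normalize=True)
print(f"mean hallucination score: {mean_score:.2f}")  # -> 0.67
```

From here, the same frame can be logged back to Phoenix as span annotations for inspection in the UI.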