Implementation:Arize ai Phoenix Legacy Public API

Overview

The Legacy Public API is the __init__.py module of the phoenix.evals.legacy subpackage. It serves as the entry point for the v1 evaluation system in the arize-phoenix-evals package, re-exporting all legacy evaluation functions, model classes, evaluator wrappers, template constants, and utility functions that formed the original phoenix-evals interface.

Description

This module consolidates the legacy (v1) API of phoenix-evals into a single import namespace. The v1 system classifies DataFrame rows directly with an LLM, driven by template strings and rail maps. The modern (v2) API instead uses the Evaluator / ClassificationEvaluator class hierarchy.

The legacy module re-exports from the following internal submodules:

phoenix.evals.legacy.classify -- Core classification functions (llm_classify, run_evals).
phoenix.evals.legacy.generate -- LLM generation function (llm_generate).
phoenix.evals.legacy.evaluators -- Pre-built evaluator classes wrapping common evaluation patterns.
phoenix.evals.legacy.models -- Model wrapper classes for various LLM providers.
phoenix.evals.legacy.default_templates -- Prompt template constants for built-in evaluation types.
phoenix.evals.legacy.span_templates -- Span-level prompt templates.
phoenix.evals.legacy.templates -- Template utility classes.
phoenix.evals.legacy.retrievals -- Retrieval metric computation (compute_precisions_at_k).
phoenix.evals.legacy.utils -- Utility constants and dataset download functions.

The module also exposes the package version via __version__.

Usage

from phoenix.evals.legacy import *

Or import specific symbols:

from phoenix.evals.legacy import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

Code Reference

Property	Value
Source File	packages/phoenix-evals/src/phoenix/evals/legacy/__init__.py
Module	`phoenix.evals.legacy`
Lines	~155
Package	`arize-phoenix-evals`
Domain	API Surface, Legacy

Exported Symbols

Core Functions

Symbol	Source Module	Description
`llm_classify`	`legacy.classify`	Classifies rows of a DataFrame using an LLM with a prompt template and rail map. Central function of the v1 evaluation system.
`run_evals`	`legacy.classify`	Runs multiple evaluation templates against a DataFrame in batch.
`llm_generate`	`legacy.generate`	Generates freeform LLM responses for rows in a DataFrame.
`compute_precisions_at_k`	`legacy.retrievals`	Computes precision@K metrics for retrieval evaluation.
`download_benchmark_dataset`	`legacy.utils`	Downloads standard benchmark datasets for evaluation.
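
The precision@K metric named in the table can be illustrated with a small standalone sketch. It does not reproduce the library's implementation: the function name `precisions_at_k` and its input type are assumptions made for illustration. Given a ranked list of per-document relevance flags, precision@K is the fraction of the top K documents that are relevant.

```python
from typing import List


def precisions_at_k(relevance: List[bool]) -> List[float]:
    """Return precision@1 .. precision@N for a ranked list of relevance flags.

    Hypothetical helper for illustration; not the phoenix implementation.
    """
    precisions: List[float] = []
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        relevant_so_far += int(is_relevant)
        # Fraction of the first k documents that were judged relevant.
        precisions.append(relevant_so_far / k)
    return precisions


# Ranked retrieval results: first, third, and fourth documents are relevant.
ranked = [True, False, True, True]
p_at_k = precisions_at_k(ranked)
```

In a retrieval evaluation the relevance flags would typically come from a relevance classification step, with one flag per retrieved document in rank order.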

Model Classes

All model classes are imported from legacy.models:

Symbol	Description
`OpenAIModel`	Wrapper for OpenAI API models.
`AnthropicModel`	Wrapper for Anthropic API models.
`GeminiModel`	Wrapper for Google Gemini models.
`GoogleGenAIModel`	Wrapper for Google GenAI models.
`VertexAIModel`	Wrapper for Google Vertex AI models.
`BedrockModel`	Wrapper for AWS Bedrock models.
`LiteLLMModel`	Wrapper for the LiteLLM unified interface.
`MistralAIModel`	Wrapper for Mistral AI models.

Evaluator Classes

Pre-built evaluator wrappers from legacy.evaluators:

Symbol	Description
`LLMEvaluator`	Base legacy LLM evaluator class.
`HallucinationEvaluator`	Legacy hallucination detection evaluator.
`QAEvaluator`	Legacy question-answering quality evaluator.
`RelevanceEvaluator`	Legacy relevance evaluator.
`SQLEvaluator`	Legacy evaluator for SQL generation.
`SummarizationEvaluator`	Legacy summarization quality evaluator.
`ToxicityEvaluator`	Legacy toxicity detection evaluator.

Template Classes

From legacy.templates:

Symbol	Description
`PromptTemplate`	Template class for constructing evaluation prompts with variable substitution.
`ClassificationTemplate`	Specialized template class for classification evaluation prompts.
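
Variable substitution in a prompt template can be sketched with the standard library alone. The `MiniTemplate` class below is a hypothetical stand-in, not the phoenix `PromptTemplate`; it assumes variables are written in `{braces}` and are filled from a per-row mapping.

```python
from string import Formatter
from typing import Dict, List


class MiniTemplate:
    """Hypothetical stand-in for a prompt template; not the phoenix class."""

    def __init__(self, text: str) -> None:
        self.text = text
        # Collect the names inside {braces}, preserving first-seen order.
        seen: List[str] = []
        for _, field, _, _ in Formatter().parse(text):
            if field and field not in seen:
                seen.append(field)
        self.variables = seen

    def format(self, row: Dict[str, str]) -> str:
        # Fail early with a clear message if the row lacks a variable.
        missing = [v for v in self.variables if v not in row]
        if missing:
            raise KeyError(f"row is missing template variables: {missing}")
        # Extra keys in the row are ignored.
        return self.text.format(**{v: row[v] for v in self.variables})


template = MiniTemplate("Question: {input}\nAnswer: {output}\nIs the answer correct?")
prompt = template.format({"input": "2 + 2?", "output": "4", "unused": "x"})
```

Knowing a template's variables up front is what lets a caller check that a DataFrame has the required columns before any LLM call is made.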

Prompt Template Constants

All constants are imported from legacy.default_templates. Each evaluation type provides up to four variants:

Variant Suffix	Description
`_BASE_TEMPLATE`	The core prompt text without rails or explanation directives.
`_RAILS_MAP`	A dictionary whose values are the valid output labels (rails) for classification; pass `list(MAP.values())` as the `rails` argument of `llm_classify`.
`_TEMPLATE`	The standard prompt template ready for use with `llm_classify`.
`_TEMPLATE_WITH_EXPLANATION`	A variant that asks the LLM to explain its reasoning alongside the label.

Evaluation types and their constant prefixes:

Evaluation Type	Prefix
Hallucination Detection	`HALLUCINATION_PROMPT_`
Question Answering (QA)	`QA_PROMPT_`
RAG Relevancy	`RAG_RELEVANCY_PROMPT_`
Code Functionality	`CODE_FUNCTIONALITY_PROMPT_`
Code Readability	`CODE_READABILITY_PROMPT_`
Human vs. AI	`HUMAN_VS_AI_PROMPT_`
Reference Link Correctness	`REFERENCE_LINK_CORRECTNESS_PROMPT_`
SQL Generation	`SQL_GEN_EVAL_PROMPT_`
Tool Calling	`TOOL_CALLING_`
Toxicity	`TOXICITY_PROMPT_`
User Frustration	`USER_FRUSTRATION_PROMPT_`
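
The naming convention in the two tables above (prefix plus variant suffix) can be sketched as a small helper. `candidate_constant_names` is a hypothetical function for illustration only. It builds candidate names and does not verify that each one is exported, since an evaluation type may provide fewer than four variants.

```python
from typing import List

# Suffixes taken from the variant table above.
VARIANT_SUFFIXES = (
    "_BASE_TEMPLATE",
    "_RAILS_MAP",
    "_TEMPLATE",
    "_TEMPLATE_WITH_EXPLANATION",
)


def candidate_constant_names(prefix: str) -> List[str]:
    """Join a prefix such as 'HALLUCINATION_PROMPT_' with each variant suffix.

    Hypothetical helper: it only builds names, it does not import anything.
    """
    stem = prefix.rstrip("_")  # avoid a doubled underscore at the join
    return [stem + suffix for suffix in VARIANT_SUFFIXES]


names = candidate_constant_names("HALLUCINATION_PROMPT_")
```

A real lookup would need to tolerate missing variants, for example by resolving each name with `getattr(module, name, None)` and skipping the ones that come back as `None`.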

Span-Level Templates

From legacy.span_templates:

Symbol	Description
`HALLUCINATION_SPAN_PROMPT_TEMPLATE`	Span-level hallucination evaluation prompt template.
`QA_SPAN_PROMPT_TEMPLATE`	Span-level QA evaluation prompt template.
`TOOL_CALLING_SPAN_PROMPT_TEMPLATE`	Span-level tool calling evaluation prompt template.

Other Constants

Symbol	Source	Description
`NOT_PARSABLE`	`legacy.utils`	Sentinel string value returned when LLM output cannot be parsed into a valid classification label.
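
The role of the sentinel can be sketched as follows. This is a simplified illustration under stated assumptions: the local `NOT_PARSABLE` value and the `snap_to_rails` helper are stand-ins defined here, and the library's own parsing may be more lenient than the exact-match rule used below.

```python
from typing import List

# Assumption: a local stand-in for the sentinel. Real code should import
# NOT_PARSABLE from phoenix.evals.legacy instead of redefining it.
NOT_PARSABLE = "NOT_PARSABLE"


def snap_to_rails(raw_output: str, rails: List[str]) -> str:
    """Map raw LLM text to one of the rails, or to the sentinel.

    Simplified sketch: exact match after trimming whitespace, surrounding
    quotes and periods, and ignoring case.
    """
    cleaned = raw_output.strip().strip("\"'.").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return NOT_PARSABLE


rails = ["factual", "hallucinated"]
labels = [snap_to_rails(text, rails) for text in ["Factual.", " hallucinated ", "maybe?"]]
```

Rows labeled with the sentinel can then be filtered out or counted separately before computing aggregate metrics, so that malformed LLM output is not mistaken for a real class.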

I/O Contract

This module is a re-export surface and does not define its own I/O contract. Each re-exported symbol has its own I/O contract in its respective source module.

The key I/O pattern for the legacy system is:

# Input: pandas DataFrame + model + template + rails
results_df = llm_classify(
    dataframe=df,           # pd.DataFrame with columns referenced in the template
    model=model,            # One of the Model classes (OpenAIModel, etc.)
    template=template,      # A prompt template string or PromptTemplate
    rails=rails_list,       # List of valid classification labels
    provide_explanation=True,  # Optional: include LLM explanations
)
# Output: pandas DataFrame with 'label' and optionally 'explanation' columns
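
To make the input/output shape concrete without calling a real LLM, the same pattern can be sketched with plain Python rows and a fake model. Everything here is an assumption for illustration: `fake_model`, `classify_rows`, the template text, and the `{bool: label}` shape of the rails map are stand-ins, not phoenix code.

```python
from typing import Callable, Dict, List

# Assumptions: a rails map shaped like {bool: label} and a toy template.
RAILS_MAP = {True: "hallucinated", False: "factual"}
TEMPLATE = "Reference: {reference}\nAnswer: {output}\nRespond with one word."


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: flags answers absent from the reference."""
    reference = prompt.split("Reference: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    return "factual" if answer in reference else "hallucinated"


def classify_rows(
    rows: List[Dict[str, str]],
    model: Callable[[str], str],
    template: str,
    rails: List[str],
) -> List[Dict[str, str]]:
    """Return one {'label': ...} record per input row, in input order."""
    results = []
    for row in rows:
        response = model(template.format(**row)).strip()
        # Anything outside the rails is marked unparsable rather than kept.
        results.append({"label": response if response in rails else "NOT_PARSABLE"})
    return results


rows = [
    {"reference": "Paris is the capital of France.", "output": "Paris"},
    {"reference": "Paris is the capital of France.", "output": "Lyon"},
]
results = classify_rows(rows, fake_model, TEMPLATE, rails=list(RAILS_MAP.values()))
```

The sketch mirrors the contract described above: one output record per input row, with the label restricted to the supplied rails.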

Usage Examples

Legacy Classification with llm_classify

import pandas as pd
from phoenix.evals.legacy import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

model = OpenAIModel(model="gpt-4o-mini")
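# Column names must match the template's variables; the built-in
# hallucination template is expected to reference input, reference, and output.
df = pd.DataFrame({
    "input": ["What is the capital of France?"],
    "output": ["Paris is the capital of France."],
    "reference": ["Paris is the capital and largest city of France."],
})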

results = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)

Using Legacy Evaluator Wrappers

from phoenix.evals.legacy import HallucinationEvaluator, OpenAIModel

model = OpenAIModel(model="gpt-4o-mini")
evaluator = HallucinationEvaluator(model)

Running Batch Evaluations

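from phoenix.evals.legacy import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

model = OpenAIModel(model="gpt-4o-mini")
# run_evals applies each evaluator to df (a DataFrame with the columns the templates reference) and returns one result DataFrame per evaluator
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
)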

Related Pages

Arize_ai_Phoenix_Evals_Public_API -- The top-level public API that re-exports this legacy module alongside the modern (v2) API.
Arize_ai_Phoenix_HallucinationEvaluator -- Modern (v2) hallucination evaluator (deprecated in favor of FaithfulnessEvaluator).
Arize_ai_Phoenix_CorrectnessEvaluator -- Modern (v2) correctness evaluator.
Arize_ai_Phoenix_FaithfulnessEvaluator -- Modern (v2) faithfulness evaluator.
