Implementation:Arize ai Phoenix Legacy Classify
LLM_Evaluation Data_Processing
Overview
The Legacy Classify module provides the core LLM-based classification framework for the Phoenix Evals subsystem. It implements llm_classify(), the primary function for applying large language model classifications to tabular data, and run_evals(), a batch evaluator orchestrator that applies multiple LLMEvaluator instances across a DataFrame in a single pass. The module also defines the ClassificationStatus enum for tracking the execution state of individual classification attempts.
llm_classify() supports both synchronous and asynchronous execution, OpenAI function calling for structured output, optional chain-of-thought explanations, and configurable concurrency with retry logic. It processes each row of an input DataFrame (or list) through a prompt template, sends the rendered prompt to an LLM, and parses the response into a classification label snapped to predefined "rails" (valid output classes).
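The "snapping" step described above can be illustrated with a minimal sketch. It is not the module's actual implementation: the function name `snap_to_rail_sketch`, the `NOT_PARSABLE` sentinel, and the whole-word matching rule are all assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical sentinel returned when no single rail can be identified.
NOT_PARSABLE = "NOT_PARSABLE"


def snap_to_rail_sketch(raw_response: str, rails: List[str]) -> str:
    """Map a free-text LLM response onto exactly one rail, if possible."""
    text = raw_response.lower()
    # Whole-word matching keeps "relevant" from matching inside "irrelevant".
    matches = [
        rail
        for rail in rails
        if re.search(r"\b" + re.escape(rail.lower()) + r"\b", text)
    ]
    # Zero matches or several matches are both treated as ambiguous.
    return matches[0] if len(matches) == 1 else NOT_PARSABLE


label = snap_to_rail_sketch("Factual.", ["hallucinated", "factual"])
ambiguous = snap_to_rail_sketch(
    "could be hallucinated or factual", ["hallucinated", "factual"]
)
```

Treating multi-rail responses as unparsable, rather than picking the first match, is one possible design choice; it avoids silently assigning a label to an ambiguous answer.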
run_evals() orchestrates multiple LLMEvaluator instances against a shared DataFrame, running all evaluator-record pairs concurrently and returning one result DataFrame per evaluator.
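The fan-out over evaluator-record pairs can be sketched with asyncio as follows. This is a simplified illustration under stated assumptions, not the module's executor: `run_evals_sketch`, the toy evaluators, and the `(label, score)` return shape are hypothetical, and retries and error handling are omitted.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

import pandas as pd

# Hypothetical evaluator shape: one record in, (label, score) out.
EvalFn = Callable[[Dict[str, Any]], Awaitable[Tuple[str, int]]]


async def run_evals_sketch(
    dataframe: pd.DataFrame, evaluators: List[EvalFn], concurrency: int = 4
) -> List[pd.DataFrame]:
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight evaluations

    async def run_pair(eval_idx: int, row_idx: Any, record: Dict[str, Any]):
        async with semaphore:
            label, score = await evaluators[eval_idx](record)
        return eval_idx, row_idx, label, score

    # One task per (evaluator, record) pair, all scheduled together.
    records = dataframe.to_dict("index")
    tasks = [
        run_pair(eval_idx, row_idx, record)
        for eval_idx in range(len(evaluators))
        for row_idx, record in records.items()
    ]
    outcomes = await asyncio.gather(*tasks)

    # Regroup the flat outcomes into one DataFrame per evaluator.
    results = []
    for eval_idx in range(len(evaluators)):
        rows = {
            row_idx: {"label": label, "score": score}
            for e, row_idx, label, score in outcomes
            if e == eval_idx
        }
        frame = pd.DataFrame.from_dict(rows, orient="index")
        results.append(frame.reindex(dataframe.index))
    return results


# Toy evaluators standing in for LLM-backed ones.
async def contains_python(record):
    return ("yes", 1) if "python" in record["output"].lower() else ("no", 0)


async def is_short(record):
    return ("short", 1) if len(record["output"]) < 20 else ("long", 0)


df = pd.DataFrame({"output": ["Python is a language.", "A snake."]}, index=[10, 11])
results = asyncio.run(run_evals_sketch(df, [contains_python, is_short]))
```

Each returned frame keeps the input index, so results can be aligned with the original records regardless of the order in which tasks finish.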
Code Reference
| Attribute | Details |
|---|---|
| Source File | packages/phoenix-evals/src/phoenix/evals/legacy/classify.py |
| Repository | Arize-ai/phoenix |
| Lines | 521 |
| Module | phoenix.evals.legacy.classify |
| Key Symbols | llm_classify(), run_evals(), ClassificationStatus |
| Dependencies | pandas, phoenix.evals.legacy.evaluators, phoenix.evals.legacy.executors, phoenix.evals.legacy.models, phoenix.evals.legacy.templates, phoenix.evals.legacy.utils |
I/O Contract
llm_classify()
| Parameter | Type | Description |
|---|---|---|
| data | Union[pd.DataFrame, List[Any]] | Input data containing template variables. DataFrame columns or list elements are mapped to template placeholders. |
| model | BaseModel | An LLM model instance (e.g., OpenAIModel) used to generate classifications. |
| template | Union[ClassificationTemplate, PromptTemplate, str] | The prompt template defining the classification task. |
| rails | List[str] | Valid output labels the model response is snapped to. |
| data_processor | Optional[Callable] | Optional callable to preprocess each input row before template mapping. |
| system_instruction | Optional[str] | Optional system message prepended to the LLM prompt. |
| provide_explanation | bool | If True, adds an explanation column to output. |
| use_function_calling_if_available | bool | If True, uses OpenAI function calling to constrain outputs. |
| include_prompt | bool | If True, includes the rendered prompt in the output. |
| include_response | bool | If True, includes the raw LLM response in the output. |
| max_retries | int | Maximum retry attempts per classification (default: 10). |
| exit_on_error | bool | If True, halts on exhausted retries; otherwise continues. |
| run_sync | bool | If True, forces synchronous execution. |
| concurrency | Optional[int] | Number of concurrent async requests. |
| Returns | pd.DataFrame | DataFrame with columns: label, optionally explanation, prompt, response, plus exceptions, execution_status, execution_seconds, prompt_tokens, completion_tokens, total_tokens. |
run_evals()
| Parameter | Type | Description |
|---|---|---|
| dataframe | DataFrame | Input records to evaluate. |
| evaluators | List[LLMEvaluator] | List of evaluator instances to apply. |
| provide_explanation | bool | Whether to include explanations in output. |
| use_function_calling_if_available | bool | Whether to use OpenAI function calling. |
| concurrency | Optional[int] | Concurrent evaluation limit. |
| Returns | List[DataFrame] | One DataFrame per evaluator with label, score, and optionally explanation columns. |
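Because run_evals() returns a separate DataFrame per evaluator, a common follow-up is to join them back onto the input. The sketch below shows one way to do that with pandas. It assumes each result DataFrame shares the input's index; the helper `combine_eval_results`, the prefix names, and the toy label and score values are illustrative placeholders, not real evaluator output.

```python
import pandas as pd

# Toy stand-ins for the input and for two per-evaluator result DataFrames.
df = pd.DataFrame({"output": ["Python is a type of snake."]})
hallucination_df = pd.DataFrame({"label": ["hallucinated"], "score": [0]})
qa_df = pd.DataFrame({"label": ["incorrect"], "score": [0]})


def combine_eval_results(dataframe, results, names):
    """Prefix each result's columns with its evaluator name and join on index."""
    prefixed = [res.add_prefix(f"{name}_") for res, name in zip(results, names)]
    return pd.concat([dataframe, *prefixed], axis=1)


combined = combine_eval_results(
    df, [hallucination_df, qa_df], ["hallucination", "qa"]
)
```

Prefixing is needed because every result DataFrame uses the same column names (label, score), which would otherwise collide after concatenation.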
ClassificationStatus Enum
| Value | Description |
|---|---|
| DID_NOT_RUN | Evaluation was not attempted. |
| COMPLETED | Evaluation completed successfully on first attempt. |
| COMPLETED_WITH_RETRIES | Evaluation completed after retrying. |
| FAILED | Evaluation failed after exhausting all retries. |
| MISSING_INPUT | Required template variables were missing from the input row. |
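The relationship between these states can be summarized in a small sketch. The enum `StatusSketch` and the helper `status_for` are hypothetical; only the member names come from the table above, and the string values are placeholders rather than the values the module actually stores.

```python
from enum import Enum


class StatusSketch(Enum):
    # Member names mirror the table; the string values are placeholders.
    DID_NOT_RUN = "did_not_run"
    COMPLETED = "completed"
    COMPLETED_WITH_RETRIES = "completed_with_retries"
    FAILED = "failed"
    MISSING_INPUT = "missing_input"


def status_for(
    attempts: int, succeeded: bool, missing_input: bool = False
) -> StatusSketch:
    """Derive a final status once the retry loop for one row has ended."""
    if missing_input:
        # The row could not be rendered into the template at all.
        return StatusSketch.MISSING_INPUT
    if attempts == 0:
        return StatusSketch.DID_NOT_RUN
    if not succeeded:
        # Assumes the caller only reports failure after retries are exhausted.
        return StatusSketch.FAILED
    if attempts == 1:
        return StatusSketch.COMPLETED
    return StatusSketch.COMPLETED_WITH_RETRIES


first_try = status_for(attempts=1, succeeded=True)
after_retry = status_for(attempts=3, succeeded=True)
```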
Usage Examples
from phoenix.evals.legacy.classify import llm_classify, run_evals
from phoenix.evals.legacy.models import OpenAIModel
from phoenix.evals.legacy.default_templates import HALLUCINATION_PROMPT_TEMPLATE
import pandas as pd
# Basic classification with llm_classify
model = OpenAIModel(model="gpt-4")
df = pd.DataFrame({
    "input": ["What is Python?"],
    "reference": ["Python is a programming language."],
    "output": ["Python is a type of snake."],
})
result = llm_classify(
    data=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["hallucinated", "factual"],
    provide_explanation=True,
)
# result contains columns: label, explanation, exceptions, execution_status, ...
from phoenix.evals.legacy.evaluators import HallucinationEvaluator, QAEvaluator
# Batch evaluation with run_evals
hallucination_eval = HallucinationEvaluator(model=model)
qa_eval = QAEvaluator(model=model)
results = run_evals(
    dataframe=df,
    evaluators=[hallucination_eval, qa_eval],
    provide_explanation=True,
)
# results[0] = hallucination DataFrame, results[1] = QA DataFrame
Related Pages
- Arize_ai_Phoenix_Legacy_Evaluators - LLMEvaluator classes consumed by run_evals()
- Arize_ai_Phoenix_Legacy_Templates - PromptTemplate and ClassificationTemplate used for prompt rendering
- Arize_ai_Phoenix_Legacy_Default_Templates - Predefined classification templates (e.g., hallucination, relevance)
- Arize_ai_Phoenix_Legacy_Utils - Utility functions for rail snapping and function call parsing