Implementation:Arize ai Phoenix Legacy Classify

Overview

The Legacy Classify module provides the core LLM-based classification framework for the Phoenix Evals subsystem. It implements llm_classify(), the primary function for applying large language model classifications to tabular data, and run_evals(), a batch orchestrator that applies multiple LLMEvaluator instances to a DataFrame in a single pass. The module also defines the ClassificationStatus enum, which tracks the execution state of each classification attempt.

llm_classify() supports both synchronous and asynchronous execution, OpenAI function calling for structured output, optional chain-of-thought explanations, and configurable concurrency with retry logic. It renders a prompt template for each row of an input DataFrame (or each element of a list), sends the rendered prompt to an LLM, and parses the response into a classification label snapped to predefined "rails" (the valid output classes).
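To make the idea of snapping a response to rails concrete, here is a minimal sketch. It is not the module's actual implementation: the function name `snap_to_rail_sketch`, the `NOT_PARSABLE` sentinel, and the matching rules (case-insensitive substring match, exactly one rail must appear) are assumptions chosen for illustration.

```python
from typing import List, Optional

# Hypothetical sentinel returned when no single rail can be identified.
NOT_PARSABLE = "NOT_PARSABLE"


def snap_to_rail_sketch(raw: Optional[str], rails: List[str]) -> str:
    """Map a raw LLM response onto one of the allowed rails (illustrative only)."""
    if not raw:
        return NOT_PARSABLE
    text = raw.strip().lower()
    # Check longer rails first so "irrelevant" is not mistaken for "relevant".
    ordered = sorted(rails, key=len, reverse=True)
    matches = []
    for rail in ordered:
        needle = rail.lower()
        if needle in text:
            matches.append(rail)
            # Blank out the matched text so a shorter rail nested inside it
            # is not counted a second time.
            text = text.replace(needle, " ")
    # Exactly one rail must appear; zero or several is treated as ambiguous.
    return matches[0] if len(matches) == 1 else NOT_PARSABLE


if __name__ == "__main__":
    rails = ["hallucinated", "factual"]
    print(snap_to_rail_sketch("The answer is Factual.", rails))
    print(snap_to_rail_sketch("factual or hallucinated?", rails))
```

Treating an ambiguous response as unparseable, rather than guessing, keeps downstream label counts honest; the real module may apply different rules.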

run_evals() orchestrates multiple LLMEvaluator instances against a shared DataFrame, running all evaluator-record pairs concurrently and returning one result DataFrame per evaluator.
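The evaluator-record fan-out described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the module's code: `run_pairs_sketch`, the `Evaluator` callable type, and the two stand-in evaluators are hypothetical, and plain lists of dicts stand in for DataFrames.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Record = Dict[str, Any]
# Hypothetical evaluator type: an async callable that scores one record.
Evaluator = Callable[[Record], Awaitable[Dict[str, Any]]]


async def run_pairs_sketch(
    records: List[Record],
    evaluators: List[Evaluator],
    concurrency: int = 4,
) -> List[List[Optional[Dict[str, Any]]]]:
    """Run every (evaluator, record) pair with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)
    # One result list per evaluator, aligned with the input record order.
    results: List[List[Optional[Dict[str, Any]]]] = [
        [None] * len(records) for _ in evaluators
    ]

    async def run_one(eval_index: int, row_index: int) -> None:
        async with semaphore:  # cap the number of in-flight calls
            evaluator = evaluators[eval_index]
            results[eval_index][row_index] = await evaluator(records[row_index])

    await asyncio.gather(
        *(run_one(e, r) for e in range(len(evaluators)) for r in range(len(records)))
    )
    return results


async def length_evaluator(record: Record) -> Dict[str, Any]:
    # Stand-in for an LLM call: labels outputs by length.
    await asyncio.sleep(0)
    label = "long" if len(record["output"]) > 10 else "short"
    return {"label": label, "score": int(label == "long")}


async def nonempty_evaluator(record: Record) -> Dict[str, Any]:
    # Stand-in for an LLM call: checks that an output is present.
    await asyncio.sleep(0)
    label = "present" if record["output"] else "empty"
    return {"label": label, "score": int(label == "present")}


if __name__ == "__main__":
    rows = [{"output": "Python is a type of snake."}, {"output": ""}]
    per_evaluator = asyncio.run(
        run_pairs_sketch(rows, [length_evaluator, nonempty_evaluator])
    )
    print(per_evaluator)
```

Writing each result into a pre-sized slot keeps the output aligned with the input rows regardless of the order in which tasks finish.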

Code Reference

Attribute	Details
Source File	`packages/phoenix-evals/src/phoenix/evals/legacy/classify.py`
Repository	Arize-ai/phoenix
Lines	521
Module	`phoenix.evals.legacy.classify`
Key Symbols	`llm_classify()`, `run_evals()`, `ClassificationStatus`
Dependencies	`pandas`, `phoenix.evals.legacy.evaluators`, `phoenix.evals.legacy.executors`, `phoenix.evals.legacy.models`, `phoenix.evals.legacy.templates`, `phoenix.evals.legacy.utils`

I/O Contract

llm_classify()

Parameter	Type	Description
`data`	`Union[pd.DataFrame, List[Any]]`	Input data containing template variables. DataFrame columns or list elements are mapped to template placeholders.
`model`	`BaseModel`	An LLM model instance (e.g., `OpenAIModel`) used to generate classifications.
`template`	`Union[ClassificationTemplate, PromptTemplate, str]`	The prompt template defining the classification task.
`rails`	`List[str]`	Valid output labels the model response is snapped to.
`data_processor`	`Optional[Callable]`	Optional callable to preprocess each input row before template mapping.
`system_instruction`	`Optional[str]`	Optional system message prepended to the LLM prompt.
`provide_explanation`	`bool`	If True, adds an `explanation` column to output.
`use_function_calling_if_available`	`bool`	If True, uses OpenAI function calling to constrain outputs.
`include_prompt`	`bool`	If True, includes the rendered prompt in the output.
`include_response`	`bool`	If True, includes the raw LLM response in the output.
`max_retries`	`int`	Maximum retry attempts per classification (default: 10).
`exit_on_error`	`bool`	If True, halts on exhausted retries; otherwise continues.
`run_sync`	`bool`	If True, forces synchronous execution.
`concurrency`	`Optional[int]`	Number of concurrent async requests.
Returns	`pd.DataFrame`	DataFrame with a `label` column; `explanation`, `prompt`, and `response` columns when the corresponding options are enabled; and the bookkeeping columns `exceptions`, `execution_status`, `execution_seconds`, `prompt_tokens`, `completion_tokens`, and `total_tokens`.

run_evals()

Parameter	Type	Description
`dataframe`	`DataFrame`	Input records to evaluate.
`evaluators`	`List[LLMEvaluator]`	List of evaluator instances to apply.
`provide_explanation`	`bool`	Whether to include explanations in output.
`use_function_calling_if_available`	`bool`	Whether to use OpenAI function calling.
`concurrency`	`Optional[int]`	Concurrent evaluation limit.
Returns	`List[DataFrame]`	One DataFrame per evaluator with `label`, `score`, and optionally `explanation` columns.

ClassificationStatus Enum

Value	Description
`DID_NOT_RUN`	Evaluation was not attempted.
`COMPLETED`	Evaluation completed successfully on first attempt.
`COMPLETED_WITH_RETRIES`	Evaluation completed after retrying.
`FAILED`	Evaluation failed after exhausting all retries.
`MISSING_INPUT`	Required template variables were missing from the input row.
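The sketch below shows one way these statuses could be assigned around a retry loop. It is a simplified illustration under stated assumptions: `StatusSketch`, its string values, `classify_with_status`, and the `call_llm` callable are hypothetical, and a `str.format`-style template stands in for the module's template classes.

```python
from enum import Enum
from string import Formatter
from typing import Any, Callable, Dict, Optional, Tuple


class StatusSketch(Enum):
    # Member names mirror the table above; the string values are assumptions.
    DID_NOT_RUN = "DID NOT RUN"  # initial state before any attempt is made
    COMPLETED = "COMPLETED"
    COMPLETED_WITH_RETRIES = "COMPLETED WITH RETRIES"
    FAILED = "FAILED"
    MISSING_INPUT = "MISSING INPUT"


def classify_with_status(
    row: Dict[str, Any],
    template: str,
    call_llm: Callable[[str], str],
    max_retries: int = 10,
) -> Tuple[Optional[str], StatusSketch]:
    """Render the template, call the model, and report how the attempt ended."""
    # Collect the placeholder names used by a str.format-style template.
    required = {name for _, name, _, _ in Formatter().parse(template) if name}
    if not required.issubset(row):
        return None, StatusSketch.MISSING_INPUT
    prompt = template.format(**row)
    # One initial attempt plus up to max_retries further attempts.
    for attempt in range(max_retries + 1):
        try:
            response = call_llm(prompt)
        except Exception:
            continue  # retry until the attempt budget is exhausted
        if attempt == 0:
            return response, StatusSketch.COMPLETED
        return response, StatusSketch.COMPLETED_WITH_RETRIES
    return None, StatusSketch.FAILED


if __name__ == "__main__":
    print(classify_with_status({"input": "q"}, "Q: {input}", lambda p: "factual"))
    print(classify_with_status({}, "Q: {input}", lambda p: "factual"))
```

Checking for missing template variables before calling the model avoids spending retries on a row that can never succeed.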

Usage Examples

from phoenix.evals.legacy.classify import llm_classify, run_evals
from phoenix.evals.legacy.models import OpenAIModel
from phoenix.evals.legacy.default_templates import HALLUCINATION_PROMPT_TEMPLATE
import pandas as pd

# Basic classification with llm_classify
model = OpenAIModel(model="gpt-4")
df = pd.DataFrame({
    "input": ["What is Python?"],
    "reference": ["Python is a programming language."],
    "output": ["Python is a type of snake."],
})

result = llm_classify(
    data=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["hallucinated", "factual"],
    provide_explanation=True,
)
# result contains columns: label, explanation, exceptions, execution_status, ...

from phoenix.evals.legacy.evaluators import HallucinationEvaluator, QAEvaluator

# Batch evaluation with run_evals
hallucination_eval = HallucinationEvaluator(model=model)
qa_eval = QAEvaluator(model=model)

results = run_evals(
    dataframe=df,
    evaluators=[hallucination_eval, qa_eval],
    provide_explanation=True,
)
# results[0] = hallucination DataFrame, results[1] = QA DataFrame

Related Pages

Arize_ai_Phoenix_Legacy_Evaluators - LLMEvaluator classes consumed by run_evals()
Arize_ai_Phoenix_Legacy_Templates - PromptTemplate and ClassificationTemplate used for prompt rendering
Arize_ai_Phoenix_Legacy_Default_Templates - Predefined classification templates (e.g., hallucination, relevance)
Arize_ai_Phoenix_Legacy_Utils - Utility functions for rail snapping and function call parsing
