Implementation:Arize ai Phoenix Legacy Evaluators

Overview

The Legacy Evaluators module provides a hierarchy of LLM-based evaluator classes for the Phoenix Evals subsystem. The base class LLMEvaluator wraps a model and a classification template and exposes a synchronous method (evaluate()) and an asynchronous method (aevaluate()). Each evaluates a single record and returns a (label, score, explanation) tuple in which the score and the explanation may be None.

Six task-specific evaluator subclasses extend LLMEvaluator with preconfigured templates from EvalCriteria: HallucinationEvaluator, RelevanceEvaluator, QAEvaluator, ToxicityEvaluator, SummarizationEvaluator, and SQLEvaluator.

The module also defines two long-context strategies for data that exceeds a single LLM context window: MapReducer evaluates each chunk independently and then combines the results, while Refiner carries an accumulator through the chunks in order, refining it at each step, and can finish with an optional synthesis step.

Code Reference

Attribute	Details
Source File	`packages/phoenix-evals/src/phoenix/evals/legacy/evaluators.py`
Repository	Arize-ai/phoenix
Lines	447
Module	`phoenix.evals.legacy.evaluators`
Key Symbols	`LLMEvaluator`, `HallucinationEvaluator`, `RelevanceEvaluator`, `QAEvaluator`, `ToxicityEvaluator`, `SummarizationEvaluator`, `SQLEvaluator`, `MapReducer`, `Refiner`
Dependencies	`phoenix.evals.default_templates.EvalCriteria`, `phoenix.evals.models.BaseModel`, `phoenix.evals.models.OpenAIModel`, `phoenix.evals.templates.ClassificationTemplate`, `phoenix.evals.utils`

I/O Contract

LLMEvaluator

Method	Parameters	Returns
`__init__(model, template)`	`model: BaseModel`, `template: ClassificationTemplate`	None
`evaluate(record, ...)`	`record: Mapping[str, str]`, `provide_explanation: bool`, `use_function_calling_if_available: bool`, `verbose: bool`	`Tuple[str, Optional[float], Optional[str]]` - (label, score, explanation)
`aevaluate(record, ...)`	Same as `evaluate`	Same as `evaluate` (async)
`reload_client()`	None	None - reloads the underlying model client
`default_concurrency`	Property	`int` - default concurrency from the model
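
The async half of this contract can be illustrated without calling a real LLM. The sketch below is illustrative only: `StubEvaluator` and `evaluate_all` are hypothetical names that do not exist in Phoenix, and the substring check merely stands in for a model call. It shows how a caller might bound concurrent `aevaluate()` calls using a value such as `default_concurrency`.

```python
import asyncio
from typing import List, Mapping, Optional, Tuple

# Same shape as the documented return value: (label, score, explanation).
EvalResult = Tuple[str, Optional[float], Optional[str]]


class StubEvaluator:
    """Hypothetical stand-in that mimics the aevaluate() signature only."""

    default_concurrency = 2  # assumed value, chosen for the example

    async def aevaluate(
        self, record: Mapping[str, str], provide_explanation: bool = False
    ) -> EvalResult:
        await asyncio.sleep(0)  # stands in for the network round trip
        grounded = record["output"] in record["reference"]
        label = "factual" if grounded else "hallucinated"
        explanation = "substring check" if provide_explanation else None
        return label, None, explanation  # score is optional in the contract


async def evaluate_all(
    evaluator: StubEvaluator, records: List[Mapping[str, str]]
) -> List[EvalResult]:
    # Cap the number of in-flight evaluations with a semaphore.
    limit = asyncio.Semaphore(evaluator.default_concurrency)

    async def run_one(record: Mapping[str, str]) -> EvalResult:
        async with limit:
            return await evaluator.aevaluate(record, provide_explanation=True)

    # gather() returns results in the same order as the input records.
    return await asyncio.gather(*(run_one(r) for r in records))


records = [
    {"input": "Capital of France?", "reference": "Paris is the capital of France.", "output": "Paris"},
    {"input": "Capital of France?", "reference": "Paris is the capital of France.", "output": "London"},
]
results = asyncio.run(evaluate_all(StubEvaluator(), records))
```

With a real evaluator, the same pattern would apply to any object exposing `aevaluate()`. How Phoenix itself schedules concurrent calls is not described on this page.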

Task-Specific Evaluators

Class	Template (via EvalCriteria)	Expected DataFrame Columns	Rails
HallucinationEvaluator	`HALLUCINATION`	`input`, `reference`, `output`	`["hallucinated", "factual"]`
RelevanceEvaluator	`RELEVANCE`	`input`, `reference`	`["relevant", "unrelated"]`
QAEvaluator	`QA`	`input`, `reference`, `output`	`["correct", "incorrect"]`
ToxicityEvaluator	`TOXICITY`	`input`	`["toxic", "non-toxic"]`
SummarizationEvaluator	`SUMMARIZATION`	`output`, `input`	`["good", "bad"]`
SQLEvaluator	`SQL_GEN_EVAL`	`question`, `query_gen`, `response`	`["correct", "incorrect"]`
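
Since each evaluator expects a different set of columns, checking a record before sending it to the model can save a wasted call. The sketch below copies the column names from the table above. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helpers for illustration and are not part of the Phoenix API.

```python
from typing import Dict, List, Mapping

# Column names copied from the table above; this mapping is illustrative only.
EXPECTED_COLUMNS: Dict[str, List[str]] = {
    "HallucinationEvaluator": ["input", "reference", "output"],
    "RelevanceEvaluator": ["input", "reference"],
    "QAEvaluator": ["input", "reference", "output"],
    "ToxicityEvaluator": ["input"],
    "SummarizationEvaluator": ["output", "input"],
    "SQLEvaluator": ["question", "query_gen", "response"],
}


def missing_columns(evaluator_name: str, record: Mapping[str, str]) -> List[str]:
    """Return the expected keys that the record does not provide."""
    return [col for col in EXPECTED_COLUMNS[evaluator_name] if col not in record]


record = {"input": "What is 2 + 2?", "output": "4"}
gaps = missing_columns("QAEvaluator", record)
```

An empty list means the record carries every key the chosen template needs.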

MapReducer

Method	Parameters	Returns
`__init__(model, map_prompt_template, reduce_prompt_template)`	`model: BaseModel`, `map_prompt_template: PromptTemplate` (must contain `{chunk}`), `reduce_prompt_template: PromptTemplate` (must contain `{mapped}`)	None
`evaluate(chunks)`	`chunks: List[str]` (minimum 2)	`str` - combined evaluation result
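
The control flow behind this contract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Phoenix implementation: the model is any callable from prompt to text, templates are plain format strings, and joining the mapped outputs with newlines is an assumption made for the example. `map_reduce` and `fake_model` are hypothetical names.

```python
from typing import Callable, List


def map_reduce(
    model: Callable[[str], str],
    map_template: str,
    reduce_template: str,
    chunks: List[str],
) -> str:
    """Run the map prompt on every chunk, then one reduce prompt over all outputs."""
    if len(chunks) < 2:
        raise ValueError("map-reduce needs at least two chunks")
    # Map step: each chunk is evaluated independently of the others.
    mapped = [model(map_template.format(chunk=chunk)) for chunk in chunks]
    # Reduce step: the per-chunk outputs are combined in a single call.
    return model(reduce_template.format(mapped="\n".join(mapped)))


calls: List[str] = []


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM: records the prompt, returns a tag."""
    calls.append(prompt)
    return f"r{len(calls)}"


result = map_reduce(fake_model, "Summarize: {chunk}", "Combine:\n{mapped}", ["a", "b"])
```

Because the map calls do not depend on one another, they could run in parallel. The reduce call must wait for all of them to finish.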

Refiner

Method	Parameters	Returns
`__init__(model, initial_prompt_template, refine_prompt_template, synthesize_prompt_template)`	`model: BaseModel`, `initial_prompt_template: PromptTemplate` (contains `{chunk}`), `refine_prompt_template: PromptTemplate` (contains `{chunk}` and `{accumulator}`), `synthesize_prompt_template: Optional[PromptTemplate]` (contains `{accumulator}`)	None
`evaluate(chunks)`	`chunks: List[str]` (minimum 2)	`str` - final refined evaluation result
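
The refine strategy differs from map-reduce in that each step depends on the previous one. The sketch below traces that dependency, assuming a callable model and plain format-string templates. `refine` and `echo_model` are hypothetical names, and the code is not taken from the Phoenix source.

```python
from typing import Callable, List, Optional


def refine(
    model: Callable[[str], str],
    initial_template: str,
    refine_template: str,
    chunks: List[str],
    synthesize_template: Optional[str] = None,
) -> str:
    """Seed an accumulator from the first chunk, then fold in the rest in order."""
    if len(chunks) < 2:
        raise ValueError("refine needs at least two chunks")
    accumulator = model(initial_template.format(chunk=chunks[0]))
    for chunk in chunks[1:]:
        # Each call sees the running result plus exactly one new chunk.
        accumulator = model(
            refine_template.format(chunk=chunk, accumulator=accumulator)
        )
    if synthesize_template is None:
        return accumulator
    # Optional final pass over the accumulated result.
    return model(synthesize_template.format(accumulator=accumulator))


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM that returns its prompt, which makes the trace visible."""
    return prompt


plain = refine(echo_model, "A:{chunk}", "{accumulator}+{chunk}", ["a", "b", "c"])
synthesized = refine(
    echo_model, "A:{chunk}", "{accumulator}+{chunk}", ["a", "b", "c"], "S({accumulator})"
)
```

The echo model shows that chunks are consumed strictly in order, so unlike the map step of MapReducer this loop cannot be parallelized.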

Usage Examples

from phoenix.evals.legacy.evaluators import (
    HallucinationEvaluator,
    RelevanceEvaluator,
    MapReducer,
    Refiner,
)
from phoenix.evals.legacy.models import OpenAIModel
from phoenix.evals.legacy.templates import PromptTemplate

model = OpenAIModel(model="gpt-4")

# Single record evaluation
evaluator = HallucinationEvaluator(model=model)
label, score, explanation = evaluator.evaluate(
    record={
        "input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
        "output": "The capital of France is London.",
    },
    provide_explanation=True,
)
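# Expected label: "hallucinated", because the output contradicts the reference.
# score is the numeric value the template assigns to that label;
# explanation holds the model's rationale when provide_explanation=True.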

# Long-context evaluation with MapReducer
map_template = PromptTemplate(template="Summarize the following chunk:\n{chunk}")
reduce_template = PromptTemplate(
    template="Combine these summaries into a final assessment:\n{mapped}"
)
map_reducer = MapReducer(
    model=model,
    map_prompt_template=map_template,
    reduce_prompt_template=reduce_template,
)
result = map_reducer.evaluate(["chunk 1 text...", "chunk 2 text..."])

# Long-context evaluation with Refiner
initial_template = PromptTemplate(template="Analyze this section:\n{chunk}")
refine_template = PromptTemplate(
    template="Update your analysis with new information:\n"
    "Previous analysis: {accumulator}\nNew section: {chunk}"
)
refiner = Refiner(
    model=model,
    initial_prompt_template=initial_template,
    refine_prompt_template=refine_template,
)
result = refiner.evaluate(["section 1...", "section 2...", "section 3..."])

Related Pages

Arize_ai_Phoenix_Legacy_Classify - run_evals() orchestrates these evaluators across DataFrames
Arize_ai_Phoenix_Legacy_Default_Templates - Predefined ClassificationTemplate constants used by subclasses
Arize_ai_Phoenix_Legacy_Templates - ClassificationTemplate and PromptTemplate base classes
Arize_ai_Phoenix_Legacy_Utils - Rail snapping and OpenAI function call parsing utilities
