Implementation:Arize ai Phoenix Legacy Evaluators

From Leeroopedia

LLM_Evaluation AI_Observability

Overview

The Legacy Evaluators module provides a hierarchy of LLM-based evaluator classes for the Phoenix Evals subsystem. The base class LLMEvaluator encapsulates a model and a classification template, providing both synchronous (evaluate()) and asynchronous (aevaluate()) evaluation methods that return a label, score, and optional explanation for a single record.

Six task-specific evaluator subclasses extend LLMEvaluator with preconfigured templates from EvalCriteria: HallucinationEvaluator, RelevanceEvaluator, QAEvaluator, ToxicityEvaluator, SummarizationEvaluator, and SQLEvaluator.
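
The "base class plus preconfigured subclass" pattern described above can be sketched without Phoenix installed. Everything in this block (SketchTemplate, SketchEvaluator, SketchToxicityEvaluator, fake_model, the prompt wording, and the NOT_PARSABLE sentinel) is a hypothetical stand-in for illustration and is not the Phoenix implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping

# Hypothetical stand-ins: these are NOT the Phoenix classes, only an
# illustration of a base evaluator with a preconfigured subclass.


@dataclass(frozen=True)
class SketchTemplate:
    prompt: str       # format string with {placeholders}
    rails: List[str]  # allowed output labels


class SketchEvaluator:
    """Base class: holds a model callable and a template."""

    def __init__(self, model: Callable[[str], str], template: SketchTemplate) -> None:
        self._model = model
        self._template = template

    def evaluate(self, record: Mapping[str, str]) -> str:
        prompt = self._template.prompt.format(**record)
        raw = self._model(prompt).strip().lower()
        # Anything outside the rails is reported with a sentinel label.
        return raw if raw in self._template.rails else "NOT_PARSABLE"


class SketchToxicityEvaluator(SketchEvaluator):
    """Subclass: fixes the template so callers only supply a model."""

    def __init__(self, model: Callable[[str], str]) -> None:
        super().__init__(
            model,
            SketchTemplate(
                prompt="Is the following text toxic?\n{input}\nAnswer toxic or non-toxic.",
                rails=["toxic", "non-toxic"],
            ),
        )


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch runs offline.
    return "toxic" if "idiot" in prompt else "non-toxic"


evaluator = SketchToxicityEvaluator(fake_model)
label = evaluator.evaluate({"input": "Have a nice day."})
```

The subclass adds no behavior of its own; it only binds a template, which is why the task-specific evaluators can be constructed from a model alone.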

The module also defines two long-context strategies for data that exceeds a single LLM context window: MapReducer (evaluates each chunk independently then combines results) and Refiner (iteratively refines an accumulator across sequential chunks with an optional synthesis step).

Code Reference

Attribute | Details
--- | ---
Source File | packages/phoenix-evals/src/phoenix/evals/legacy/evaluators.py
Repository | Arize-ai/phoenix
Lines | 447
Module | phoenix.evals.legacy.evaluators
Key Symbols | LLMEvaluator, HallucinationEvaluator, RelevanceEvaluator, QAEvaluator, ToxicityEvaluator, SummarizationEvaluator, SQLEvaluator, MapReducer, Refiner
Dependencies | phoenix.evals.default_templates.EvalCriteria, phoenix.evals.models.BaseModel, phoenix.evals.models.OpenAIModel, phoenix.evals.templates.ClassificationTemplate, phoenix.evals.utils

I/O Contract

LLMEvaluator

Method | Parameters | Returns
--- | --- | ---
__init__(model, template) | model: BaseModel, template: ClassificationTemplate | None
evaluate(record, ...) | record: Mapping[str, str], provide_explanation: bool, use_function_calling_if_available: bool, verbose: bool | Tuple[str, Optional[float], Optional[str]]: (label, score, explanation)
aevaluate(record, ...) | Same as evaluate | Same as evaluate (async)
reload_client() | None | None; reloads the underlying model client
default_concurrency | (property) | int: default concurrency from the model
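
To make the (label, score, explanation) return shape concrete, here is a minimal sketch of turning raw model text into that tuple. The function to_result, the NOT_PARSABLE sentinel, and the idea of pairing each rail with a numeric score are assumptions made for illustration; the actual parsing logic in Phoenix may differ.

```python
from typing import Optional, Sequence, Tuple


def to_result(
    raw_output: str,
    rails: Sequence[str],
    scores: Sequence[float],
    explanation: Optional[str] = None,
) -> Tuple[str, Optional[float], Optional[str]]:
    """Map raw model text to (label, score, explanation).

    The score is None when the output matches none of the rails.
    """
    cleaned = raw_output.strip().lower()
    for rail, score in zip(rails, scores):
        if cleaned == rail.lower():
            return rail, float(score), explanation
    # Hypothetical sentinel for output that is not one of the rails.
    return "NOT_PARSABLE", None, explanation


result = to_result(
    " Hallucinated\n",
    rails=["hallucinated", "factual"],
    scores=[1, 0],
    explanation="contradicts the reference",
)
```

Both the score and the explanation are optional in the tuple, so callers should handle None for either before doing arithmetic or string operations on them.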

Task-Specific Evaluators

Class | Template (via EvalCriteria) | Expected DataFrame Columns | Rails
--- | --- | --- | ---
HallucinationEvaluator | HALLUCINATION | input, reference, output | ["hallucinated", "factual"]
RelevanceEvaluator | RELEVANCE | input, reference | ["relevant", "unrelated"]
QAEvaluator | QA | input, reference, output | ["correct", "incorrect"]
ToxicityEvaluator | TOXICITY | input | ["toxic", "non-toxic"]
SummarizationEvaluator | SUMMARIZATION | output, input | ["good", "bad"]
SQLEvaluator | SQL_GEN_EVAL | question, query_gen, response | ["correct", "incorrect"]
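
Because each evaluator expects different record keys, it can help to check a record before sending it to a model. The sketch below copies the expected columns from the table above into a plain dictionary; EXPECTED_COLUMNS and missing_columns are hypothetical helpers, not part of the Phoenix API.

```python
from typing import Dict, List, Mapping

# Required record keys per evaluator, copied from the table above.
EXPECTED_COLUMNS: Dict[str, List[str]] = {
    "HallucinationEvaluator": ["input", "reference", "output"],
    "RelevanceEvaluator": ["input", "reference"],
    "QAEvaluator": ["input", "reference", "output"],
    "ToxicityEvaluator": ["input"],
    "SummarizationEvaluator": ["output", "input"],
    "SQLEvaluator": ["question", "query_gen", "response"],
}


def missing_columns(evaluator_name: str, record: Mapping[str, str]) -> List[str]:
    """Return the expected keys that are absent from the record."""
    return [col for col in EXPECTED_COLUMNS[evaluator_name] if col not in record]


record = {
    "input": "What is the capital of France?",
    "reference": "Paris is the capital of France.",
}
missing = missing_columns("QAEvaluator", record)
```

Here the record is complete for RelevanceEvaluator but lacks the output key that QAEvaluator needs.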

MapReducer

Method | Parameters | Returns
--- | --- | ---
__init__(model, map_prompt_template, reduce_prompt_template) | model: BaseModel, map_prompt_template: PromptTemplate (must contain {chunk}), reduce_prompt_template: PromptTemplate (must contain {mapped}) | None
evaluate(chunks) | chunks: List[str] (minimum 2) | str: combined evaluation result
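
The map-reduce strategy can be sketched in a few lines of plain Python. This is an illustration of the control flow only, not the Phoenix source: map_reduce and fake_model are hypothetical names, templates are ordinary format strings rather than PromptTemplate objects, and joining the mapped outputs with " | " is an arbitrary choice made for the example.

```python
from typing import Callable, List


def map_reduce(
    model: Callable[[str], str],
    chunks: List[str],
    map_template: str,
    reduce_template: str,
) -> str:
    """Evaluate each chunk independently, then combine the results."""
    if len(chunks) < 2:
        raise ValueError("map-reduce needs at least two chunks")
    # Map step: one independent model call per chunk.
    mapped = [model(map_template.format(chunk=chunk)) for chunk in chunks]
    # Reduce step: a single call over all mapped outputs.
    return model(reduce_template.format(mapped=" | ".join(mapped)))


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call: returns the prompt's last line in upper case.
    return prompt.splitlines()[-1].upper()


result = map_reduce(
    fake_model,
    ["alpha", "beta"],
    map_template="Summarize:\n{chunk}",
    reduce_template="Combine:\n{mapped}",
)
```

Since the map calls do not depend on each other, this strategy can be parallelized, at the cost of each chunk being judged without knowledge of the others.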

Refiner

Method | Parameters | Returns
--- | --- | ---
__init__(model, initial_prompt_template, refine_prompt_template, synthesize_prompt_template) | model: BaseModel, initial_prompt_template: PromptTemplate (contains {chunk}), refine_prompt_template: PromptTemplate (contains {chunk} and {accumulator}), synthesize_prompt_template: Optional[PromptTemplate] (contains {accumulator}) | None
evaluate(chunks) | chunks: List[str] (minimum 2) | str: final refined evaluation result
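
The refine strategy differs from map-reduce in that each step sees the result of the previous one. The sketch below shows that sequencing under stated assumptions: refine and fake_model are hypothetical names, templates are ordinary format strings, and the fake model simply wraps its prompt in brackets so the call order is visible in the output. It is not the Phoenix source.

```python
from typing import Callable, List, Optional


def refine(
    model: Callable[[str], str],
    chunks: List[str],
    initial_template: str,
    refine_template: str,
    synthesize_template: Optional[str] = None,
) -> str:
    """Fold the chunks into an accumulator, one model call per chunk."""
    if len(chunks) < 2:
        raise ValueError("refine needs at least two chunks")
    # The first chunk seeds the accumulator.
    accumulator = model(initial_template.format(chunk=chunks[0]))
    # Each later chunk updates the accumulator in order.
    for chunk in chunks[1:]:
        accumulator = model(
            refine_template.format(accumulator=accumulator, chunk=chunk)
        )
    # Optional final pass over the accumulated result.
    if synthesize_template is None:
        return accumulator
    return model(synthesize_template.format(accumulator=accumulator))


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call: wraps the prompt so nesting shows call order.
    return f"[{prompt}]"


result = refine(fake_model, ["a", "b", "c"], "{chunk}", "{accumulator}+{chunk}")
final = refine(
    fake_model, ["a", "b", "c"], "{chunk}", "{accumulator}+{chunk}", "final:{accumulator}"
)
```

The calls are strictly sequential, so this approach cannot be parallelized across chunks, but later chunks are evaluated with the context of earlier ones.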

Usage Examples

from phoenix.evals.legacy.evaluators import (
    HallucinationEvaluator,
    RelevanceEvaluator,
    MapReducer,
    Refiner,
)
from phoenix.evals.legacy.models import OpenAIModel
from phoenix.evals.legacy.templates import PromptTemplate

model = OpenAIModel(model="gpt-4")

# Single record evaluation
evaluator = HallucinationEvaluator(model=model)
label, score, explanation = evaluator.evaluate(
    record={
        "input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
        "output": "The capital of France is London.",
    },
    provide_explanation=True,
)
# label = "hallucinated", score = 1.0, explanation = "..."

# Long-context evaluation with MapReducer
map_template = PromptTemplate(template="Summarize the following chunk:\n{chunk}")
reduce_template = PromptTemplate(
    template="Combine these summaries into a final assessment:\n{mapped}"
)
map_reducer = MapReducer(
    model=model,
    map_prompt_template=map_template,
    reduce_prompt_template=reduce_template,
)
result = map_reducer.evaluate(["chunk 1 text...", "chunk 2 text..."])

# Long-context evaluation with Refiner
initial_template = PromptTemplate(template="Analyze this section:\n{chunk}")
refine_template = PromptTemplate(
    template="Update your analysis with new information:\n"
    "Previous analysis: {accumulator}\nNew section: {chunk}"
)
refiner = Refiner(
    model=model,
    initial_prompt_template=initial_template,
    refine_prompt_template=refine_template,
)
result = refiner.evaluate(["section 1...", "section 2...", "section 3..."])
