Overview
The Legacy Evaluators module provides a hierarchy of LLM-based evaluator classes for the Phoenix Evals subsystem. The base class LLMEvaluator wraps a model and a classification template and exposes a synchronous method (evaluate()) and an asynchronous method (aevaluate()). Each evaluates a single record and returns a (label, score, explanation) tuple in which the score and the explanation may be None.
Six task-specific evaluator subclasses extend LLMEvaluator with preconfigured templates from EvalCriteria: HallucinationEvaluator, RelevanceEvaluator, QAEvaluator, ToxicityEvaluator, SummarizationEvaluator, and SQLEvaluator.
The module also defines two long-context strategies for data that exceeds a single LLM context window: MapReducer (evaluates each chunk independently then combines results) and Refiner (iteratively refines an accumulator across sequential chunks with an optional synthesis step).
Code Reference
| Attribute | Details |
| --- | --- |
| Source File | packages/phoenix-evals/src/phoenix/evals/legacy/evaluators.py |
| Repository | Arize-ai/phoenix |
| Lines | 447 |
| Module | phoenix.evals.legacy.evaluators |
| Key Symbols | LLMEvaluator, HallucinationEvaluator, RelevanceEvaluator, QAEvaluator, ToxicityEvaluator, SummarizationEvaluator, SQLEvaluator, MapReducer, Refiner |
| Dependencies | phoenix.evals.default_templates.EvalCriteria, phoenix.evals.models.BaseModel, phoenix.evals.models.OpenAIModel, phoenix.evals.templates.ClassificationTemplate, phoenix.evals.utils |
I/O Contract
LLMEvaluator
| Method | Parameters | Returns |
| --- | --- | --- |
| __init__(model, template) | model: BaseModel, template: ClassificationTemplate | None |
| evaluate(record, ...) | record: Mapping[str, str], provide_explanation: bool, use_function_calling_if_available: bool, verbose: bool | Tuple[str, Optional[float], Optional[str]] - (label, score, explanation) |
| aevaluate(record, ...) | Same as evaluate | Same as evaluate (async) |
| reload_client() | None | None - reloads the underlying model client |
| default_concurrency | Property | int - default concurrency from the model |
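The asynchronous path and the default_concurrency property are typically used together when many records are scored at once. The following is a minimal sketch of that pattern, not the module's own batching code: StubEvaluator and evaluate_all are hypothetical names, and the stub replaces the LLM call with a simple substring check so the example runs without a model.

```python
import asyncio
from typing import List, Mapping, Optional, Tuple

EvalResult = Tuple[str, Optional[float], Optional[str]]


class StubEvaluator:
    """Hypothetical stand-in with the same method shape as LLMEvaluator."""

    default_concurrency = 2  # assumption: a real evaluator reads this from its model

    async def aevaluate(
        self, record: Mapping[str, str], provide_explanation: bool = False
    ) -> EvalResult:
        await asyncio.sleep(0)  # placeholder for the real LLM request
        label = "factual" if record["output"] in record["reference"] else "hallucinated"
        score = 0.0 if label == "factual" else 1.0
        explanation = "stub explanation" if provide_explanation else None
        return label, score, explanation


async def evaluate_all(
    evaluator: StubEvaluator,
    records: List[Mapping[str, str]],
    provide_explanation: bool = False,
) -> List[EvalResult]:
    """Score all records, running at most default_concurrency calls at a time."""
    semaphore = asyncio.Semaphore(evaluator.default_concurrency)

    async def run_one(record: Mapping[str, str]) -> EvalResult:
        async with semaphore:
            return await evaluator.aevaluate(
                record, provide_explanation=provide_explanation
            )

    # gather() returns results in the same order as the input records.
    return list(await asyncio.gather(*(run_one(r) for r in records)))
```

With a real evaluator, the same evaluate_all helper could be driven by asyncio.run(), provided the evaluator exposes aevaluate() and default_concurrency as listed in the table.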
Task-Specific Evaluators
| Class | Template (via EvalCriteria) | Expected DataFrame Columns | Rails |
| --- | --- | --- | --- |
| HallucinationEvaluator | HALLUCINATION | input, reference, output | ["hallucinated", "factual"] |
| RelevanceEvaluator | RELEVANCE | input, reference | ["relevant", "unrelated"] |
| QAEvaluator | QA | input, reference, output | ["correct", "incorrect"] |
| ToxicityEvaluator | TOXICITY | input | ["toxic", "non-toxic"] |
| SummarizationEvaluator | SUMMARIZATION | output, input | ["good", "bad"] |
| SQLEvaluator | SQL_GEN_EVAL | question, query_gen, response | ["correct", "incorrect"] |
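Because each evaluator expects a different set of record keys, it can help to check a record before sending it to the model. The sketch below is a hypothetical pre-flight helper, not part of phoenix.evals: REQUIRED_KEYS simply transcribes the columns listed in the table above, and missing_keys is an illustrative name.

```python
from typing import Dict, List, Mapping, Tuple

# Required record keys per evaluator, transcribed from the table above.
REQUIRED_KEYS: Dict[str, Tuple[str, ...]] = {
    "HallucinationEvaluator": ("input", "reference", "output"),
    "RelevanceEvaluator": ("input", "reference"),
    "QAEvaluator": ("input", "reference", "output"),
    "ToxicityEvaluator": ("input",),
    "SummarizationEvaluator": ("output", "input"),
    "SQLEvaluator": ("question", "query_gen", "response"),
}


def missing_keys(evaluator_name: str, record: Mapping[str, str]) -> List[str]:
    """Return the required keys that the record lacks, in table order."""
    return [key for key in REQUIRED_KEYS[evaluator_name] if key not in record]
```

An empty list means the record carries every key the named evaluator's template is documented to need.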
MapReducer
| Method | Parameters | Returns |
| --- | --- | --- |
| __init__(model, map_prompt_template, reduce_prompt_template) | model: BaseModel, map_prompt_template: PromptTemplate (must contain {chunk}), reduce_prompt_template: PromptTemplate (must contain {mapped}) | None |
| evaluate(chunks) | chunks: List[str] (minimum 2) | str - combined evaluation result |
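The map-reduce control flow can be illustrated without any model at all. This is a minimal sketch under stated assumptions: map_reduce is a hypothetical function, llm is any callable from prompt to completion, plain str.format templates stand in for PromptTemplate, and joining the mapped outputs with newlines is an illustrative choice rather than a claim about how MapReducer serializes them.

```python
from typing import Callable, List


def map_reduce(
    llm: Callable[[str], str],
    chunks: List[str],
    map_template: str,
    reduce_template: str,
) -> str:
    """Evaluate each chunk independently, then combine the outputs in one call."""
    if len(chunks) < 2:
        # Mirrors the documented minimum of two chunks.
        raise ValueError("expected at least two chunks")
    # Map step: one independent LLM call per chunk.
    mapped = [llm(map_template.format(chunk=chunk)) for chunk in chunks]
    # Reduce step: a single LLM call over all mapped outputs.
    # Assumption: outputs are joined with newlines for illustration only.
    return llm(reduce_template.format(mapped="\n".join(mapped)))
```

Because the map calls do not depend on each other, they could be issued in parallel; the reduce call must wait for all of them.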
Refiner
| Method | Parameters | Returns |
| --- | --- | --- |
| __init__(model, initial_prompt_template, refine_prompt_template, synthesize_prompt_template) | model: BaseModel, initial_prompt_template: PromptTemplate (contains {chunk}), refine_prompt_template: PromptTemplate (contains {chunk} and {accumulator}), synthesize_prompt_template: Optional[PromptTemplate] (contains {accumulator}) | None |
| evaluate(chunks) | chunks: List[str] (minimum 2) | str - final refined evaluation result |
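The refine strategy differs from map-reduce in that each call depends on the previous one. A minimal sketch of that loop follows, assuming a hypothetical refine function, an llm callable from prompt to completion, and plain str.format templates in place of PromptTemplate; it is an illustration of the technique, not the Refiner source.

```python
from typing import Callable, List, Optional


def refine(
    llm: Callable[[str], str],
    chunks: List[str],
    initial_template: str,
    refine_template: str,
    synthesize_template: Optional[str] = None,
) -> str:
    """Carry an accumulator through the chunks in order, then optionally synthesize."""
    if len(chunks) < 2:
        # Mirrors the documented minimum of two chunks.
        raise ValueError("expected at least two chunks")
    # The first chunk seeds the accumulator.
    accumulator = llm(initial_template.format(chunk=chunks[0]))
    # Each later chunk updates the accumulator produced so far.
    for chunk in chunks[1:]:
        accumulator = llm(
            refine_template.format(chunk=chunk, accumulator=accumulator)
        )
    if synthesize_template is None:
        return accumulator
    # Optional final pass over the finished accumulator.
    return llm(synthesize_template.format(accumulator=accumulator))
```

The calls are strictly sequential, so latency grows with the number of chunks, but each step sees a running summary of everything before it.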
Usage Examples
from phoenix.evals.legacy.evaluators import (
    HallucinationEvaluator,
    RelevanceEvaluator,
    MapReducer,
    Refiner,
)
from phoenix.evals.legacy.models import OpenAIModel
from phoenix.evals.legacy.templates import PromptTemplate

model = OpenAIModel(model="gpt-4")

# Single record evaluation
evaluator = HallucinationEvaluator(model=model)
label, score, explanation = evaluator.evaluate(
    record={
        "input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
        "output": "The capital of France is London.",
    },
    provide_explanation=True,
)
# label = "hallucinated", score = 1.0, explanation = "..."

# Long-context evaluation with MapReducer
map_template = PromptTemplate(template="Summarize the following chunk:\n{chunk}")
reduce_template = PromptTemplate(
    template="Combine these summaries into a final assessment:\n{mapped}"
)
map_reducer = MapReducer(
    model=model,
    map_prompt_template=map_template,
    reduce_prompt_template=reduce_template,
)
result = map_reducer.evaluate(["chunk 1 text...", "chunk 2 text..."])

# Long-context evaluation with Refiner
initial_template = PromptTemplate(template="Analyze this section:\n{chunk}")
refine_template = PromptTemplate(
    template="Update your analysis with new information:\n"
    "Previous analysis: {accumulator}\nNew section: {chunk}"
)
refiner = Refiner(
    model=model,
    initial_prompt_template=initial_template,
    refine_prompt_template=refine_template,
)
result = refiner.evaluate(["section 1...", "section 2...", "section 3..."])
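Both long-context strategies take a list of at least two chunks, so the caller has to split the source text first. The helper below is one hypothetical way to do that; split_into_chunks is not part of phoenix.evals, and the character budget is only a rough proxy for a token limit.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            # The word fits, or it starts a new chunk on its own.
            current = candidate
        else:
            # The word would overflow: close this chunk and start the next.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

If the helper returns fewer than two chunks, the text already fits the chosen budget and a single-record evaluator may be the simpler choice.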