Implementation:Arize ai Phoenix Evaluation DataFrame Schema
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Data Engineering, DataFrame Construction |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete pattern for constructing pandas.DataFrame instances whose columns align with Phoenix evaluator input fields, enabling seamless execution of evaluate_dataframe() and async_evaluate_dataframe().
Description
Phoenix evaluators discover their required input fields at construction time (from prompt templates, function signatures, or explicit Pydantic schemas). When evaluate_dataframe() is called, each row of the DataFrame is converted to a dictionary and validated against the evaluator's input schema. This implementation documents the conventions for constructing DataFrames that pass validation, and the techniques for bridging schema mismatches using bind_evaluator().
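The row-by-row conversion described above can be previewed with plain pandas before any evaluator runs. This is a minimal sketch, not Phoenix's own validation code: `rows_missing_fields` is a hypothetical helper, and it assumes a field counts as missing when its column is absent or its (scalar) value is null.

```python
import pandas as pd


def rows_missing_fields(df: pd.DataFrame, required: list[str]) -> dict[int, list[str]]:
    """Hypothetical pre-check: map row position -> required fields that are absent or null."""
    problems: dict[int, list[str]] = {}
    # to_dict(orient="records") yields one dictionary per row, keyed by column name.
    for position, record in enumerate(df.to_dict(orient="records")):
        missing = [
            field for field in required
            if field not in record or pd.isna(record[field])
        ]
        if missing:
            problems[position] = missing
    return problems


df = pd.DataFrame({
    "question": ["What is AI?", None],
    "answer": ["AI is artificial intelligence.", "An answer without a question."],
})
problems = rows_missing_fields(df, required=["question", "answer"])
print(problems)
```

An empty result suggests every row carries the required fields; it does not guarantee that the evaluator's own schema validation will pass.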
Usage
Use this pattern when you are:
- Building a test dataset for LLM evaluation from raw logs, exported traces, or manual annotations.
- Preparing data from a production observability pipeline for offline evaluation.
- Combining inputs for multiple evaluators into a single DataFrame.
- Troubleshooting validation errors that arise when column names do not match evaluator expectations.
Code Reference
Source Location
- Repository: Phoenix
- File: User code (no single source file); relies on conventions enforced by packages/phoenix-evals/src/phoenix/evals/evaluators.py
Import
import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    bind_evaluator,
    evaluate_dataframe,
    LLM,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| DataFrame columns | pd.Series (per column) | Yes (for required fields) | Each column name must correspond to a required input field declared by the evaluator, unless an input_mapping is bound. |
| Column data types | str (for LLM evaluators), any type (for code evaluators) | Yes | LLM evaluators coerce all input fields to EnforcedString. Code evaluators accept the types declared in their function signatures. |
Outputs
| Name | Type | Description |
|---|---|---|
| Prepared DataFrame | pd.DataFrame | A DataFrame ready to be passed to evaluate_dataframe() or async_evaluate_dataframe(). |
Input Field Discovery
The following table summarizes how each evaluator type discovers its required input fields and how to determine the expected column names:
| Evaluator Type | Field Discovery Mechanism | How to Inspect |
|---|---|---|
| ClassificationEvaluator / LLMEvaluator | Template variable names (e.g., {question}, {answer}) | evaluator.prompt_template.variables |
| create_evaluator-decorated function | Function parameter names | evaluator.input_schema.model_fields |
| Any evaluator with explicit input_schema | Pydantic model field names | evaluator.input_schema.model_fields |
| Any evaluator with bind_evaluator | Mapping target keys | Inspect the bound _input_mapping dictionary |
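For template-based evaluators, the expected column names can also be previewed from the template string itself using only the standard library. This is a rough sketch that assumes the template uses Python `str.format`-style `{name}` placeholders, as the examples in this document do; `template_fields` is a hypothetical helper and not part of Phoenix.

```python
from string import Formatter


def template_fields(template: str) -> list[str]:
    """Hypothetical helper: return placeholder names in order of first appearance."""
    names: list[str] = []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion) tuples.
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


template = "Question: {question}\nAnswer: {answer}\nRate relevance."
fields = template_fields(template)
print(fields)
```

Comparing this list with `df.columns` is a quick way to spot a naming mismatch before any LLM call is made.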
Usage Examples
Direct Column Match
import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=(
        "Question: {question}\nAnswer: {answer}\n"
        "Rate the relevance of the answer to the question."
    ),
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame columns ("question", "answer") match template variables exactly
df = pd.DataFrame({
    "question": [
        "What is photosynthesis?",
        "Explain gravity.",
    ],
    "answer": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})
Remapping Columns with bind_evaluator
import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

llm = LLM(provider="openai", model="gpt-4o")
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame has "query" and "response" columns instead of "question" and "answer"
df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Remap columns to match evaluator expectations
bound_evaluator = bind_evaluator(
    evaluator=evaluator,
    input_mapping={"question": "query", "answer": "response"},
)
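When changing the data itself is acceptable, the same mismatch can instead be resolved on the pandas side by renaming columns. This is a minimal sketch using only pandas. Note that `input_mapping` is keyed by evaluator field, while `DataFrame.rename` expects old column name to new column name, so the mapping is inverted first. It assumes every mapping value is a plain column name (no callables).

```python
import pandas as pd

# Same shape as the input_mapping above: evaluator field -> DataFrame column.
input_mapping = {"question": "query", "answer": "response"}

df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Invert to column -> evaluator field, which is what DataFrame.rename expects.
column_renames = {column: field for field, column in input_mapping.items()}
renamed = df.rename(columns=column_renames)
print(list(renamed.columns))
```

Renaming returns a new DataFrame and leaves the original untouched, whereas binding keeps the data as-is and moves the translation into the evaluator.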
Computed Fields with Lambda Mappings
import pandas as pd
from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="context_check")
def context_check(question: str, context: str) -> bool:
    # True when the first word of the question appears in the context.
    # An empty question returns False instead of raising IndexError.
    words = question.lower().split()
    return bool(words) and words[0] in context.lower()

# DataFrame has multiple document columns that must be merged
df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "doc_1": ["AI is artificial intelligence.", "ML uses data."],
    "doc_2": ["AI mimics human thinking.", "ML finds patterns."],
})

bound = bind_evaluator(
    evaluator=context_check,
    input_mapping={
        "question": "question",
        "context": lambda row: row["doc_1"] + " " + row["doc_2"],
    },
)
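To make the mapping semantics concrete, the sketch below resolves a mapping of the same shape against a single row dictionary in plain Python. It only illustrates the idea that string values name a column while callables receive the whole row; `resolve_inputs` is a hypothetical function and is not how Phoenix implements `bind_evaluator`.

```python
from typing import Any, Callable, Mapping, Union

# A mapping value is either a column name or a function of the whole row.
Source = Union[str, Callable[[Mapping[str, Any]], Any]]


def resolve_inputs(row: Mapping[str, Any], mapping: Mapping[str, Source]) -> dict[str, Any]:
    """Hypothetical illustration: build evaluator inputs from one row."""
    resolved: dict[str, Any] = {}
    for field, source in mapping.items():
        # Callables compute a value from the full row; strings look up a column.
        resolved[field] = source(row) if callable(source) else row[source]
    return resolved


row = {
    "question": "What is AI?",
    "doc_1": "AI is artificial intelligence.",
    "doc_2": "AI mimics human thinking.",
}
inputs = resolve_inputs(
    row,
    {
        "question": "question",
        "context": lambda r: r["doc_1"] + " " + r["doc_2"],
    },
)
print(inputs["context"])
```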
Multi-Evaluator DataFrame
import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    evaluate_dataframe,
    LLM,
)

llm = LLM(provider="openai", model="gpt-4o")

# LLM evaluator expects "question" and "answer"
relevance = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# Code evaluator also expects "answer"
@create_evaluator(name="answer_length")
def answer_length(answer: str) -> int:
    return len(answer.split())

# Single DataFrame serves both evaluators
df = pd.DataFrame({
    "question": ["What is AI?", "Explain ML."],
    "answer": [
        "AI is artificial intelligence that mimics human cognition.",
        "ML is a subset of AI that learns from data.",
    ],
})

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance, answer_length],
)
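Before running several evaluators over one DataFrame, it can help to confirm that the columns cover every evaluator's fields. This is a small sketch with hand-written field lists; in practice they could be read from the inspection attributes listed under Input Field Discovery. `missing_columns` is a hypothetical helper, and the `context_check` entry is included only to show a gap being reported.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, required_by: dict[str, list[str]]) -> dict[str, list[str]]:
    """Hypothetical helper: evaluator name -> required fields with no matching column."""
    available = set(df.columns)
    gaps: dict[str, list[str]] = {}
    for name, fields in required_by.items():
        absent = [field for field in fields if field not in available]
        if absent:
            gaps[name] = absent
    return gaps


df = pd.DataFrame({
    "question": ["What is AI?", "Explain ML."],
    "answer": ["AI is artificial intelligence.", "ML learns from data."],
})

# Field lists written by hand for this sketch.
required_by = {
    "relevance": ["question", "answer"],
    "answer_length": ["answer"],
    "context_check": ["question", "context"],
}
gaps = missing_columns(df, required_by)
print(gaps)
```

Evaluators that appear in the result need either an extra column or a bound input mapping before the DataFrame is evaluated.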
Cleaning Data Before Evaluation
import pandas as pd

df = pd.DataFrame({
    "question": ["What is AI?", None, "Explain ML."],
    "answer": ["AI is...", "Missing question", "ML is..."],
})

# Drop rows with null required fields
df_clean = df.dropna(subset=["question", "answer"])

# Drop rows whose required fields are empty or whitespace-only strings
df_clean = df_clean[df_clean["question"].str.strip() != ""]
df_clean = df_clean[df_clean["answer"].str.strip() != ""]
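The cleaning steps above can be folded into one reusable function. This sketch assumes a required field is unusable when it is null or, for string values, empty after stripping whitespace. Non-string values are left alone so the helper also suits code evaluators with non-string inputs. `drop_unusable_rows` is a hypothetical name.

```python
import pandas as pd


def drop_unusable_rows(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose required fields are non-null and non-blank."""
    cleaned = df.dropna(subset=required)
    for column in required:
        # Only string values can be blank; other types pass through unchanged.
        is_blank = cleaned[column].map(
            lambda value: isinstance(value, str) and value.strip() == ""
        ).astype(bool)
        cleaned = cleaned[~is_blank]
    return cleaned


df = pd.DataFrame({
    "question": ["What is AI?", None, "   ", "Explain ML."],
    "answer": ["AI is...", "Missing question", "Blank question", "ML is..."],
})
df_clean = drop_unusable_rows(df, required=["question", "answer"])
print(df_clean["question"].tolist())
```

The original index is preserved, so surviving rows can still be traced back to their position in the source data.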