Implementation:Arize ai Phoenix Evaluation DataFrame Schema

Knowledge Sources	Phoenix
Domains	LLM Evaluation, Data Engineering, DataFrame Construction
Last Updated	2026-02-14 00:00 GMT

Overview

A concrete pattern for constructing pandas.DataFrame instances whose column names match the input fields that Phoenix evaluators expect, so that evaluate_dataframe() and async_evaluate_dataframe() can run without input validation errors.

Description

Phoenix evaluators discover their required input fields at construction time (from prompt templates, function signatures, or explicit Pydantic schemas). When evaluate_dataframe() is called, each row of the DataFrame is converted to a dictionary and validated against the evaluator's input schema. This page documents the conventions for constructing DataFrames that pass validation and the techniques for bridging schema mismatches with bind_evaluator().
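
The row-to-dictionary step described above can be previewed with plain pandas before any evaluator runs. The sketch below is an illustration only and is not Phoenix's internal validation code; `missing_fields` is a hypothetical helper, and the required field names are assumed to be known already.

```python
import pandas as pd


def missing_fields(df: pd.DataFrame, required: set[str]) -> set[str]:
    """Return required field names that have no matching column.

    Hypothetical pre-flight helper; not part of phoenix.evals.
    """
    return required - set(df.columns)


df = pd.DataFrame({"question": ["What is AI?"], "response": ["AI is..."]})

# Each row becomes a plain dict keyed by column name.
rows = df.to_dict(orient="records")

# "answer" is required here, but the DataFrame only has "response".
missing = missing_fields(df, {"question", "answer"})
```

A non-empty `missing` set signals that either the columns need renaming or an `input_mapping` should be bound before evaluation.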

Usage

Use this pattern when you are:

Building a test dataset for LLM evaluation from raw logs, exported traces, or manual annotations.
Preparing data from a production observability pipeline for offline evaluation.
Combining inputs for multiple evaluators into a single DataFrame.
Troubleshooting validation errors that arise when column names do not match evaluator expectations.

Code Reference

Source Location

Repository: Phoenix
File: User code (no single source file); relies on conventions enforced by packages/phoenix-evals/src/phoenix/evals/evaluators.py

Import

import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    bind_evaluator,
    evaluate_dataframe,
    LLM,
)

I/O Contract

Inputs

Name	Type	Required	Description
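DataFrame columns	`pd.Series` (per column)	Yes (for required fields)	Every required input field declared by the evaluator must have a column of the same name, unless an `input_mapping` is bound. Columns that an evaluator does not use may still be present, as in the multi-evaluator example.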
Column data types	`str` (for LLM evaluators), any type (for code evaluators)	Yes	LLM evaluators coerce all input fields to `EnforcedString`. Code evaluators accept the types declared in their function signatures.

Outputs

Name	Type	Description
Prepared DataFrame	`pd.DataFrame`	A DataFrame ready to be passed to `evaluate_dataframe()` or `async_evaluate_dataframe()`.

Input Field Discovery

The following table summarizes how each evaluator type discovers its required input fields and how to determine the expected column names:

Evaluator Type	Field Discovery Mechanism	How to Inspect
`ClassificationEvaluator` / `LLMEvaluator`	Template variable names (e.g., `{question}`, `{answer}`)	`evaluator.prompt_template.variables`
`create_evaluator`-decorated function	Function parameter names	`evaluator.input_schema.model_fields`
Any evaluator with explicit `input_schema`	Pydantic model field names	`evaluator.input_schema.model_fields`
Any evaluator with `bind_evaluator`	Mapping target keys	Inspect the bound `_input_mapping` dictionary
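
To make the first two discovery mechanisms in the table concrete, the following standard-library sketch extracts field names from a `{name}`-style template and from a function signature. It is an approximation under stated assumptions, not Phoenix's implementation: `template_fields` and `signature_fields` are hypothetical helpers, and Phoenix's own template parsing may handle formats that this sketch does not.

```python
import inspect
from string import Formatter


def template_fields(template: str) -> list[str]:
    """Collect named placeholders from a str.format-style template, in order."""
    fields: list[str] = []
    for _literal, name, _spec, _conversion in Formatter().parse(template):
        # name is None for trailing literal text and "" for bare "{}".
        if name and name not in fields:
            fields.append(name)
    return fields


def signature_fields(fn) -> list[str]:
    """List a function's parameter names in declaration order."""
    return list(inspect.signature(fn).parameters)


def context_check(question: str, context: str) -> bool:
    return question.lower().split()[0] in context.lower()


template_cols = template_fields(
    "Question: {question}\nAnswer: {answer}\nRate relevance."
)
function_cols = signature_fields(context_check)
```

The resulting lists are the column names a DataFrame would need for a direct match with each evaluator.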

Usage Examples

Direct Column Match

import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=(
        "Question: {question}\nAnswer: {answer}\n"
        "Rate the relevance of the answer to the question."
    ),
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame columns ("question", "answer") match template variables exactly
df = pd.DataFrame({
    "question": [
        "What is photosynthesis?",
        "Explain gravity.",
    ],
    "answer": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

Remapping Columns with bind_evaluator

import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame has "query" and "response" columns instead of "question" and "answer"
df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Remap columns to match evaluator expectations
bound_evaluator = bind_evaluator(
    evaluator=evaluator,
    input_mapping={"question": "query", "answer": "response"},
)
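
When an evaluator does not need to stay bound to a mapping, renaming the columns with pandas is a possible alternative way to get the same alignment. This is a sketch of that alternative, not a replacement for bind_evaluator(); it assumes the same `input_mapping` direction as above (evaluator field to DataFrame column), so the mapping is inverted before it is passed to `DataFrame.rename`.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Same direction as the bind_evaluator mapping: evaluator field -> column name.
input_mapping = {"question": "query", "answer": "response"}

# DataFrame.rename expects old name -> new name, so invert the mapping.
rename_map = {column: field for field, column in input_mapping.items()}

# rename returns a new DataFrame; the original is left unchanged.
df_renamed = df.rename(columns=rename_map)
```

Renaming changes the data rather than the evaluator, so it suits one-off preparation; binding keeps the source DataFrame intact when several evaluators expect different names for the same column.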

Computed Fields with Lambda Mappings

import pandas as pd
from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="context_check")
def context_check(question: str, context: str) -> bool:
    return question.lower().split()[0] in context.lower()

# DataFrame has multiple document columns that must be merged
df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "doc_1": ["AI is artificial intelligence.", "ML uses data."],
    "doc_2": ["AI mimics human thinking.", "ML finds patterns."],
})

bound = bind_evaluator(
    evaluator=context_check,
    input_mapping={
        "question": "question",
        "context": lambda row: row["doc_1"] + " " + row["doc_2"],
    },
)
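
Conceptually, a mapping that mixes column names and callables can be resolved against one row as sketched below. This is a simplified illustration of the idea, not the resolution logic Phoenix uses; `resolve_inputs` is a hypothetical helper, and any richer mapping features Phoenix may support are out of scope here.

```python
from typing import Any, Callable, Mapping, Union

# A mapping value is either a column name or a callable that receives the row.
MappingValue = Union[str, Callable[[Mapping[str, Any]], Any]]


def resolve_inputs(
    row: Mapping[str, Any],
    input_mapping: Mapping[str, MappingValue],
) -> dict[str, Any]:
    """Build evaluator inputs from one row (hypothetical helper)."""
    resolved: dict[str, Any] = {}
    for field, source in input_mapping.items():
        # Callables compute a value from the whole row; strings name a column.
        resolved[field] = source(row) if callable(source) else row[source]
    return resolved


row = {
    "question": "What is AI?",
    "doc_1": "AI is artificial intelligence.",
    "doc_2": "AI mimics human thinking.",
}
mapping = {
    "question": "question",
    "context": lambda r: r["doc_1"] + " " + r["doc_2"],
}
inputs = resolve_inputs(row, mapping)
```

The resolved dictionary contains only the fields the evaluator declares, which is why the extra `doc_1` and `doc_2` columns do not need to be dropped from the DataFrame.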

Multi-Evaluator DataFrame

import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    evaluate_dataframe,
    LLM,
)

llm = LLM(provider="openai", model="gpt-4o")

# LLM evaluator expects "question" and "answer"
relevance = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# Code evaluator also expects "answer"
@create_evaluator(name="answer_length")
def answer_length(answer: str) -> int:
    return len(answer.split())

# Single DataFrame serves both evaluators
df = pd.DataFrame({
    "question": ["What is AI?", "Explain ML."],
    "answer": [
        "AI is artificial intelligence that mimics human cognition.",
        "ML is a subset of AI that learns from data.",
    ],
})

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance, answer_length],
)

Cleaning Data Before Evaluation

import pandas as pd

df = pd.DataFrame({
    "question": ["What is AI?", None, "Explain ML."],
    "answer": ["AI is...", "Missing question", "ML is..."],
})

# Drop rows with null required fields
df_clean = df.dropna(subset=["question", "answer"])

# Drop rows whose required fields are empty or whitespace-only strings
df_clean = df_clean[df_clean["question"].str.strip() != ""]
df_clean = df_clean[df_clean["answer"].str.strip() != ""]
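
The same cleaning steps can be wrapped in a small reusable function. The sketch below is one possible shape under stated assumptions: `drop_blank_rows` is a hypothetical helper, and it casts values with `astype(str)` only for the blank check, so a numeric column does not break the `.str` accessor. The original index is kept so that results can be joined back to the source rows.

```python
import pandas as pd


def drop_blank_rows(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Drop rows where any required column is null or whitespace-only.

    Hypothetical helper; the original index is preserved.
    """
    cleaned = df.dropna(subset=required)
    for column in required:
        # Cast only for the comparison; stored values are not modified.
        is_blank = cleaned[column].astype(str).str.strip() == ""
        cleaned = cleaned[~is_blank]
    return cleaned


df = pd.DataFrame({
    "question": ["What is AI?", None, "   ", "Explain ML."],
    "answer": ["AI is...", "Missing question", "Blank question", "ML is..."],
})

df_clean = drop_blank_rows(df, ["question", "answer"])
```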

Related Pages

Implements Principle

Principle:Arize_ai_Phoenix_Evaluation_Data_Preparation
