Implementation:Arize ai Phoenix Evaluation DataFrame Schema

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Data Engineering, DataFrame Construction
Last Updated: 2026-02-14 00:00 GMT

Overview

A concrete pattern for constructing pandas.DataFrame instances whose column names match the input fields of Phoenix evaluators, so that evaluate_dataframe() and async_evaluate_dataframe() run without input validation errors.

Description

Phoenix evaluators discover their required input fields at construction time (from prompt templates, function signatures, or explicit Pydantic schemas). When evaluate_dataframe() is called, each row of the DataFrame is converted to a dictionary and validated against the evaluator's input schema. This implementation documents the conventions for constructing DataFrames that pass validation, and the techniques for bridging schema mismatches using bind_evaluator().
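The row-by-row validation described above can be approximated offline with plain pandas before any evaluator runs. The sketch below is not Phoenix's internal code: REQUIRED_FIELDS and find_invalid_rows are hypothetical names introduced for illustration, and the check only covers whether a field is present and non-null, not the evaluator's full input schema.

```python
import pandas as pd

# Assumption: the evaluator's required field names, copied by hand for this sketch.
REQUIRED_FIELDS = ["question", "answer"]


def find_invalid_rows(df: pd.DataFrame, required: list[str]) -> dict[int, list[str]]:
    """Map row position -> required fields that are absent or null in that row."""
    problems: dict[int, list[str]] = {}
    # Each row becomes a plain dict keyed by column name.
    for position, row in enumerate(df.to_dict(orient="records")):
        bad = [name for name in required if name not in row or pd.isna(row[name])]
        if bad:
            problems[position] = bad
    return problems


df = pd.DataFrame({
    "question": ["What is AI?", None],
    "answer": ["AI is...", "Orphaned answer"],
})
invalid_rows = find_invalid_rows(df, REQUIRED_FIELDS)
print(invalid_rows)
```

An empty result suggests that every row carries the fields listed in REQUIRED_FIELDS; it does not guarantee that the evaluator's own validation will pass.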

Usage

Use this pattern when you are:

  • Building a test dataset for LLM evaluation from raw logs, exported traces, or manual annotations.
  • Preparing data from a production observability pipeline for offline evaluation.
  • Combining inputs for multiple evaluators into a single DataFrame.
  • Troubleshooting validation errors that arise when column names do not match evaluator expectations.

Code Reference

Source Location

  • Repository: Phoenix
  • File: User code (no single source file); relies on conventions enforced by packages/phoenix-evals/src/phoenix/evals/evaluators.py

Import

import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    bind_evaluator,
    evaluate_dataframe,
    LLM,
)

I/O Contract

Inputs

Name | Type | Required | Description
DataFrame columns | pd.Series (per column) | Yes (for required fields) | Each column name must correspond to a required input field declared by the evaluator, unless an input_mapping is bound.
Column data types | str (for LLM evaluators), any type (for code evaluators) | Yes | LLM evaluators coerce all input fields to EnforcedString. Code evaluators accept the types declared in their function signatures.
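Because LLM evaluators expect string inputs, it can help to convert non-string columns explicitly rather than rely on coercion. The following is a minimal pandas-only sketch; stringify_columns is a hypothetical helper, not part of Phoenix, and it deliberately leaves null values untouched so they can still be dropped later.

```python
import pandas as pd


def stringify_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy in which non-null values of the given columns are str."""
    out = df.copy()
    for name in columns:
        # Keep nulls as nulls; str(None) would otherwise become the text "None".
        out[name] = out[name].map(lambda v: v if pd.isna(v) else str(v))
    return out


raw = pd.DataFrame({
    "question": ["What is 2 + 2?", "What is 3 * 3?"],
    "answer": [4, 9],  # numeric answers, e.g. parsed from a log
})
prepared = stringify_columns(raw, ["question", "answer"])
print(prepared["answer"].tolist())
```

The helper returns a copy, so the raw DataFrame keeps its original types for any code evaluator that needs them.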

Outputs

Name | Type | Description
Prepared DataFrame | pd.DataFrame | A DataFrame ready to be passed to evaluate_dataframe() or async_evaluate_dataframe().

Input Field Discovery

The following table summarizes how each evaluator type discovers its required input fields and how to determine the expected column names:

Evaluator Type | Field Discovery Mechanism | How to Inspect
ClassificationEvaluator / LLMEvaluator | Template variable names (e.g., {question}, {answer}) | evaluator.prompt_template.variables
create_evaluator-decorated function | Function parameter names | evaluator.input_schema.model_fields
Any evaluator with explicit input_schema | Pydantic model field names | evaluator.input_schema.model_fields
Any evaluator with bind_evaluator | Mapping target keys | Inspect the bound _input_mapping dictionary
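To illustrate how template variables translate into expected column names, the sketch below extracts placeholder names from a brace-style template using only the standard library. It is an illustration, not Phoenix's own parser: template_fields is a hypothetical helper, and it assumes str.format-style placeholders like the ones used in the examples on this page.

```python
from string import Formatter


def template_fields(template: str) -> list[str]:
    """Return placeholder names in order of first appearance, without duplicates."""
    seen: list[str] = []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


template = "Question: {question}\nAnswer: {answer}\nRate relevance."
expected_columns = template_fields(template)
print(expected_columns)  # ['question', 'answer']
```

For a real evaluator, the inspection attributes listed in the table remain the authoritative source of field names.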

Usage Examples

Direct Column Match

import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=(
        "Question: {question}\nAnswer: {answer}\n"
        "Rate the relevance of the answer to the question."
    ),
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame columns ("question", "answer") match template variables exactly
df = pd.DataFrame({
    "question": [
        "What is photosynthesis?",
        "Explain gravity.",
    ],
    "answer": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

Remapping Columns with bind_evaluator

import pandas as pd
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# DataFrame has "query" and "response" columns instead of "question" and "answer"
df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Remap columns to match evaluator expectations
bound_evaluator = bind_evaluator(
    evaluator=evaluator,
    input_mapping={"question": "query", "answer": "response"},
)
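When binding is not needed, a similar effect can be had by renaming the columns themselves. This is a pandas-only sketch of that alternative, not something the Phoenix API requires. Note the direction: input_mapping is keyed by evaluator field, while DataFrame.rename is keyed by existing column name, so the mapping is inverted first.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["What is photosynthesis?", "Explain gravity."],
    "response": [
        "Photosynthesis converts sunlight into chemical energy.",
        "I like pizza.",
    ],
})

# Same mapping as passed to bind_evaluator: evaluator field -> DataFrame column.
input_mapping = {"question": "query", "answer": "response"}

# rename() expects existing column -> new name, so invert the mapping.
column_renames = {column: field for field, column in input_mapping.items()}

renamed = df.rename(columns=column_renames)
print(list(renamed.columns))  # ['question', 'answer']
```

Renaming changes the data for every consumer of the DataFrame, whereas a bound evaluator leaves the DataFrame untouched; the latter is usually preferable when several evaluators expect different names for the same column.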

Computed Fields with Lambda Mappings

import pandas as pd
from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="context_check")
def context_check(question: str, context: str) -> bool:
    return question.lower().split()[0] in context.lower()

# DataFrame has multiple document columns that must be merged
df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "doc_1": ["AI is artificial intelligence.", "ML uses data."],
    "doc_2": ["AI mimics human thinking.", "ML finds patterns."],
})

bound = bind_evaluator(
    evaluator=context_check,
    input_mapping={
        "question": "question",
        "context": lambda row: row["doc_1"] + " " + row["doc_2"],
    },
)
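As a mental model for the mapping above, the sketch below resolves a mapping of column names and callables against each row. It is an assumption-laden illustration rather than Phoenix's implementation: resolve_inputs is a hypothetical helper, it assumes each callable receives the row as a dictionary, and it does not model any other mapping forms Phoenix may support.

```python
from typing import Any, Callable, Mapping, Union

import pandas as pd

MappingValue = Union[str, Callable[[Mapping[str, Any]], Any]]


def resolve_inputs(
    row: Mapping[str, Any], input_mapping: Mapping[str, MappingValue]
) -> dict[str, Any]:
    """Build evaluator inputs from one row: strings are looked up, callables are called."""
    resolved: dict[str, Any] = {}
    for field, source in input_mapping.items():
        resolved[field] = source(row) if callable(source) else row[source]
    return resolved


df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "doc_1": ["AI is artificial intelligence.", "ML uses data."],
    "doc_2": ["AI mimics human thinking.", "ML finds patterns."],
})
mapping = {
    "question": "question",
    "context": lambda row: row["doc_1"] + " " + row["doc_2"],
}
resolved_rows = [resolve_inputs(row, mapping) for row in df.to_dict(orient="records")]
print(resolved_rows[0]["context"])
```

Previewing the resolved inputs this way makes it easier to spot a lambda that raises on null values before any evaluation is run.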

Multi-Evaluator DataFrame

import pandas as pd
from phoenix.evals import (
    ClassificationEvaluator,
    create_evaluator,
    evaluate_dataframe,
    LLM,
)

llm = LLM(provider="openai", model="gpt-4o")

# LLM evaluator expects "question" and "answer"
relevance = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Question: {question}\nAnswer: {answer}\nRate relevance.",
    choices={"relevant": 1.0, "not_relevant": 0.0},
)

# Code evaluator also expects "answer"
@create_evaluator(name="answer_length")
def answer_length(answer: str) -> int:
    return len(answer.split())

# Single DataFrame serves both evaluators
df = pd.DataFrame({
    "question": ["What is AI?", "Explain ML."],
    "answer": [
        "AI is artificial intelligence that mimics human cognition.",
        "ML is a subset of AI that learns from data.",
    ],
})

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance, answer_length],
)
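Before sharing one DataFrame across several evaluators, a quick column-coverage check can show which evaluator would fail and why. The sketch below uses only pandas; missing_columns and the hand-written required_fields dictionary are hypothetical, and in practice the field names would come from inspecting each evaluator.

```python
import pandas as pd

# Assumption: field names per evaluator, written out by hand for this sketch.
required_fields = {
    "relevance": ["question", "answer"],
    "answer_length": ["answer"],
}


def missing_columns(df: pd.DataFrame, required: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map evaluator name -> required fields with no matching DataFrame column."""
    gaps: dict[str, list[str]] = {}
    for evaluator_name, fields in required.items():
        absent = [field for field in fields if field not in df.columns]
        if absent:
            gaps[evaluator_name] = absent
    return gaps


incomplete = pd.DataFrame({"question": ["What is AI?", "Explain ML."]})
gaps = missing_columns(incomplete, required_fields)
print(gaps)
```

An empty dictionary means every listed field has a column of the same name; any remaining mismatch would then need a bound input mapping.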

Cleaning Data Before Evaluation

import pandas as pd

df = pd.DataFrame({
    "question": ["What is AI?", None, "Explain ML."],
    "answer": ["AI is...", "Missing question", "ML is..."],
})

# Drop rows with null required fields
df_clean = df.dropna(subset=["question", "answer"])

# Drop rows whose required fields are empty or whitespace-only strings
df_clean = df_clean[df_clean["question"].str.strip() != ""]
df_clean = df_clean[df_clean["answer"].str.strip() != ""]

Related Pages

Implements Principle