Principle:Arize ai Phoenix Evaluation Data Preparation

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Data Engineering, DataFrame Construction
Last Updated: 2026-02-14 00:00 GMT

Overview

Evaluation data preparation is the practice of structuring raw LLM inputs, outputs, and reference materials into a tabular format whose column names align with the input fields expected by each evaluator in the pipeline.

Description

An LLM evaluation pipeline consumes structured data: each row represents one evaluation case, and each column provides a field that one or more evaluators need. If the column names do not match the evaluator's expected input field names, the pipeline raises a validation error or produces incorrect results.

Evaluation data preparation ensures that:

  • Column names match evaluator input fields. Every evaluator declares its required input fields either through an explicit Pydantic input_schema, through variables extracted from a prompt template, or through the parameter names of a decorated function. The DataFrame columns must match these field names or be remapped through bind_evaluator().
  • Data types are compatible. LLM evaluators expect string values for prompt template placeholders (enforced via EnforcedString coercion), while code-based evaluators accept whatever types their function signatures declare.
  • Missing data is handled. Required fields that are missing from a row will cause validation failures. Optional fields (those with defaults in the Pydantic schema) may be absent.
  • Multiple evaluators share the same DataFrame. A single DataFrame can be evaluated by multiple evaluators simultaneously. Each evaluator reads only the columns it needs, making it possible to include data for several evaluators in one table.

Usage

Use evaluation data preparation when you need to:

  • Construct a DataFrame from LLM application logs, exported traces, or manually curated test cases.
  • Validate that your data schema matches the evaluators you intend to run before executing the pipeline.
  • Rename or transform columns to bridge the gap between your data model and the evaluator's expected fields.
  • Combine data for multiple evaluators into a single DataFrame to avoid redundant storage and processing.
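
As a rough illustration of validating a schema before a run, the sketch below compares a list of column names against the fields each evaluator needs. It uses only the Python standard library and does not call Phoenix. The helper `find_missing_columns`, the evaluator names, and the field sets are hypothetical and chosen for this example.

```python
from typing import Dict, Iterable, List, Set


def find_missing_columns(
    columns: Iterable[str],
    required_fields: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Return, per evaluator, the required fields that have no matching column."""
    available = set(columns)
    report: Dict[str, List[str]] = {}
    for evaluator_name, fields in required_fields.items():
        missing = sorted(fields - available)
        if missing:
            report[evaluator_name] = missing
    return report


# Hypothetical evaluator names and field sets, for illustration only.
required = {
    "relevance": {"input", "document"},
    "exact_match": {"output", "expected"},
}
# With pandas, this list would typically come from list(df.columns).
columns = ["input", "output", "document"]

missing_report = find_missing_columns(columns, required)
print(missing_report)  # {'exact_match': ['expected']}
```

An empty report means every evaluator can find its fields by name. A non-empty report points at columns to add, rename, or remap.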

Theoretical Basis

Input Field Resolution

The Phoenix evaluator framework resolves required input fields through three mechanisms, applied in order of priority:

1. Explicit input_schema (Pydantic BaseModel)
   --> Required fields = model fields where field.is_required() is True

2. Prompt template variables (LLMEvaluator / ClassificationEvaluator)
   --> Required fields = set of placeholder variable names in the template
   --> A dynamic Pydantic model is created with all variables as required str fields

3. Function parameter names (create_evaluator decorator)
   --> Required fields = parameters from inspect.signature(fn)
   --> A dynamic Pydantic model is created preserving types and defaults

If no input schema or input mapping is available, the evaluator raises a ValueError.
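
Mechanisms 2 and 3 can be approximated with the standard library. This is a minimal sketch, not Phoenix's implementation. It assumes `str.format`-style `{name}` placeholders, whereas the real framework builds Pydantic models. The helpers `template_fields` and `function_fields` and the sample evaluator `exact_match` are hypothetical.

```python
import inspect
from string import Formatter
from typing import Callable, Set


def template_fields(template: str) -> Set[str]:
    """Collect placeholder names from a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def function_fields(fn: Callable) -> Set[str]:
    """Collect named parameters without defaults; these act as required fields."""
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return {
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.default is inspect.Parameter.empty and p.kind not in variadic
    }


def exact_match(output: str, expected: str, case_sensitive: bool = True) -> bool:
    # Hypothetical code-based evaluator body.
    if not case_sensitive:
        output, expected = output.lower(), expected.lower()
    return output == expected


prompt = "Question: {input}\nAnswer: {output}\nIs the answer correct?"
print(sorted(template_fields(prompt)))       # ['input', 'output']
print(sorted(function_fields(exact_match)))  # ['expected', 'output']
```

Here `case_sensitive` has a default, so it is treated as optional. This mirrors the distinction between required fields and fields with defaults described above.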

Input Validation and Remapping

When evaluate() or async_evaluate() is called, the framework:

  1. Determines the set of required fields from the input schema or input mapping.
  2. Applies the input mapping (if bound via bind_evaluator()) to remap and/or transform field values from the provided record to the evaluator's expected field names.
  3. Validates the remapped input against the Pydantic schema, coercing types where possible (e.g., for LLM evaluators, non-string values are coerced to strings through the EnforcedString type).
  4. Passes the validated input to the evaluator's _evaluate() method.
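
The four steps above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the framework's code: the mapping only renames keys, validation is a presence check rather than a Pydantic model, and every value is coerced with `str()`. The function `run_evaluation` and the sample evaluator `contains_answer` are hypothetical.

```python
from typing import Any, Callable, Dict, Set


def run_evaluation(
    record: Dict[str, Any],
    required: Set[str],              # Step 1: required fields, already determined
    input_mapping: Dict[str, str],   # evaluator field -> record key
    evaluate_fn: Callable[..., Any],
) -> Any:
    # Step 2: remap record keys to evaluator field names. Unmapped fields
    # fall back to a record key with the same name.
    remapped: Dict[str, Any] = {}
    for field in required:
        source_key = input_mapping.get(field, field)
        if source_key in record:
            remapped[field] = record[source_key]

    # Step 3a: check that every required field is present and not None.
    missing = sorted(f for f in required if remapped.get(f) is None)
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    # Step 3b: coerce values to str, as a prompt template would need.
    validated = {field: str(value) for field, value in remapped.items()}

    # Step 4: hand the validated input to the evaluator logic.
    return evaluate_fn(**validated)


def contains_answer(output: str, expected: str) -> bool:
    # Hypothetical evaluator: does the output mention the expected answer?
    return expected in output


record = {"model_response": "The answer is 42", "expected": 42}
result = run_evaluation(
    record,
    required={"output", "expected"},
    input_mapping={"output": "model_response"},
    evaluate_fn=contains_answer,
)
print(result)  # True
```

The integer `42` only matches because it is coerced to the string `"42"` before the evaluator runs, which is the role type coercion plays in step 3.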

Column-to-Field Alignment Patterns

Scenario | Solution
DataFrame columns exactly match evaluator fields | No additional action needed.
Columns have different names | Use bind_evaluator(evaluator, input_mapping={"eval_field": "df_column"}).
A field must be computed from multiple columns | Use a callable in the mapping: {"field": lambda row: row["col_a"] + " " + row["col_b"]}.
Nested data must be flattened | Use dot-path strings in the mapping: {"field": "response.text"}.
A field has a default value | Declare the field as optional in a custom Pydantic input_schema.

Data Quality Considerations

  • Empty strings pass validation but may produce low-quality LLM evaluations. Consider filtering or flagging rows with empty required fields before evaluation.
  • Very long text may exceed model context windows. Truncation or summarization preprocessing may be needed.
  • Null/NaN values in required columns will fail validation. Clean these before passing to evaluate_dataframe().
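
A pre-evaluation screen for these three conditions might look like the sketch below. It is standard-library Python operating on dictionaries, not part of Phoenix. The function `screen_row` and the `max_chars` threshold are hypothetical, and character count is only a crude proxy for a model's token limit.

```python
import math
from typing import Any, Dict, Iterable, List


def screen_row(
    row: Dict[str, Any],
    required: Iterable[str],
    max_chars: int = 8000,  # hypothetical limit; tune per model
) -> List[str]:
    """List data-quality issues found in one row's required fields."""
    issues: List[str] = []
    for field in required:
        value = row.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            issues.append(f"{field}: missing")
        elif isinstance(value, str) and not value.strip():
            issues.append(f"{field}: empty")
        elif isinstance(value, str) and len(value) > max_chars:
            issues.append(f"{field}: too long")
    return issues


rows = [
    {"input": "Hi", "output": "Hello"},
    {"input": "", "output": float("nan")},
    {"input": "x" * 10, "output": "ok"},
]
issues_per_row = [screen_row(r, ["input", "output"], max_chars=5) for r in rows]
print(issues_per_row)
# [[], ['input: empty', 'output: missing'], ['input: too long']]
```

Rows with a non-empty issue list can be dropped, repaired, or truncated before the DataFrame is handed to the evaluators.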
