Principle:Arize ai Phoenix Evaluation Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Data Engineering, DataFrame Construction |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation data preparation is the practice of structuring raw LLM inputs, outputs, and reference materials into a tabular format whose column names align with the input fields expected by each evaluator in the pipeline.
Description
An LLM evaluation pipeline consumes structured data: each row represents one evaluation case, and each column provides a field that one or more evaluators need. If the column names do not match the evaluator's expected input field names, the pipeline raises a validation error or produces incorrect results.
Evaluation data preparation ensures that:
- Column names match evaluator input fields. Every evaluator declares its required input fields either through an explicit Pydantic input_schema, through variables extracted from a prompt template, or through the parameter names of a decorated function. The DataFrame columns must match these field names or be remapped through bind_evaluator().
- Data types are compatible. LLM evaluators expect string values for prompt template placeholders (enforced via EnforcedString coercion), while code-based evaluators accept whatever types their function signatures declare.
- Missing data is handled. Required fields that are missing from a row will cause validation failures. Optional fields (those with defaults in the Pydantic schema) may be absent.
- Multiple evaluators share the same DataFrame. A single DataFrame can be evaluated by multiple evaluators simultaneously. Each evaluator reads only the columns it needs, making it possible to include data for several evaluators in one table.
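The alignment requirement can be checked before any evaluator runs. The following is a minimal sketch using only the standard library, not part of the Phoenix API: the helper `missing_fields` and the evaluator names `relevance` and `exact_match` are hypothetical, and the required-field sets are written out by hand rather than read from real evaluators.

```python
from typing import Dict, Iterable, Set


def missing_fields(
    columns: Iterable[str], required: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per evaluator name, the required fields absent from `columns`."""
    available = set(columns)
    gaps = {}
    for name, fields in required.items():
        absent = fields - available
        if absent:
            gaps[name] = absent
    return gaps


# One shared table serving two (hypothetical) evaluators.
columns = ["input", "output", "reference"]
required = {
    "relevance": {"input", "reference"},
    "exact_match": {"output", "expected"},
}

gaps = missing_fields(columns, required)
for name, absent in gaps.items():
    print(f"{name}: missing {sorted(absent)}")
```

An empty result means every evaluator can read what it needs from the shared table; a non-empty result points at the columns to add or remap.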
Usage
Use evaluation data preparation when you need to:
- Construct a DataFrame from LLM application logs, exported traces, or manually curated test cases.
- Validate that your data schema matches the evaluators you intend to run before executing the pipeline.
- Rename or transform columns to bridge the gap between your data model and the evaluator's expected fields.
- Combine data for multiple evaluators into a single DataFrame to avoid redundant storage and processing.
Theoretical Basis
Input Field Resolution
The Phoenix evaluator framework resolves required input fields through three mechanisms, applied in order of priority:
1. Explicit input_schema (Pydantic BaseModel)
--> Required fields = model fields where field.is_required() is True
2. Prompt template variables (LLMEvaluator / ClassificationEvaluator)
--> Required fields = set of placeholder variable names in the template
--> A dynamic Pydantic model is created with all variables as required str fields
3. Function parameter names (create_evaluator decorator)
--> Required fields = parameters from inspect.signature(fn)
--> A dynamic Pydantic model is created preserving types and defaults
If no input schema or input mapping is available, the evaluator raises a ValueError.
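The priority order above can be illustrated with a small standard-library sketch. It is an approximation for explanation only and does not reproduce Phoenix's implementation: the function `resolve_required_fields` is hypothetical, the explicit schema is represented as a plain list of field names instead of a Pydantic model, and the template is assumed to use single-brace `{name}` placeholders.

```python
import inspect
from string import Formatter
from typing import Callable, Iterable, Optional, Set


def template_variables(template: str) -> Set[str]:
    """Collect placeholder names from a single-brace format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def required_parameters(fn: Callable) -> Set[str]:
    """Collect parameters without defaults, skipping *args and **kwargs."""
    skip = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return {
        name
        for name, param in inspect.signature(fn).parameters.items()
        if param.default is inspect.Parameter.empty and param.kind not in skip
    }


def resolve_required_fields(
    schema_fields: Optional[Iterable[str]] = None,
    template: Optional[str] = None,
    fn: Optional[Callable] = None,
) -> Set[str]:
    """Apply the three mechanisms in priority order (hypothetical helper)."""
    if schema_fields is not None:      # 1. explicit schema wins
        return set(schema_fields)
    if template is not None:           # 2. prompt template variables
        return template_variables(template)
    if fn is not None:                 # 3. function parameter names
        return required_parameters(fn)
    raise ValueError("no input schema could be determined")


def exact_match(output, expected, case_sensitive=True):
    """Example code-based check; `case_sensitive` has a default, so it is optional."""
    if case_sensitive:
        return output == expected
    return output.lower() == expected.lower()


from_template = resolve_required_fields(template="Question: {input}\nAnswer: {output}")
from_function = resolve_required_fields(fn=exact_match)
```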
Input Validation and Remapping
When evaluate() or async_evaluate() is called, the framework:
- Determines the set of required fields from the input schema or input mapping.
- Applies the input mapping (if bound via bind_evaluator()) to remap and/or transform field values from the provided record to the evaluator's expected field names.
- Validates the remapped input against the Pydantic schema, coercing types where possible (e.g., non-string values are coerced to EnforcedString for LLM evaluators).
- Passes the validated input to the evaluator's _evaluate() method.
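The steps above can be sketched in plain Python. This is a simplified stand-in, not the framework's code: `prepare_input` and `run_evaluator` are hypothetical helpers, the mapping is limited to simple renames, and `str()` stands in for the EnforcedString coercion that the text describes for LLM evaluators.

```python
from typing import Any, Callable, Dict, Mapping, Optional, Set


def prepare_input(
    record: Mapping[str, Any],
    required: Set[str],
    mapping: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Remap, validate, and coerce one record (hypothetical helper)."""
    mapping = mapping or {}

    # Remap: look up each required field under its mapped source name,
    # falling back to the field name itself when no mapping is given.
    remapped = {}
    for field in required:
        source = mapping.get(field, field)
        if source in record:
            remapped[field] = record[source]

    # Validate: every required field must be present after remapping.
    missing = sorted(required - set(remapped))
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # Coerce: str() is an assumption standing in for EnforcedString.
    return {field: str(value) for field, value in remapped.items()}


def run_evaluator(
    evaluate_fn: Callable[..., Any],
    record: Mapping[str, Any],
    required: Set[str],
    mapping: Optional[Mapping[str, str]] = None,
) -> Any:
    """Pass the validated input on to the evaluation function."""
    return evaluate_fn(**prepare_input(record, required, mapping))


def char_counts(query: str, response: str) -> Dict[str, int]:
    """Toy evaluation function used only to show the call path."""
    return {"query_chars": len(query), "response_chars": len(response)}


record = {"question": "2+2?", "answer": 4}
result = run_evaluator(
    char_counts,
    record,
    required={"query", "response"},
    mapping={"query": "question", "response": "answer"},
)
```

Note that the integer `4` reaches the evaluation function as the string `"4"`, which mirrors the coercion step for prompt placeholders.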
Column-to-Field Alignment Patterns
| Scenario | Solution |
|---|---|
| DataFrame columns exactly match evaluator fields | No additional action needed. |
| Columns have different names | Use bind_evaluator(evaluator, input_mapping={"eval_field": "df_column"}). |
| A field must be computed from multiple columns | Use a callable in the mapping: {"field": lambda row: row["col_a"] + " " + row["col_b"]}. |
| Nested data must be flattened | Use dot-path strings in the mapping: {"field": "response.text"}. |
| A field has a default value | Declare the field as optional in a custom Pydantic input_schema. |
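To make the three mapping styles in the table concrete, here is a small standard-library sketch of how a mapping with plain names, dot-paths, and callables might be resolved against one record. The helpers `resolve_path` and `apply_mapping` are hypothetical and only approximate the behaviour described; Phoenix's own resolution rules may differ, for example in how it treats keys that contain a literal dot.

```python
from typing import Any, Callable, Dict, Mapping, Union

Source = Union[str, Callable[[Mapping[str, Any]], Any]]


def resolve_path(record: Mapping[str, Any], path: str) -> Any:
    """Resolve a column name or a dot-path such as 'response.text'."""
    if path in record:  # assumption: an exact key match takes precedence
        return record[path]
    current: Any = record
    for key in path.split("."):
        current = current[key]
    return current


def apply_mapping(
    record: Mapping[str, Any], mapping: Mapping[str, Source]
) -> Dict[str, Any]:
    """Build the evaluator's input dict from a record (hypothetical helper)."""
    resolved = {}
    for field, source in mapping.items():
        if callable(source):
            resolved[field] = source(record)      # computed field
        else:
            resolved[field] = resolve_path(record, source)  # rename or dot-path
    return resolved


row = {
    "question": "What is Phoenix?",
    "response": {"text": "An observability tool."},
    "col_a": "first",
    "col_b": "second",
}

mapped = apply_mapping(
    row,
    {
        "input": "question",                                   # rename
        "output": "response.text",                             # nested value
        "context": lambda r: r["col_a"] + " " + r["col_b"],    # computed
    },
)
```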
Data Quality Considerations
- Empty strings are valid but may produce low-quality LLM evaluations. Consider filtering or flagging rows with empty required fields before evaluation.
- Very long text may exceed model context windows. Truncation or summarization preprocessing may be needed.
- Null/NaN values in required columns will fail validation. Clean these before passing to evaluate_dataframe().
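A minimal preprocessing sketch for the three considerations above, using only the standard library and rows represented as dictionaries (for a pandas DataFrame, the same logic would apply to its records). The function `prepare_rows` and the `max_chars` limit are hypothetical; character truncation is only a rough proxy for a model's token limit.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def is_null(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def prepare_rows(
    rows: Sequence[Dict[str, Any]],
    required: Sequence[str],
    max_chars: int = 2000,
) -> Tuple[List[Dict[str, Any]], List[int]]:
    """Drop rows with null required fields, truncate long text, flag empties.

    Returns the kept rows and the positions (within the kept rows) of rows
    that contain an empty string in a required field.
    """
    kept: List[Dict[str, Any]] = []
    flagged: List[int] = []
    for row in rows:
        # Null/NaN in a required field would fail validation, so skip the row.
        if any(is_null(row.get(field)) for field in required):
            continue
        cleaned = dict(row)
        for field in required:
            value = cleaned[field]
            if isinstance(value, str):
                cleaned[field] = value[:max_chars]  # crude length cap
        # Empty strings are valid input, so keep the row but flag it.
        if any(cleaned[field] == "" for field in required):
            flagged.append(len(kept))
        kept.append(cleaned)
    return kept, flagged


rows = [
    {"input": "hi", "output": "hello there"},
    {"input": "", "output": "x"},
    {"input": "q", "output": None},
    {"input": "q", "output": float("nan")},
]
kept, flagged = prepare_rows(rows, ["input", "output"], max_chars=5)
```

Flagging rather than dropping empty-string rows keeps the decision with the person running the evaluation, since the text treats such rows as valid but potentially low quality.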