Principle:Mlflow Mlflow Evaluation Dataset Preparation

Knowledge Sources	MLflow GenAI Evaluation MLflow
Domains	ML_Ops, LLM_Evaluation
Last Updated	2026-02-13 20:00 GMT

Overview

Evaluation dataset preparation standardises heterogeneous evaluation data into a uniform tabular format so that scorers and evaluation harnesses can process it consistently.

Description

When evaluating generative AI applications, evaluation data can arrive in many forms: hand-crafted dictionaries, exported DataFrames, lists of recorded trace objects, managed dataset entities, or even Spark DataFrames from distributed pipelines. Each format encodes the same conceptual information -- model inputs, model outputs, ground-truth expectations, and execution traces -- but uses different structural conventions.

The principle of evaluation dataset preparation addresses this heterogeneity by defining a canonical schema that every downstream consumer can rely on. The schema centres on a small set of well-known columns: an inputs column carrying the dictionary of values sent to the model, an optional outputs column with the model's response, an optional expectations column holding ground-truth data for comparison, an optional trace column containing execution trace objects, and an optional tags column for metadata. By normalising all incoming data to this schema before evaluation begins, the rest of the pipeline can operate without branching on data format.
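As a minimal illustration of the canonical schema, a hand-built dataset can be written as a list of plain dictionaries, one per row. The column names follow the description above; the field values and the `unknown_columns` helper are hypothetical and only show the shape of a record.

```python
# Column names taken from the schema described above.
REQUIRED_ANY = {"inputs", "trace"}          # at least one of these must be present
OPTIONAL = {"outputs", "expectations", "tags"}
KNOWN_COLUMNS = REQUIRED_ANY | OPTIONAL

# Illustrative rows; the questions, answers and tag values are made up.
eval_rows = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open source platform for the ML lifecycle.",
        "expectations": {"expected_facts": ["open source", "ML lifecycle"]},
        "tags": {"source": "hand-written"},
    },
    {
        # outputs, expectations and tags are optional.
        "inputs": {"question": "What is a scorer?"},
    },
]


def unknown_columns(rows):
    """Return column names that are not part of the canonical schema."""
    seen = set()
    for row in rows:
        seen.update(row)
    return sorted(seen - KNOWN_COLUMNS)


print(unknown_columns(eval_rows))  # []
```

Flagging unknown columns early is one way to catch legacy or misspelled column names before they reach a scorer.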

A robust preparation step also validates the incoming data: it rejects empty datasets, ensures that at least one of inputs or trace is present, deserialises JSON-encoded columns when necessary, and extracts request-response pairs from trace objects when explicit inputs and outputs are absent. This defensive normalisation prevents subtle failures deep inside the scoring loop and gives clear, early error messages when the data contract is violated.
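The validation described above might look roughly like the following sketch. It works on a list of row dictionaries rather than a DataFrame, checks the inputs-or-trace rule per row (a stricter simplification of the column-level rule), and assumes that any string found in `inputs` or `expectations` is JSON. The function name `validate_rows` is hypothetical and is not part of MLflow.

```python
import json


def validate_rows(rows):
    """Validate and lightly normalise a list of row dictionaries."""
    if not rows:
        raise ValueError("Evaluation dataset is empty.")

    cleaned = []
    for i, row in enumerate(rows):
        row = dict(row)  # copy so the caller's data is not mutated
        if row.get("inputs") is None and row.get("trace") is None:
            raise ValueError(f"Row {i}: need at least one of 'inputs' or 'trace'.")
        for column in ("inputs", "expectations"):
            value = row.get(column)
            if isinstance(value, str):
                # Assumption: a string here is a JSON-encoded dictionary.
                try:
                    value = json.loads(value)
                except json.JSONDecodeError as exc:
                    raise ValueError(f"Row {i}: '{column}' is not valid JSON.") from exc
            if value is not None and not isinstance(value, dict):
                raise ValueError(f"Row {i}: '{column}' must be a dictionary.")
            if value is not None:
                row[column] = value
        cleaned.append(row)
    return cleaned


rows = validate_rows([{"inputs": '{"question": "hi"}'}])
print(rows[0]["inputs"])  # {'question': 'hi'}
```

Each error message names the offending row and column, so a violated data contract surfaces before any scorer runs.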

Usage

Apply this principle whenever assembling data for model evaluation. It is relevant when migrating data between storage formats (e.g., Spark to pandas), when replaying previously recorded traces through new scorers, or when constructing test datasets by hand during development. Any time evaluation data crosses a boundary between systems, explicit normalisation to the canonical schema prevents misinterpretation of column semantics.

Theoretical Basis

The core concept is schema normalisation: mapping disparate input representations into a single canonical form. In data engineering this is sometimes called a canonical data model pattern. The key invariants maintained by the normalisation are:

Column presence: at least one of inputs or trace must exist.
Column typing: inputs must be a dictionary (or a JSON string that deserialises to one), expectations must be a dictionary when present, and trace must be a Trace object.
Non-emptiness: the dataset must contain at least one row.
Derivation from trace: when a trace column is present but explicit inputs/outputs are missing, the normalisation step extracts them from the trace's root span.

Pseudocode for the normalisation:

function normalise(data):
    df = coerce_to_dataframe(data)    # handles list[dict], Spark DF, Trace list, etc.
    assert len(df) > 0                # reject empty datasets
    assert "inputs" in df.columns or "trace" in df.columns
    df = deserialise_trace_column(df)
    df = extract_inputs_outputs_from_trace(df)
    df = extract_expectations_from_trace(df)
    return df
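The pseudocode above can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the MLflow implementation: `FakeTrace` is a hypothetical stand-in for a real trace object (its attribute names are invented for the example), rows are dictionaries instead of DataFrame rows, and the trace deserialisation step is omitted because the stand-in is already an in-memory object.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeTrace:
    """Hypothetical stand-in for a trace; NOT the MLflow Trace class."""
    root_inputs: dict
    root_outputs: Any = None
    expectations: dict = field(default_factory=dict)


def coerce_to_rows(data):
    """Accept a list of dicts or a list of FakeTrace objects."""
    rows = []
    for item in data:
        rows.append({"trace": item} if isinstance(item, FakeTrace) else dict(item))
    return rows


def normalise(data):
    rows = coerce_to_rows(data)
    if not rows:
        raise ValueError("Evaluation dataset is empty.")
    columns = set().union(*(row.keys() for row in rows))
    if not columns & {"inputs", "trace"}:
        raise ValueError("Need an 'inputs' or 'trace' column.")
    for row in rows:
        trace = row.get("trace")
        if trace is None:
            continue
        # Fill only what the row does not already state explicitly.
        if row.get("inputs") is None:
            row["inputs"] = trace.root_inputs
        if row.get("outputs") is None:
            row["outputs"] = trace.root_outputs
        if row.get("expectations") is None and trace.expectations:
            row["expectations"] = trace.expectations
    return rows


trace = FakeTrace({"question": "hi"}, "hello", {"expected_response": "hello"})
result = normalise([trace])
print(result[0]["inputs"])   # {'question': 'hi'}
print(result[0]["outputs"])  # hello
```

Explicit values take precedence over values derived from the trace, so a row that supplies its own inputs keeps them while still picking up outputs and expectations from the trace.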

Related Pages

Implemented By

Implementation:Mlflow_Mlflow_Convert_To_Eval_Set
