Principle:Mlflow Mlflow Evaluation Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Evaluation dataset preparation standardises heterogeneous evaluation data into a uniform tabular format so that scorers and evaluation harnesses can process it consistently.
Description
When evaluating generative AI applications, evaluation data can arrive in many forms: hand-crafted dictionaries, exported DataFrames, lists of recorded trace objects, managed dataset entities, or even Spark DataFrames from distributed pipelines. Each format encodes the same conceptual information -- model inputs, model outputs, ground-truth expectations, and execution traces -- but uses different structural conventions.
The principle of evaluation dataset preparation addresses this heterogeneity by defining a canonical schema that every downstream consumer can rely on. The schema centres on a small set of well-known columns: an inputs column carrying the dictionary of values sent to the model, an optional outputs column with the model's response, an optional expectations column holding ground-truth data for comparison, an optional trace column containing execution trace objects, and an optional tags column for metadata. By normalising all incoming data to this schema before evaluation begins, the rest of the pipeline can operate without branching on data format.
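As a minimal sketch, the canonical schema can be pictured as one record per evaluation row. The `EvalRow` type, the `unknown_columns` helper, and the sample values below are illustrative assumptions rather than MLflow classes; only the five column names come from the description above.

```python
from typing import Any, TypedDict


class EvalRow(TypedDict, total=False):
    """Hypothetical record type mirroring the canonical columns."""

    inputs: dict[str, Any]        # values sent to the model
    outputs: Any                  # the model's response (optional)
    expectations: dict[str, Any]  # ground-truth data (optional)
    trace: Any                    # execution trace object (optional)
    tags: dict[str, str]          # metadata (optional)


CANONICAL_COLUMNS = ("inputs", "outputs", "expectations", "trace", "tags")

# Sample rows; the keys inside each dictionary are made up for illustration.
rows: list[EvalRow] = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for the ML lifecycle.",
        "expectations": {"expected_answer": "An open-source ML platform."},
        "tags": {"source": "hand-written"},
    },
    {"inputs": {"question": "What is a trace?"}},  # only the inputs column
]


def unknown_columns(row: dict[str, Any]) -> list[str]:
    """Return the keys of a row that are not part of the canonical schema."""
    return sorted(set(row) - set(CANONICAL_COLUMNS))
```

A helper like `unknown_columns` is one way to surface misnamed columns (for example `prompt` instead of `inputs`) before evaluation starts.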
A robust preparation step also validates the incoming data: it rejects empty datasets, ensures that at least one of inputs or trace is present, deserialises JSON-encoded columns when necessary, and extracts request-response pairs from trace objects when explicit inputs and outputs are absent. This defensive normalisation prevents subtle failures deep inside the scoring loop and gives clear, early error messages when the data contract is violated.
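The validation rules above can be sketched in plain Python over a list of row dictionaries. The function name `validate_rows`, the error messages, and the choice to treat any string in `inputs` or `expectations` as JSON are assumptions made for this example, not the behaviour of a specific library.

```python
import json
from typing import Any


def validate_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Check the data contract and decode JSON-encoded dictionary columns."""
    if not rows:
        raise ValueError("Evaluation dataset is empty.")

    cleaned = []
    for i, row in enumerate(rows):
        row = dict(row)  # shallow copy so the caller's row is not mutated
        if row.get("inputs") is None and row.get("trace") is None:
            raise ValueError(f"Row {i}: need at least one of 'inputs' or 'trace'.")
        for column in ("inputs", "expectations"):
            value = row.get(column)
            if isinstance(value, str):
                # Assumption: a string in these columns is a JSON-encoded dict.
                value = json.loads(value)
            if value is not None and not isinstance(value, dict):
                raise ValueError(f"Row {i}: '{column}' must be a dictionary.")
            if value is not None:
                row[column] = value
        cleaned.append(row)
    return cleaned
```

Failing here, with the row index in the message, is what gives the early and specific errors the paragraph describes, instead of a type error surfacing later inside a scorer.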
Usage
Apply this principle whenever assembling data for model evaluation. It is relevant when migrating data between storage formats (e.g., Spark to pandas), when replaying previously recorded traces through new scorers, or when constructing test datasets by hand during development. Any time evaluation data crosses a boundary between systems, explicit normalisation to the canonical schema prevents misinterpretation of column semantics.
Theoretical Basis
The core concept is schema normalisation: mapping disparate input representations into a single canonical form. In data engineering this is sometimes called a canonical data model pattern. The key invariants maintained by the normalisation are:
- Column presence: at least one of `inputs` or `trace` must exist.
- Column typing: `inputs` must be a dictionary (or JSON-deserialised to one), `expectations` must be a dictionary when present, and `trace` must be a Trace object.
- Non-emptiness: the dataset must contain at least one row.
- Derivation from trace: when a trace column is present but explicit inputs/outputs are missing, the normalisation step extracts them from the trace's root span.
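The last invariant, derivation from the trace, can be illustrated with stand-in types. `StubSpan`, `StubTrace`, and `fill_from_trace` are hypothetical names invented for this sketch; a real trace object exposes its root span through its own API, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubSpan:
    """Stand-in for a root span: what went in and what came out."""

    inputs: dict[str, Any]
    outputs: Any


@dataclass
class StubTrace:
    """Stand-in for a trace object that exposes its root span."""

    root_span: StubSpan


def fill_from_trace(row: dict[str, Any]) -> dict[str, Any]:
    """Fill missing inputs/outputs from the trace; keep explicit values."""
    row = dict(row)  # shallow copy so the caller's row is not mutated
    trace: Optional[StubTrace] = row.get("trace")
    if trace is None:
        return row
    if row.get("inputs") is None:
        row["inputs"] = trace.root_span.inputs
    if row.get("outputs") is None:
        row["outputs"] = trace.root_span.outputs
    return row
```

In this sketch explicit columns take precedence over values derived from the trace, so a hand-corrected output is not overwritten; that precedence is a design choice of the example.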
Pseudocode for the normalisation:
function normalise(data):
    df = coerce_to_dataframe(data)   # handles list[dict], Spark DF, Trace list, etc.
    assert len(df) > 0               # reject empty datasets
    assert "inputs" in df.columns or "trace" in df.columns
    df = deserialise_trace_column(df)
    df = extract_inputs_outputs_from_trace(df)
    df = extract_expectations_from_trace(df)
    return df
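The first step of the pseudocode, coercion to a single tabular shape, might look as follows in plain Python. This is a sketch under stated assumptions: `coerce_to_rows` is a hypothetical helper, a list of row dictionaries stands in for the DataFrame, and only three input shapes are handled (Spark and pandas inputs are left out to keep the example free of dependencies).

```python
from typing import Any


def coerce_to_rows(data: Any) -> list[dict[str, Any]]:
    """Turn a few common input shapes into a list of row dictionaries."""
    if isinstance(data, dict):
        # Column-oriented mapping, e.g. {"inputs": [...], "outputs": [...]}.
        lengths = {len(values) for values in data.values()}
        if len(lengths) > 1:
            raise ValueError("All columns must have the same length.")
        n_rows = lengths.pop() if lengths else 0
        return [
            {column: values[i] for column, values in data.items()}
            for i in range(n_rows)
        ]
    if isinstance(data, list):
        # Dictionaries are already rows; any other item is treated as a trace.
        return [
            dict(item) if isinstance(item, dict) else {"trace": item}
            for item in data
        ]
    raise TypeError(f"Unsupported evaluation data type: {type(data).__name__}")
```

Once every source has been reduced to the same row shape, the remaining steps (non-emptiness check, column presence check, extraction from traces) only need to be written once.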