Principle: Evidently Data Schema Definition
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Monitoring |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A data schema definition mechanism that maps column types and roles in datasets for correct evaluation processing.
Description
Data Schema Definition is the process of explicitly mapping column types (numerical, categorical, text, datetime) and roles (target, prediction, timestamp, ID) in a dataset before running evaluations. This bridges the gap between raw tabular data and Evidently's evaluation engine, which requires knowledge of column semantics to select appropriate metrics, drift detection methods, and statistical tests.
Without explicit schema definition, Evidently attempts auto-inference from pandas DataFrame dtypes, which may incorrectly classify columns (e.g., treating encoded categorical IDs as numerical). Explicit schema definition ensures:
- Correct metric selection (e.g., PSI for categorical drift vs. KS test for numerical drift)
- Proper column role assignment (target vs. feature vs. prediction)
- Text column identification for descriptor-based evaluation
- Task-specific configuration (classification, regression, ranking)
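The misclassification risk described above can be illustrated with a minimal pandas-only sketch (the `schema` dict here is a hypothetical illustration of the idea, not Evidently's actual API):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],   # encoded categorical ID, but dtype is int64
    "amount": [10.5, 22.0, 7.3],         # genuinely numerical
    "churned": [0, 1, 0],                # binary target, also stored as int64
})

# Dtype-based auto-inference would classify all three columns as numerical
auto_numeric = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
print(auto_numeric)  # ['customer_id', 'amount', 'churned']

# An explicit schema corrects the semantics before evaluation
schema = {
    "types": {"customer_id": "categorical", "amount": "numerical", "churned": "categorical"},
    "roles": {"churned": "target"},
}
```

With the explicit schema, a drift check on `customer_id` would use a categorical method instead of a numerical statistical test.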
Usage
Use this principle when preparing any dataset for Evidently evaluation. It is the mandatory first step before creating an Evidently Dataset object. Apply explicit schema definition when:
- The dataset contains mixed column types that cannot be reliably auto-inferred
- Text columns need to be identified for descriptor-based analysis
- ML task columns (target, prediction, probabilities) must be mapped
- Embedding columns need to be grouped
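A schema covering these cases might be sketched with a plain dataclass (the class and field names below are hypothetical illustrations of the pattern, not Evidently's own classes):

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSchema:
    """Hypothetical schema holder covering types, roles, text, and embeddings."""
    numerical: list
    categorical: list
    text: list            # columns identified for descriptor-based analysis
    target: str           # ML task column: ground truth
    prediction: str       # ML task column: model output
    embeddings: dict = field(default_factory=dict)  # grouped embedding columns

schema = ColumnSchema(
    numerical=["age", "tenure"],
    categorical=["plan"],
    text=["support_ticket"],
    target="churned",
    prediction="churn_pred",
    embeddings={"ticket_emb": ["emb_0", "emb_1", "emb_2"]},
)
```

Each of the four usage conditions above maps onto one part of the schema: mixed types into the type lists, text into `text`, task columns into `target`/`prediction`, and embedding groups into `embeddings`.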
Theoretical Basis
Data schema definition follows the metadata-driven processing pattern common in data engineering frameworks. The key idea is separating data content from data semantics:
```
# Pseudocode: schema-driven processing
schema = define_schema(
    column_types={col: type for col in columns},
    column_roles={col: role for col in columns},
    task_configs=[classification_config, regression_config],
)
dataset = wrap_data(raw_dataframe, schema)
# Now the evaluation engine knows HOW to process each column
```
This separation enables:
- Type-safe metric dispatch: Each metric knows which column types it supports
- Role-based evaluation: Target/prediction pairs are resolved automatically
- Multi-task support: A single dataset can define multiple evaluation tasks
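Type-safe metric dispatch can be sketched in a few lines (the metric registry and function below are hypothetical, showing the pattern rather than Evidently's internals):

```python
# Hypothetical registry: each metric declares which column types it supports
METRIC_SUPPORT = {
    "psi": {"categorical"},      # Population Stability Index for categorical drift
    "ks_test": {"numerical"},    # Kolmogorov-Smirnov test for numerical drift
}

def drift_metric_for(column, schema_types):
    """Pick a drift metric based on the column type declared in the schema."""
    col_type = schema_types[column]
    for metric, supported in METRIC_SUPPORT.items():
        if col_type in supported:
            return metric
    raise ValueError(f"no drift metric supports column type {col_type!r}")

types = {"customer_id": "categorical", "amount": "numerical"}
print(drift_metric_for("customer_id", types))  # psi
print(drift_metric_for("amount", types))       # ks_test
```

Because dispatch consults the schema rather than raw dtypes, the encoded-ID column from the Description section would correctly receive PSI instead of a KS test.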