
Principle:Evidentlyai Evidently Data Schema Definition

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Monitoring
Last Updated 2026-02-14 12:00 GMT

Overview

A mechanism for explicitly declaring column types and roles in a dataset so that the evaluation engine processes each column correctly.

Description

Data Schema Definition is the process of explicitly mapping column types (numerical, categorical, text, datetime) and roles (target, prediction, timestamp, ID) in a dataset before running evaluations. This bridges the gap between raw tabular data and Evidently's evaluation engine, which requires knowledge of column semantics to select appropriate metrics, drift detection methods, and statistical tests.

Without explicit schema definition, Evidently attempts auto-inference from pandas DataFrame dtypes, which may incorrectly classify columns (e.g., treating encoded categorical IDs as numerical). Explicit schema definition ensures:

  • Correct metric selection (e.g., PSI for categorical drift vs. KS test for numerical drift)
  • Proper column role assignment (target vs. feature vs. prediction)
  • Text column identification for descriptor-based evaluation
  • Task-specific configuration (classification, regression, ranking)
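The misclassification risk can be sketched in a few lines of plain Python. The helper names below are hypothetical (this is not the Evidently API): a naive dtype-style inference marks integer-encoded category IDs as numerical, and an explicit schema override corrects it.

```python
def infer_column_type(values):
    """Naive dtype-style inference: anything numeric becomes 'numerical'."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"

data = {
    "age": [23, 35, 41],          # genuinely numerical
    "store_id": [101, 102, 101],  # integer-encoded categorical ID
}

# Auto-inference guesses from values alone and gets store_id wrong.
inferred = {col: infer_column_type(vals) for col, vals in data.items()}

# An explicit schema overrides the bad guess while keeping correct ones.
explicit = {**inferred, "store_id": "categorical"}
```

With the wrong type, a drift check would apply a numerical test to `store_id`; the explicit override routes it to a categorical test instead.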

Usage

Use this principle when preparing any dataset for Evidently evaluation. It is the mandatory first step before creating an Evidently Dataset object. Apply explicit schema definition when:

  • The dataset contains mixed column types that cannot be reliably auto-inferred
  • Text columns need to be identified for descriptor-based analysis
  • ML task columns (target, prediction, probabilities) must be mapped
  • Embedding columns need to be grouped
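A minimal schema object covering the cases above might look as follows. The class and field names are illustrative assumptions, not Evidently's real classes: the point is that types and ML roles are declared up front and validated against the actual columns before any evaluation runs.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    # column name -> "numerical" | "categorical" | "text" | "datetime"
    column_types: dict
    # role ("target", "prediction", "timestamp", "id") -> column name
    column_roles: dict

    def validate(self, columns):
        """Fail fast if the schema references columns the data lacks."""
        missing = [c for c in self.column_types if c not in columns]
        missing += [c for c in self.column_roles.values() if c not in columns]
        if missing:
            raise ValueError(f"Schema references absent columns: {missing}")

schema = Schema(
    column_types={"age": "numerical", "churned": "categorical",
                  "score": "numerical", "review": "text"},
    column_roles={"target": "churned", "prediction": "score"},
)
schema.validate(["age", "churned", "score", "review"])  # passes silently
```

Validating at construction time keeps schema errors out of the evaluation run itself, where they would surface as confusing metric failures.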

Theoretical Basis

Data schema definition follows the metadata-driven processing pattern common in data engineering frameworks. The key idea is separating data content from data semantics:

# Pseudocode: schema-driven processing
schema = define_schema(
    column_types={col: col_type(col) for col in columns},
    column_roles={col: col_role(col) for col in columns},
    task_configs=[classification_config, regression_config],
)
dataset = wrap_data(raw_dataframe, schema)
# Now the evaluation engine knows HOW to process each column

This separation enables:

  • Type-safe metric dispatch: Each metric knows which column types it supports
  • Role-based evaluation: Target/prediction pairs are resolved automatically
  • Multi-task support: A single dataset can define multiple evaluation tasks
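Type-safe metric dispatch, the first benefit above, can be sketched with a small registry. The registry contents are hypothetical (not Evidently internals), but they mirror the PSI-vs-KS split mentioned earlier: each metric declares the column types it supports, and the engine selects one from the schema's declared type.

```python
# Each drift metric declares which column types it can handle.
METRICS = {
    "ks_test": {"numerical"},    # Kolmogorov-Smirnov for numerical drift
    "psi": {"categorical"},      # Population Stability Index for categories
}

def select_drift_metric(column_type):
    """Dispatch on the schema-declared type; reject unsupported types."""
    for name, supported in METRICS.items():
        if column_type in supported:
            return name
    raise ValueError(f"No drift metric supports type {column_type!r}")
```

For example, `select_drift_metric("numerical")` returns `"ks_test"`, while an undeclared type such as `"text"` raises immediately instead of producing a meaningless statistic.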

Related Pages

Implemented By

Uses Heuristic
