Principle: InterpretML Data Preparation and Validation
| Metadata | |
|---|---|
| Sources | InterpretML, InterpretML Docs |
| Domains | Data_Preprocessing, Machine_Learning |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A data validation and normalization procedure that converts heterogeneous input formats into standardized numerical arrays suitable for machine learning model training.
Description
Data Preparation and Validation ensures that raw user-provided data (DataFrames, lists, arrays, sparse matrices, masked arrays) is cleaned, validated, and converted into a consistent internal representation. It checks for dimension mismatches, handles missing values, identifies feature types (continuous, nominal, ordinal), and resolves init_scores from models or arrays. This step is critical because Explainable Boosting Machines (EBMs) require strict dimensional consistency among the features X, the targets y, and the sample weights.
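As a concrete illustration of that consistency requirement, here is one way such a check could be written. This is a minimal sketch, not InterpretML's actual internals; the name check_sample_alignment and its error messages are hypothetical.

    import numpy as np

    def check_sample_alignment(X, y, sample_weight=None):
        # Hypothetical sketch: enforce matching sample counts across inputs.
        X, y = np.asarray(X), np.asarray(y)
        if X.ndim != 2 or y.ndim != 1:
            raise ValueError("X must be 2D (samples x features); y must be 1D")
        if y.shape[0] != X.shape[0]:
            raise ValueError(f"X has {X.shape[0]} samples but y has {y.shape[0]}")
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=float)
            if sample_weight.shape != (X.shape[0],):
                raise ValueError("sample_weight needs one entry per sample")
        return X, y, sample_weight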
Usage
Use this principle at the beginning of any EBM training pipeline when raw user data needs to be transformed into validated numpy arrays. It should be applied whenever data enters the system from external sources where format and quality are not guaranteed.
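In practice, this validation is exercised implicitly through the public API: fit accepts heterogeneous input and performs the cleaning internally. A minimal example on toy data, assuming the interpret package is installed:

    import pandas as pd
    from interpret.glassbox import ExplainableBoostingClassifier

    # Heterogeneous input: a DataFrame mixing numeric and string columns.
    X = pd.DataFrame({
        "age":  [25, 32, 47, 51, 38, 29, 44, 60],
        "city": ["NY", "SF", "NY", "LA", "SF", "LA", "NY", "SF"],
    })
    y = [0, 1, 0, 1, 1, 0, 0, 1]

    # feature_types overrides automatic type detection where desired.
    ebm = ExplainableBoostingClassifier(feature_types=["continuous", "nominal"])
    ebm.fit(X, y)  # validation and conversion happen inside fit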
Theoretical Basis
Data preparation follows a defensive validation pattern (a sketch of the type-detection and edge-case steps follows the list):
- Accept any array-like input (list, tuple, DataFrame, Series, sparse matrix, masked array)
- Validate dimensionality (1D for targets/weights, 2D for features)
- Detect and encode feature types (continuous floats, nominal categories)
- Ensure consistent sample counts across X, y, and sample_weight
- Handle edge cases: NaN values, infinity, empty arrays, single-sample inputs
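The sketch below illustrates the type-detection and infinity-rejection steps under stated assumptions. infer_feature_type and reject_infinities are hypothetical helper names, and a real implementation would handle many more dtypes.

    import numpy as np
    import pandas as pd

    def infer_feature_type(col: pd.Series) -> str:
        # Hypothetical sketch of per-column feature type detection.
        if pd.api.types.is_numeric_dtype(col) and not pd.api.types.is_bool_dtype(col):
            return "continuous"
        if isinstance(col.dtype, pd.CategoricalDtype) and col.cat.ordered:
            return "ordinal"
        return "nominal"  # strings, unordered categoricals, booleans, ...

    def reject_infinities(values: np.ndarray) -> None:
        # NaN is a legitimate missing-value marker; infinity is not.
        if np.isinf(values).any():
            raise ValueError("input contains +/- infinity")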
Pseudocode
A runnable rendering of the validation routine (numpy and pandas assumed):

    import numpy as np
    import pandas as pd

    def validate_dimensions(data, expected_ndim):
        if isinstance(data, (pd.DataFrame, pd.Series)):
            data = data.to_numpy()  # extract the underlying numpy array
        if isinstance(data, np.ma.MaskedArray):
            data = data.astype(float).filled(np.nan)  # masked entries -> NaN
        data = np.asarray(data)
        if data.ndim != expected_ndim:  # 1 for targets/weights, 2 for features
            raise ValueError(f"expected {expected_ndim}D, got {data.ndim}D")
        return data
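Example usage, calling validate_dimensions from above on a DataFrame and a masked array:

    import numpy as np
    import pandas as pd

    X = validate_dimensions(pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}), expected_ndim=2)
    y = validate_dimensions(np.ma.masked_values([0.0, -999.0], -999.0), expected_ndim=1)
    print(X.shape, y)  # (2, 2) [ 0. nan]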