Principle: Haifengl Smile Data Transformation
Overview
Data Transformation is the principle of applying mathematical functions to DataFrame columns to prepare features for machine learning algorithms. In Smile, transformations are modeled as first-class objects that implement the Transform functional interface, enabling composable pipelines and invertible operations. The library supports standardization (z-score normalization), min-max scaling, robust scaling, row-wise normalization (L1, L2, L-infinity norms), max-absolute scaling, and Winsor scaling.
The core design insight is that transformations should be fit to training data and then applied uniformly to both training and test data, preventing data leakage. This fit-then-apply pattern mirrors the estimator-transformer pattern found in scikit-learn but is expressed through Java functional interfaces and composition.
Theoretical Basis
Why Transform?
Most ML algorithms assume that input features are on comparable scales. When features have widely different ranges, algorithms that rely on distance metrics (k-NN, SVM, k-means) or gradient-based optimization (neural networks, logistic regression) will be dominated by large-scale features. Feature scaling addresses this by mapping all features to a common range or distribution.
Standardization (Z-Score Normalization)
Standardization transforms each feature to have zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation.

After transformation: $\mathbb{E}[z] = 0$ and $\mathrm{Var}(z) = 1$.

The inverse transform recovers the original value: $x = z\sigma + \mu$.
In Smile, this is implemented by Standardizer.fit(data, columns).
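To make the fit-then-apply pattern concrete, here is a minimal standardizer in plain Java. The class and method names are illustrative, not Smile's Standardizer API; it only mirrors the math above.

```java
// Sketch of fit-then-apply z-score standardization (not Smile's implementation).
import java.util.Arrays;

public class StandardizerSketch {
    final double mean, std;

    // "Fit": estimate mean and standard deviation from training data only.
    StandardizerSketch(double[] train) {
        mean = Arrays.stream(train).average().orElse(0.0);
        double var = Arrays.stream(train)
                           .map(x -> (x - mean) * (x - mean))
                           .average().orElse(0.0);
        std = Math.sqrt(var);
    }

    // "Apply": z = (x - mean) / std, usable on training and test data alike.
    double apply(double x) { return (x - mean) / std; }

    // Inverse: x = z * std + mean.
    double invert(double z) { return z * std + mean; }

    public static void main(String[] args) {
        double[] train = {2.0, 4.0, 6.0, 8.0};       // mean = 5, std = sqrt(5)
        StandardizerSketch s = new StandardizerSketch(train);
        System.out.println(s.apply(5.0));            // prints 0.0 (the mean maps to zero)
        System.out.println(s.invert(s.apply(7.0)));  // round-trips back to ~7.0
    }
}
```

Fitting statistics on the training set and reusing them unchanged on test data is exactly what prevents the leakage described above.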
Min-Max Scaling
Min-max scaling linearly maps values to the interval $[0, 1]$:

$$x' = \frac{x - \min}{\max - \min}$$

where $\min$ and $\max$ are the observed minimum and maximum. Values are clipped to $[0, 1]$ if new data falls outside the training range.

The inverse: $x = x'(\max - \min) + \min$.
In Smile, this is implemented by Scaler.fit(data, columns).
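A minimal sketch of min-max scaling with clipping, again in plain Java with hypothetical names rather than Smile's Scaler itself:

```java
// Sketch of min-max scaling: x' = (x - min) / (max - min), clipped to [0, 1].
import java.util.Arrays;

public class MinMaxSketch {
    final double lo, hi;

    // "Fit": record the observed training minimum and maximum.
    MinMaxSketch(double[] train) {
        lo = Arrays.stream(train).min().orElse(0.0);
        hi = Arrays.stream(train).max().orElse(1.0);
    }

    // "Apply" with clipping, so unseen values outside [lo, hi] stay in [0, 1].
    double apply(double x) {
        double scaled = (x - lo) / (hi - lo);
        return Math.min(1.0, Math.max(0.0, scaled));
    }

    // Inverse (valid for values that were not clipped).
    double invert(double y) { return y * (hi - lo) + lo; }

    public static void main(String[] args) {
        MinMaxSketch m = new MinMaxSketch(new double[]{10.0, 20.0, 30.0});
        System.out.println(m.apply(20.0)); // prints 0.5
        System.out.println(m.apply(45.0)); // outside training range: clipped to 1.0
    }
}
```

Note that clipping makes the transform non-invertible for out-of-range inputs, which is why the inverse formula only applies to unclipped values.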
Robust Standardization
Robust standardization uses the median and interquartile range (IQR) instead of mean and standard deviation, making it resistant to outliers:

$$x' = \frac{x - \mathrm{median}}{Q_3 - Q_1}$$

where $Q_1$ and $Q_3$ are the first and third quartiles.
In Smile, this is implemented by RobustStandardizer.fit(data, columns).
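The outlier resistance can be seen in a small sketch. The quartiles below use a simple index-based convention for illustration; real libraries (including Smile) may interpolate differently.

```java
// Sketch of robust standardization: x' = (x - median) / IQR.
import java.util.Arrays;

public class RobustSketch {
    final double median, iqr;

    RobustSketch(double[] train) {
        double[] s = train.clone();
        Arrays.sort(s);
        median = s[s.length / 2];            // simple convention for odd-length data
        double q1 = s[s.length / 4];
        double q3 = s[(3 * s.length) / 4];
        iqr = q3 - q1;
    }

    double apply(double x) { return (x - median) / iqr; }

    public static void main(String[] args) {
        // The outlier 1000 barely moves the median and IQR,
        // whereas it would badly distort the mean and standard deviation.
        double[] train = {1, 2, 3, 4, 5, 6, 7, 8, 1000};
        RobustSketch r = new RobustSketch(train);
        System.out.println(r.apply(5.0)); // prints 0.0 (5 is the median)
    }
}
```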
Row-Wise Normalization
Normalizes each sample (row) $\mathbf{x}$ to unit norm, $\mathbf{x}' = \mathbf{x} / \|\mathbf{x}\|$, where the norm is one of:

- L1 norm: $\|\mathbf{x}\|_1 = \sum_i |x_i|$
- L2 norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
- L-infinity norm: $\|\mathbf{x}\|_\infty = \max_i |x_i|$
In Smile, this is implemented by new Normalizer(Norm.L2, columns).
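The three norms can be sketched in a few lines of plain Java. Unlike the column scalers above, row-wise normalization is stateless: there is nothing to fit on training data.

```java
// Sketch of row-wise (per-sample) normalization to unit L1, L2, or L-inf norm.
import java.util.Arrays;

public class NormalizerSketch {
    enum Norm { L1, L2, LINF }

    static double[] normalize(double[] row, Norm norm) {
        double n = switch (norm) {
            case L1   -> Arrays.stream(row).map(Math::abs).sum();
            case L2   -> Math.sqrt(Arrays.stream(row).map(x -> x * x).sum());
            case LINF -> Arrays.stream(row).map(Math::abs).max().orElse(0.0);
        };
        return Arrays.stream(row).map(x -> x / n).toArray();
    }

    public static void main(String[] args) {
        double[] row = {3.0, 4.0};                                   // a 3-4-5 triangle
        System.out.println(Arrays.toString(normalize(row, Norm.L2)));   // [0.6, 0.8]
        System.out.println(Arrays.toString(normalize(row, Norm.LINF))); // [0.75, 1.0]
    }
}
```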
Max-Absolute Scaling
Scales each feature by its maximum absolute value, mapping to $[-1, 1]$:

$$x' = \frac{x}{\max_i |x_i|}$$
In Smile, this is implemented by MaxAbsScaler.fit(data, columns).
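A minimal sketch of max-absolute scaling (illustrative names, not Smile's code). Because it only divides by a constant, zeros stay zero, which preserves sparsity:

```java
// Sketch of max-absolute scaling: x' = x / max|x|, mapping into [-1, 1].
import java.util.Arrays;

public class MaxAbsSketch {
    final double maxAbs;

    // "Fit": record the largest absolute value seen in training data.
    MaxAbsSketch(double[] train) {
        maxAbs = Arrays.stream(train).map(Math::abs).max().orElse(1.0);
    }

    double apply(double x) { return x / maxAbs; }

    public static void main(String[] args) {
        MaxAbsSketch m = new MaxAbsSketch(new double[]{-8.0, 2.0, 4.0});
        System.out.println(m.apply(4.0));  // prints 0.5
        System.out.println(m.apply(-8.0)); // prints -1.0
    }
}
```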
Pipeline Composition
Transforms in Smile are composable through the Transform functional interface, which extends Function<Tuple, Tuple>. Two composition mechanisms are provided:
Sequential Composition
The andThen() method chains transforms: t = t1.andThen(t2), meaning $t(x) = t_2(t_1(x))$.
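Since Smile's Transform extends Java's Function interface, composition works exactly as with plain functions. A sketch over Double (Smile composes over Tuple):

```java
// Sequential composition with Function.andThen(): left-to-right application order.
import java.util.function.Function;

public class ComposeSketch {
    public static void main(String[] args) {
        Function<Double, Double> scale = x -> x / 10.0;  // first stage
        Function<Double, Double> shift = x -> x - 0.5;   // second stage
        Function<Double, Double> pipeline = scale.andThen(shift);
        // (scale andThen shift)(x) = shift(scale(x))
        System.out.println(pipeline.apply(20.0)); // prints 1.5
    }
}
```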
Pipeline Construction
The static Transform.pipeline(transforms...) method creates a composed transform from multiple stages, $T = T_n \circ \cdots \circ T_2 \circ T_1$, applying the stages in the order given.
Fit Pipeline
The static Transform.fit(data, trainers...) method implements the fit-then-apply pattern for multi-stage pipelines. Each trainer receives the data as transformed by all previous stages:
- Fit $T_1$ on the original data $X$
- Apply to get $X_1 = T_1(X)$
- Fit $T_2$ on $X_1$
- Compose: $T = T_2 \circ T_1$ (that is, T1.andThen(T2))
- Continue for all trainers
This ensures that each transformer sees data in the correct scale from previous stages.
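The steps above can be sketched generically in plain Java, with each "trainer" being a function from data to a fitted transform. Names are hypothetical; Smile's Transform.fit operates on DataFrames rather than double arrays.

```java
// Sketch of the fit-pipeline pattern: each trainer is fit on data already
// transformed by the stages before it, then all stages are composed.
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

public class FitPipelineSketch {
    static DoubleUnaryOperator fit(double[] data,
                                   List<Function<double[], DoubleUnaryOperator>> trainers) {
        DoubleUnaryOperator pipeline = DoubleUnaryOperator.identity();
        double[] current = data.clone();
        for (Function<double[], DoubleUnaryOperator> trainer : trainers) {
            DoubleUnaryOperator stage = trainer.apply(current);    // fit on transformed data
            current = Arrays.stream(current).map(stage).toArray(); // feed the next trainer
            pipeline = pipeline.andThen(stage);                    // compose the stages
        }
        return pipeline;
    }

    static DoubleUnaryOperator demo() {
        // Stage 1: min-max to [0, 1]. Stage 2: center on the post-scaling mean.
        Function<double[], DoubleUnaryOperator> scaler = d -> {
            double lo = Arrays.stream(d).min().orElse(0.0);
            double hi = Arrays.stream(d).max().orElse(1.0);
            return x -> (x - lo) / (hi - lo);
        };
        Function<double[], DoubleUnaryOperator> centerer = d -> {
            double mean = Arrays.stream(d).average().orElse(0.0);
            return x -> x - mean;
        };
        return fit(new double[]{0, 5, 10}, List.of(scaler, centerer));
    }

    public static void main(String[] args) {
        // 10 scales to 1.0, then is centered by the scaled mean 0.5.
        System.out.println(demo().applyAsDouble(10.0)); // prints 0.5
    }
}
```

The centerer sees the scaled values (mean 0.5), not the raw values (mean 5), which is the point of fitting each stage on the output of the previous one.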
Invertibility
Transforms that implement InvertibleTransform support the inverse operation: $t^{-1}(t(x)) = x$.
This is essential for:
- Interpreting predictions -- Converting model outputs back to the original scale.
- Visualization -- Displaying results in human-readable units.
- Post-processing -- Undoing normalization after prediction.
The column-level implementation stores both the forward function and its inverse for each transformed column.
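The store-both-directions idea can be sketched as a pair of function references; the class below is a hypothetical illustration, not Smile's InvertibleTransform.

```java
// Sketch of an invertible transform that stores both the forward function
// and its inverse, so that invert(apply(x)) == x.
import java.util.function.DoubleUnaryOperator;

public class InvertibleSketch {
    final DoubleUnaryOperator forward, inverse;

    InvertibleSketch(DoubleUnaryOperator forward, DoubleUnaryOperator inverse) {
        this.forward = forward;
        this.inverse = inverse;
    }

    double apply(double x)  { return forward.applyAsDouble(x); }
    double invert(double y) { return inverse.applyAsDouble(y); }

    public static void main(String[] args) {
        // Standardization with mean 5, std 2: z = (x - 5) / 2 and x = 2z + 5.
        InvertibleSketch t = new InvertibleSketch(x -> (x - 5.0) / 2.0,
                                                  z -> z * 2.0 + 5.0);
        System.out.println(t.invert(t.apply(9.0))); // round-trips to 9.0
    }
}
```

Storing the inverse alongside the forward function is what lets a fitted pipeline map model predictions back to the original units without refitting anything.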
Relationship to the Data Loading Pipeline
Data Transformation is the fourth stage of the Smile Data Loading Pipeline:
- File Data Loading -- Read data from files.
- DataFrame Inspection -- Examine structure and metadata.
- Column Selection and Filtering -- Select relevant columns.
- Data Transformation -- Normalize and scale features. (current)
- Numerical Conversion -- Convert to numerical arrays/matrices.
Transformation follows column selection (which determines which features to include) and precedes numerical conversion (which produces the final double[][] or DenseMatrix for algorithms).
Metadata
| Property | Value |
|---|---|
| Domains | Data_Engineering, ETL |
| Workflow | Data_Loading_Pipeline |
| Stage | 4 of 5 |
| Last Updated | 2026-02-08 22:00 GMT |