Principle: Haifengl Smile Data Transformation
Overview
Data Transformation is the principle of applying mathematical functions to DataFrame columns to prepare features for machine learning algorithms. In Smile, transformations are modeled as first-class objects that implement the Transform functional interface, enabling composable pipelines and invertible operations. The library supports standardization (z-score normalization), min-max scaling, robust scaling, row-wise normalization (L1, L2, L-infinity norms), max-absolute scaling, and Winsor scaling.
The core design insight is that transformations should be fit to training data and then applied uniformly to both training and test data, preventing data leakage. This fit-then-apply pattern mirrors the estimator-transformer pattern found in scikit-learn but is expressed through Java functional interfaces and composition.
Theoretical Basis
Why Transform?
Most ML algorithms assume that input features are on comparable scales. When features have widely different ranges, algorithms that rely on distance metrics (k-NN, SVM, k-means) or gradient-based optimization (neural networks, logistic regression) will be dominated by large-scale features. Feature scaling addresses this by mapping all features to a common range or distribution.
Standardization (Z-Score Normalization)
Standardization transforms each feature to have zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation.

After transformation: $\mathbb{E}[z] = 0$ and $\mathrm{Var}(z) = 1$.

The inverse transform recovers the original value: $x = z\sigma + \mu$.
In Smile, this is implemented by Standardizer.fit(data, columns).
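To make the fit-then-apply pattern concrete, here is a minimal standardizer in plain Java. The class and method names are illustrative, not Smile's Standardizer API; it only mirrors the math above.

```java
// Sketch of fit-then-apply z-score standardization (not Smile's implementation).
import java.util.Arrays;

public class StandardizerSketch {
    final double mean, std;

    // "Fit": estimate mean and standard deviation from training data only.
    StandardizerSketch(double[] train) {
        mean = Arrays.stream(train).average().orElse(0.0);
        double var = Arrays.stream(train)
                           .map(x -> (x - mean) * (x - mean))
                           .average().orElse(0.0);
        std = Math.sqrt(var);
    }

    // "Apply": z = (x - mean) / std, usable on training and test data alike.
    double apply(double x) { return (x - mean) / std; }

    // Inverse: x = z * std + mean.
    double invert(double z) { return z * std + mean; }

    public static void main(String[] args) {
        double[] train = {2.0, 4.0, 6.0, 8.0};       // mean = 5, std = sqrt(5)
        StandardizerSketch s = new StandardizerSketch(train);
        System.out.println(s.apply(5.0));            // prints 0.0 (the mean maps to zero)
        System.out.println(s.invert(s.apply(7.0)));  // round-trips back to ~7.0
    }
}
```

Fitting statistics on the training set and reusing them unchanged on test data is exactly what prevents the leakage described above.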
Min-Max Scaling
Min-max scaling linearly maps values to the interval $[0, 1]$:

$$x' = \frac{x - \min}{\max - \min}$$

where $\min$ and $\max$ are the observed minimum and maximum. Values are clipped to $[0, 1]$ if new data falls outside the training range.

The inverse: $x = x'(\max - \min) + \min$.
In Smile, this is implemented by Scaler.fit(data, columns).
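A minimal sketch of min-max scaling with clipping, again in plain Java with hypothetical names rather than Smile's Scaler itself:

```java
// Sketch of min-max scaling: x' = (x - min) / (max - min), clipped to [0, 1].
import java.util.Arrays;

public class MinMaxSketch {
    final double lo, hi;

    // "Fit": record the observed training minimum and maximum.
    MinMaxSketch(double[] train) {
        lo = Arrays.stream(train).min().orElse(0.0);
        hi = Arrays.stream(train).max().orElse(1.0);
    }

    // "Apply" with clipping, so unseen values outside [lo, hi] stay in [0, 1].
    double apply(double x) {
        double scaled = (x - lo) / (hi - lo);
        return Math.min(1.0, Math.max(0.0, scaled));
    }

    // Inverse (valid for values that were not clipped).
    double invert(double y) { return y * (hi - lo) + lo; }

    public static void main(String[] args) {
        MinMaxSketch m = new MinMaxSketch(new double[]{10.0, 20.0, 30.0});
        System.out.println(m.apply(20.0)); // prints 0.5
        System.out.println(m.apply(45.0)); // outside training range: clipped to 1.0
    }
}
```

Note that clipping makes the transform non-invertible for out-of-range inputs, which is why the inverse formula only applies to unclipped values.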
Robust Standardization
Robust standardization uses the median and interquartile range (IQR) instead of mean and standard deviation, making it resistant to outliers:

$$x' = \frac{x - \mathrm{median}}{Q_3 - Q_1}$$

where $Q_1$ and $Q_3$ are the first and third quartiles.
In Smile, this is implemented by RobustStandardizer.fit(data, columns).
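The outlier resistance can be seen in a small sketch. The quartiles below use a simple index-based convention for illustration; real libraries (including Smile) may interpolate differently.

```java
// Sketch of robust standardization: x' = (x - median) / IQR.
import java.util.Arrays;

public class RobustSketch {
    final double median, iqr;

    RobustSketch(double[] train) {
        double[] s = train.clone();
        Arrays.sort(s);
        median = s[s.length / 2];            // simple convention for odd-length data
        double q1 = s[s.length / 4];
        double q3 = s[(3 * s.length) / 4];
        iqr = q3 - q1;
    }

    double apply(double x) { return (x - median) / iqr; }

    public static void main(String[] args) {
        // The outlier 1000 barely moves the median and IQR,
        // whereas it would badly distort the mean and standard deviation.
        double[] train = {1, 2, 3, 4, 5, 6, 7, 8, 1000};
        RobustSketch r = new RobustSketch(train);
        System.out.println(r.apply(5.0)); // prints 0.0 (5 is the median)
    }
}
```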
Row-Wise Normalization
Normalizes each sample (row) $\mathbf{x}$ to unit norm, $\mathbf{x}' = \mathbf{x} / \|\mathbf{x}\|$, where the norm is one of:

- L1 norm: $\|\mathbf{x}\|_1 = \sum_i |x_i|$
- L2 norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
- L-infinity norm: $\|\mathbf{x}\|_\infty = \max_i |x_i|$
In Smile, this is implemented by new Normalizer(Norm.L2, columns).
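The three norms can be sketched in a few lines of plain Java. Unlike the column scalers above, row-wise normalization is stateless: there is nothing to fit on training data.

```java
// Sketch of row-wise (per-sample) normalization to unit L1, L2, or L-inf norm.
import java.util.Arrays;

public class NormalizerSketch {
    enum Norm { L1, L2, LINF }

    static double[] normalize(double[] row, Norm norm) {
        double n = switch (norm) {
            case L1   -> Arrays.stream(row).map(Math::abs).sum();
            case L2   -> Math.sqrt(Arrays.stream(row).map(x -> x * x).sum());
            case LINF -> Arrays.stream(row).map(Math::abs).max().orElse(0.0);
        };
        return Arrays.stream(row).map(x -> x / n).toArray();
    }

    public static void main(String[] args) {
        double[] row = {3.0, 4.0};                                   // a 3-4-5 triangle
        System.out.println(Arrays.toString(normalize(row, Norm.L2)));   // [0.6, 0.8]
        System.out.println(Arrays.toString(normalize(row, Norm.LINF))); // [0.75, 1.0]
    }
}
```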
Max-Absolute Scaling
Scales each feature by its maximum absolute value, mapping to $[-1, 1]$:

$$x' = \frac{x}{\max_i |x_i|}$$
In Smile, this is implemented by MaxAbsScaler.fit(data, columns).
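A minimal sketch of max-absolute scaling (illustrative names, not Smile's code). Because it only divides by a constant, zeros stay zero, which preserves sparsity:

```java
// Sketch of max-absolute scaling: x' = x / max|x|, mapping into [-1, 1].
import java.util.Arrays;

public class MaxAbsSketch {
    final double maxAbs;

    // "Fit": record the largest absolute value seen in training data.
    MaxAbsSketch(double[] train) {
        maxAbs = Arrays.stream(train).map(Math::abs).max().orElse(1.0);
    }

    double apply(double x) { return x / maxAbs; }

    public static void main(String[] args) {
        MaxAbsSketch m = new MaxAbsSketch(new double[]{-8.0, 2.0, 4.0});
        System.out.println(m.apply(4.0));  // prints 0.5
        System.out.println(m.apply(-8.0)); // prints -1.0
    }
}
```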
Pipeline Composition
Transforms in Smile are composable through the Transform functional interface, which extends Function<Tuple, Tuple>. Two composition mechanisms are provided:
Sequential Composition
The andThen() method chains transforms: t = t1.andThen(t2), meaning $t(x) = t_2(t_1(x))$.
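Since Smile's Transform extends Java's Function interface, composition works exactly as with plain functions. A sketch over Double (Smile composes over Tuple):

```java
// Sequential composition with Function.andThen(): left-to-right application order.
import java.util.function.Function;

public class ComposeSketch {
    public static void main(String[] args) {
        Function<Double, Double> scale = x -> x / 10.0;  // first stage
        Function<Double, Double> shift = x -> x - 0.5;   // second stage
        Function<Double, Double> pipeline = scale.andThen(shift);
        // (scale andThen shift)(x) = shift(scale(x))
        System.out.println(pipeline.apply(20.0)); // prints 1.5
    }
}
```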
Pipeline Construction
The static Transform.pipeline(transforms...) method creates a composed transform from multiple stages, $T = T_n \circ \cdots \circ T_2 \circ T_1$, applying the stages in the order given.
Fit Pipeline
The static Transform.fit(data, trainers...) method implements the fit-then-apply pattern for multi-stage pipelines. Each trainer receives the data as transformed by all previous stages:
- Fit $T_1$ on the original data $X$
- Apply to get $X_1 = T_1(X)$
- Fit $T_2$ on $X_1$
- Compose: $T = T_2 \circ T_1$ (that is, T1.andThen(T2))
- Continue for all trainers
This ensures that each transformer sees data in the correct scale from previous stages.
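The steps above can be sketched generically in plain Java, with each "trainer" being a function from data to a fitted transform. Names are hypothetical; Smile's Transform.fit operates on DataFrames rather than double arrays.

```java
// Sketch of the fit-pipeline pattern: each trainer is fit on data already
// transformed by the stages before it, then all stages are composed.
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

public class FitPipelineSketch {
    static DoubleUnaryOperator fit(double[] data,
                                   List<Function<double[], DoubleUnaryOperator>> trainers) {
        DoubleUnaryOperator pipeline = DoubleUnaryOperator.identity();
        double[] current = data.clone();
        for (Function<double[], DoubleUnaryOperator> trainer : trainers) {
            DoubleUnaryOperator stage = trainer.apply(current);    // fit on transformed data
            current = Arrays.stream(current).map(stage).toArray(); // feed the next trainer
            pipeline = pipeline.andThen(stage);                    // compose the stages
        }
        return pipeline;
    }

    static DoubleUnaryOperator demo() {
        // Stage 1: min-max to [0, 1]. Stage 2: center on the post-scaling mean.
        Function<double[], DoubleUnaryOperator> scaler = d -> {
            double lo = Arrays.stream(d).min().orElse(0.0);
            double hi = Arrays.stream(d).max().orElse(1.0);
            return x -> (x - lo) / (hi - lo);
        };
        Function<double[], DoubleUnaryOperator> centerer = d -> {
            double mean = Arrays.stream(d).average().orElse(0.0);
            return x -> x - mean;
        };
        return fit(new double[]{0, 5, 10}, List.of(scaler, centerer));
    }

    public static void main(String[] args) {
        // 10 scales to 1.0, then is centered by the scaled mean 0.5.
        System.out.println(demo().applyAsDouble(10.0)); // prints 0.5
    }
}
```

The centerer sees the scaled values (mean 0.5), not the raw values (mean 5), which is the point of fitting each stage on the output of the previous one.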
Invertibility
Transforms that implement InvertibleTransform support the inverse operation: $t^{-1}(t(x)) = x$.
This is essential for:
- Interpreting predictions -- Converting model outputs back to the original scale.
- Visualization -- Displaying results in human-readable units.
- Post-processing -- Undoing normalization after prediction.
The column-level implementation stores both the forward function and its inverse for each transformed column.
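The store-both-directions idea can be sketched as a pair of function references; the class below is a hypothetical illustration, not Smile's InvertibleTransform.

```java
// Sketch of an invertible transform that stores both the forward function
// and its inverse, so that invert(apply(x)) == x.
import java.util.function.DoubleUnaryOperator;

public class InvertibleSketch {
    final DoubleUnaryOperator forward, inverse;

    InvertibleSketch(DoubleUnaryOperator forward, DoubleUnaryOperator inverse) {
        this.forward = forward;
        this.inverse = inverse;
    }

    double apply(double x)  { return forward.applyAsDouble(x); }
    double invert(double y) { return inverse.applyAsDouble(y); }

    public static void main(String[] args) {
        // Standardization with mean 5, std 2: z = (x - 5) / 2 and x = 2z + 5.
        InvertibleSketch t = new InvertibleSketch(x -> (x - 5.0) / 2.0,
                                                  z -> z * 2.0 + 5.0);
        System.out.println(t.invert(t.apply(9.0))); // round-trips to 9.0
    }
}
```

Storing the inverse alongside the forward function is what lets a fitted pipeline map model predictions back to the original units without refitting anything.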
Relationship to the Data Loading Pipeline
Data Transformation is the fourth stage of the Smile Data Loading Pipeline:
- File Data Loading -- Read data from files.
- DataFrame Inspection -- Examine structure and metadata.
- Column Selection and Filtering -- Select relevant columns.
- Data Transformation -- Normalize and scale features. (current)
- Numerical Conversion -- Convert to numerical arrays/matrices.
Transformation follows column selection (which determines which features to include) and precedes numerical conversion (which produces the final double[][] or DenseMatrix for algorithms).
Metadata
| Property | Value |
|---|---|
| Domains | Data_Engineering, ETL |
| Workflow | Data_Loading_Pipeline |
| Stage | 4 of 5 |
| Last Updated | 2026-02-08 22:00 GMT |