
Principle:Haifengl Smile Data Transformation

From Leeroopedia


Overview

Data Transformation is the principle of applying mathematical functions to DataFrame columns to prepare features for machine learning algorithms. In Smile, transformations are modeled as first-class objects that implement the Transform functional interface, enabling composable pipelines and invertible operations. The library supports standardization (z-score normalization), min-max scaling, robust scaling, row-wise normalization (L1, L2, L-infinity norms), max-absolute scaling, and Winsor scaling.

The core design insight is that transformations should be fit to training data and then applied uniformly to both training and test data, preventing data leakage. This fit-then-apply pattern mirrors the estimator-transformer pattern found in scikit-learn but is expressed through Java functional interfaces and composition.

Theoretical Basis

Why Transform?

Most ML algorithms assume that input features are on comparable scales. When features have widely different ranges, algorithms that rely on distance metrics (k-NN, SVM, k-means) or gradient-based optimization (neural networks, logistic regression) will be dominated by large-scale features. Feature scaling addresses this by mapping all features to a common range or distribution.

Standardization (Z-Score Normalization)

Standardization transforms each feature to have zero mean and unit variance:

$$z_i = \frac{x_i - \mu}{\sigma}$$

where $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean and $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}$ is the sample standard deviation.

After transformation: $\mathbb{E}[z] = 0$ and $\mathrm{Var}(z) = 1$.

The inverse transform recovers the original value: $x_i = z_i \sigma + \mu$.

In Smile, this is implemented by Standardizer.fit(data, columns).
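The fit-then-apply pattern and the inverse relationship can be sketched in plain Java. This is a minimal illustration of the math only; the ZScore class below is hypothetical, not Smile's Standardizer.

```java
import java.util.Arrays;

public class ZScore {
    final double mu, sigma;

    // "Fit": estimate the mean and the (n-1)-denominator standard
    // deviation from the training data only.
    ZScore(double[] train) {
        double m = Arrays.stream(train).average().orElse(0.0);
        double ss = Arrays.stream(train).map(x -> (x - m) * (x - m)).sum();
        this.mu = m;
        this.sigma = Math.sqrt(ss / (train.length - 1));
    }

    double apply(double x)  { return (x - mu) / sigma; }  // z_i = (x_i - mu) / sigma
    double invert(double z) { return z * sigma + mu; }    // x_i = z_i * sigma + mu
}
```

Fitting on the training data and reusing the same mu and sigma for test data is exactly what prevents the data leakage described in the Overview.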

Min-Max Scaling

Min-max scaling linearly maps values to the interval [0,1]:

$$x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the observed minimum and maximum. Values are clipped to $[0, 1]$ if new data falls outside the training range.

The inverse: $x_i = x'_i (x_{\max} - x_{\min}) + x_{\min}$.

In Smile, this is implemented by Scaler.fit(data, columns).
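A plain-Java sketch of the fit, clip, and inverse behavior described above (the MinMax class is illustrative, not Smile's Scaler):

```java
public class MinMax {
    final double min, max;

    // "Fit": record the observed minimum and maximum of the training data.
    MinMax(double[] train) {
        double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
        for (double x : train) { lo = Math.min(lo, x); hi = Math.max(hi, x); }
        this.min = lo;
        this.max = hi;
    }

    // x' = (x - min) / (max - min), clipped to [0, 1] for out-of-range data.
    double apply(double x) {
        double s = (x - min) / (max - min);
        return Math.max(0.0, Math.min(1.0, s));
    }

    double invert(double s) { return s * (max - min) + min; }
}
```

Note that because of clipping, invert(apply(x)) only round-trips for values inside the training range.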

Robust Standardization

Robust standardization uses the median and interquartile range (IQR) instead of mean and standard deviation, making it resistant to outliers:

$$z_i = \frac{x_i - \mathrm{median}}{Q_3 - Q_1}$$

where $Q_1$ and $Q_3$ are the first and third quartiles.

In Smile, this is implemented by RobustStandardizer.fit(data, columns).
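The median/IQR computation can be sketched as follows. This is illustrative code, not Smile's RobustStandardizer, and the linear-interpolation quantile used here is only one of several common conventions:

```java
import java.util.Arrays;

public class RobustScale {
    final double median, iqr;

    RobustScale(double[] train) {
        double[] sorted = train.clone();
        Arrays.sort(sorted);
        median = quantile(sorted, 0.50);
        iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);  // Q3 - Q1
    }

    // Linear-interpolation quantile on pre-sorted data.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    double apply(double x) { return (x - median) / iqr; }
}
```

With training data {1, 2, 3, 4, 100}, the outlier 100 shifts the mean and standard deviation drastically but leaves the median (3) and IQR (2) untouched, which is the point of robust standardization.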

Row-Wise Normalization

Normalizes each sample (row) to unit norm:

  • L1 norm: $x'_i = \dfrac{x_i}{\sum_j |x_j|}$
  • L2 norm: $x'_i = \dfrac{x_i}{\sqrt{\sum_j x_j^2}}$
  • L-infinity norm: $x'_i = \dfrac{x_i}{\max_j |x_j|}$

In Smile, this is implemented by new Normalizer(Norm.L2, columns).
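The three norms can be sketched with a stand-alone method (illustrative only; Smile's Normalizer operates on Tuple rows rather than raw arrays):

```java
public class RowNorm {
    enum Norm { L1, L2, LINF }

    // Divide every element of the row by the chosen norm of the row.
    static double[] normalize(double[] row, Norm norm) {
        double d = 0.0;
        switch (norm) {
            case L1:
                for (double x : row) d += Math.abs(x);
                break;
            case L2:
                for (double x : row) d += x * x;
                d = Math.sqrt(d);
                break;
            case LINF:
                for (double x : row) d = Math.max(d, Math.abs(x));
                break;
        }
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) out[i] = row[i] / d;
        return out;
    }
}
```

Unlike the column-wise scalers above, there is nothing to fit: each row is normalized independently, so no training statistics are stored.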

Max-Absolute Scaling

Scales each feature by its maximum absolute value, mapping to $[-1, 1]$:

$$x'_i = \frac{x_i}{\max_j |x_j|}$$

In Smile, this is implemented by MaxAbsScaler.fit(data, columns).
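A plain-Java sketch (illustrative, not Smile's MaxAbsScaler):

```java
public class MaxAbs {
    final double scale;

    // "Fit": record the maximum absolute value seen in the training data.
    MaxAbs(double[] train) {
        double m = 0.0;
        for (double x : train) m = Math.max(m, Math.abs(x));
        this.scale = m;
    }

    double apply(double x)  { return x / scale; }  // x' = x / max|x_j|
    double invert(double s) { return s * scale; }
}
```

Because zero maps to zero, max-absolute scaling preserves sparsity, which makes it a common choice for sparse feature matrices.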

Pipeline Composition

Transforms in Smile are composable through the Transform functional interface, which extends Function<Tuple, Tuple>. Two composition mechanisms are provided:

Sequential Composition

The andThen() method chains transforms:

$$T_{\text{composed}} = T_2 \circ T_1$$

meaning $T_{\text{composed}}(x) = T_2(T_1(x))$.

Pipeline Construction

The static Transform.pipeline(transforms...) method creates a composed transform from multiple stages:

$$T_{\text{pipeline}} = T_n \circ T_{n-1} \circ \cdots \circ T_1$$
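Since Transform extends Function<Tuple, Tuple>, both composition mechanisms reduce to ordinary function chaining. The sketch below shows the same idea with scalar functions, using DoubleUnaryOperator as a stand-in for Transform (the pipeline helper here is illustrative, not Smile's implementation):

```java
import java.util.function.DoubleUnaryOperator;

public class Compose {
    // pipeline(t1, ..., tn) applies stages left to right: T_n ∘ ... ∘ T_1.
    static DoubleUnaryOperator pipeline(DoubleUnaryOperator... stages) {
        DoubleUnaryOperator composed = DoubleUnaryOperator.identity();
        for (DoubleUnaryOperator t : stages) composed = composed.andThen(t);
        return composed;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator center = x -> x - 5.0;        // T1
        DoubleUnaryOperator scale  = x -> x / 2.0;        // T2
        DoubleUnaryOperator t = pipeline(center, scale);  // T2 ∘ T1
        System.out.println(t.applyAsDouble(9.0));         // (9 - 5) / 2
    }
}
```

andThen runs the receiver first and its argument second, which is why chaining in list order yields $T_n \circ \cdots \circ T_1$.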

Fit Pipeline

The static Transform.fit(data, trainers...) method implements the fit-then-apply pattern for multi-stage pipelines. Each trainer receives the data as transformed by all previous stages:

  1. Fit $T_1$ on the original data $D$
  2. Apply $T_1$ to get $D_1 = T_1(D)$
  3. Fit $T_2$ on $D_1$
  4. Compose: $T = T_2 \circ T_1$
  5. Continue for all trainers

This ensures that each transformer sees data in the correct scale from previous stages.
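The steps above can be sketched with scalar transforms. This is illustrative only: Smile's Transform.fit works on DataFrames and Transform trainers, whereas this sketch uses DoubleUnaryOperator and a hypothetical Trainer interface:

```java
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

public class FitPipeline {
    // A "trainer" fits a transform to data and returns the fitted transform.
    interface Trainer extends Function<double[], DoubleUnaryOperator> {}

    static DoubleUnaryOperator fit(double[] data, Trainer... trainers) {
        DoubleUnaryOperator composed = DoubleUnaryOperator.identity();
        double[] current = data.clone();
        for (Trainer trainer : trainers) {
            DoubleUnaryOperator t = trainer.apply(current);  // fit on already-transformed data
            for (int i = 0; i < current.length; i++)
                current[i] = t.applyAsDouble(current[i]);    // D_k = T_k(D_{k-1})
            composed = composed.andThen(t);                  // T = T_k ∘ ... ∘ T_1
        }
        return composed;
    }
}
```

The key point: if the first trainer centers {0, 10} to {-5, 5}, the second trainer is fitted on {-5, 5}, not on the raw data; fitting both stages on the raw data would produce a different (and wrong) composite transform.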

Invertibility

Transforms that implement InvertibleTransform support the inverse operation:

$$T^{-1}(T(x)) = x$$

This is essential for:

  • Interpreting predictions -- Converting model outputs back to the original scale.
  • Visualization -- Displaying results in human-readable units.
  • Post-processing -- Undoing normalization after prediction.

The column-level implementation stores both the forward function $f(x)$ and its inverse $f^{-1}(x)$ for each transformed column.
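Storing a forward/inverse pair per column can be sketched as follows (an illustrative class, not Smile's internal implementation):

```java
import java.util.function.DoubleUnaryOperator;

public class InvertiblePair {
    final DoubleUnaryOperator forward;  // f(x)
    final DoubleUnaryOperator inverse;  // f^{-1}(x)

    InvertiblePair(DoubleUnaryOperator forward, DoubleUnaryOperator inverse) {
        this.forward = forward;
        this.inverse = inverse;
    }

    double apply(double x)  { return forward.applyAsDouble(x); }
    double invert(double y) { return inverse.applyAsDouble(y); }

    // Example: a min-max pair for a column whose training range was [min, max].
    static InvertiblePair minMax(double min, double max) {
        return new InvertiblePair(
            x -> (x - min) / (max - min),
            y -> y * (max - min) + min);
    }
}
```

A model trained on scaled targets can then report predictions in original units via invert, e.g. converting a predicted 0.5 back to the original scale.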

Relationship to the Data Loading Pipeline

Data Transformation is the fourth stage of the Smile Data Loading Pipeline:

  1. File Data Loading -- Read data from files.
  2. DataFrame Inspection -- Examine structure and metadata.
  3. Column Selection and Filtering -- Select relevant columns.
  4. Data Transformation -- Normalize and scale features. (current)
  5. Numerical Conversion -- Convert to numerical arrays/matrices.

Transformation follows column selection (which determines which features to include) and precedes numerical conversion (which produces the final double[][] or DenseMatrix for algorithms).


Metadata

Property      Value
------------  ---------------------
Domains       Data_Engineering, ETL
Workflow      Data_Loading_Pipeline
Stage         4 of 5
Last Updated  2026-02-08 22:00 GMT
