Implementation:Haifengl Smile Transform Pipeline
Overview
The Transform Pipeline implementation in Smile consists of three layers of interfaces and classes in the smile.data.transform and smile.feature.transform packages:
- Transform -- The core functional interface extending Function<Tuple, Tuple> with pipeline composition, batch application, and fit methods.
- ColumnTransform -- A concrete class that applies per-column transformations via a Map<String, Function>.
- InvertibleColumnTransform -- Extends ColumnTransform with inverse transform support for undoing transformations.
The column-wise transformers (Standardizer, Scaler, RobustStandardizer, MaxAbsScaler, WinsorScaler) are factory interfaces in the smile.feature.transform package whose static fit methods produce InvertibleColumnTransform instances. Normalizer, in the same package, is a class that implements Transform directly and operates row-wise.
API Summary
| Class/Interface | Method | Return Type | Description |
|---|---|---|---|
| Transform | apply(Tuple) | Tuple | Transform a single row |
| Transform | apply(DataFrame) | DataFrame | Transform an entire DataFrame |
| Transform | andThen(Transform) | Transform | Compose with another transform (this, then after) |
| Transform | compose(Transform) | Transform | Compose with another transform (before, then this) |
| Transform | fit(DataFrame, Function...) | Transform | Fit a pipeline of trainers sequentially |
| Transform | pipeline(Transform...) | Transform | Compose pre-built transforms into a pipeline |
| InvertibleTransform | invert(Tuple) | Tuple | Inverse transform a single row |
| InvertibleTransform | invert(DataFrame) | DataFrame | Inverse transform an entire DataFrame |
| Standardizer | fit(DataFrame, String...) | InvertibleColumnTransform | Fit z-score normalization |
| Scaler | fit(DataFrame, String...) | InvertibleColumnTransform | Fit min-max scaling to [0, 1] |
| RobustStandardizer | fit(DataFrame, String...) | InvertibleColumnTransform | Fit median/IQR standardization |
| MaxAbsScaler | fit(DataFrame, String...) | InvertibleColumnTransform | Fit max-absolute scaling to [-1, 1] |
| WinsorScaler | fit(DataFrame, double, double, String...) | InvertibleColumnTransform | Fit Winsorized scaling |
| Normalizer | constructor | Transform | Row-wise L1/L2/L-inf normalization |
Source Locations
| Class | File | Lines |
|---|---|---|
| Transform | base/src/main/java/smile/data/transform/Transform.java | L31-97 |
| InvertibleTransform | base/src/main/java/smile/data/transform/InvertibleTransform.java | L27-41 |
| ColumnTransform | base/src/main/java/smile/data/transform/ColumnTransform.java | L41-106 |
| InvertibleColumnTransform | base/src/main/java/smile/data/transform/InvertibleColumnTransform.java | L35-81 |
| Standardizer | core/src/main/java/smile/feature/transform/Standardizer.java | L37-92 |
| Scaler | core/src/main/java/smile/feature/transform/Scaler.java | L40-99 |
| Normalizer | core/src/main/java/smile/feature/transform/Normalizer.java | L36-106 |
| Repository | github.com/haifengl/smile | |
Import
// Core transform interfaces
import smile.data.transform.Transform;
import smile.data.transform.InvertibleTransform;
import smile.data.transform.ColumnTransform;
import smile.data.transform.InvertibleColumnTransform;
// Concrete transformers
import smile.feature.transform.Standardizer;
import smile.feature.transform.Scaler;
import smile.feature.transform.RobustStandardizer;
import smile.feature.transform.MaxAbsScaler;
import smile.feature.transform.WinsorScaler;
import smile.feature.transform.Normalizer;
Type: API Doc
Signature
Transform Interface
public interface Transform extends Function<Tuple, Tuple>, Serializable {
    // Fit a pipeline of transforms from trainer functions
    @SafeVarargs
    static Transform fit(DataFrame data,
                         Function<DataFrame, Transform>... trainers)
    // Compose pre-built transforms into a pipeline
    static Transform pipeline(Transform... transforms)
    // Apply transform to entire DataFrame (default: maps over rows)
    default DataFrame apply(DataFrame data)
    // Compose: this transform, then 'after'
    default Transform andThen(Transform after)
    // Compose: 'before' transform, then this
    default Transform compose(Transform before)
}
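The only difference between andThen and compose is the order in which the two transforms run. The plain-Java sketch below illustrates that ordering with a hypothetical `Step` interface over a single double value. It does not use Smile's types; `Step`, `ADD_ONE`, and `TIMES_TEN` are names invented for this illustration.

```java
/** Plain-Java illustration of composition order; Step is a hypothetical stand-in, not a Smile type. */
final class CompositionOrderSketch {
    @FunctionalInterface
    interface Step {
        double apply(double x);

        /** Runs this step first, then {@code after}. */
        default Step andThen(Step after) {
            return x -> after.apply(this.apply(x));
        }

        /** Runs {@code before} first, then this step. */
        default Step compose(Step before) {
            return x -> this.apply(before.apply(x));
        }
    }

    static final Step ADD_ONE = x -> x + 1;
    static final Step TIMES_TEN = x -> x * 10;

    public static void main(String[] args) {
        // andThen: (2 + 1) * 10
        System.out.println(ADD_ONE.andThen(TIMES_TEN).apply(2)); // prints 30.0
        // compose: (2 * 10) + 1
        System.out.println(ADD_ONE.compose(TIMES_TEN).apply(2)); // prints 21.0
    }
}
```

For a preprocessing pipeline this means `a.andThen(b)` and `b.compose(a)` describe the same chain, with `a` applied to the raw row first.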
InvertibleTransform Interface
public interface InvertibleTransform extends Transform {
    Tuple invert(Tuple x)
    DataFrame invert(DataFrame data)
}
ColumnTransform Class
public class ColumnTransform implements Transform {
    public ColumnTransform(String name,
                           Map<String, Function> transforms)
    @Override
    public Tuple apply(Tuple x)
    @Override
    public DataFrame apply(DataFrame data) // optimized column-wise
}
InvertibleColumnTransform Class
public class InvertibleColumnTransform
        extends ColumnTransform
        implements InvertibleTransform {
    public InvertibleColumnTransform(String name,
                                     Map<String, Function> transforms,
                                     Map<String, Function> inverses)
    @Override
    public Tuple invert(Tuple x)
    @Override
    public DataFrame invert(DataFrame data)
}
Concrete Transformer Factories
// Z-score normalization: (x - mean) / stdev
public interface Standardizer {
    static InvertibleColumnTransform fit(DataFrame data,
                                         String... columns)
}
// Min-max scaling to [0, 1]: (x - min) / (max - min)
public interface Scaler {
    static InvertibleColumnTransform fit(DataFrame data,
                                         String... columns)
}
// Row-wise L1/L2/L-inf normalization
public class Normalizer implements Transform {
    public enum Norm { L1, L2, L_INF }
    public Normalizer(Norm norm, String... columns)
}
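To make the two column formulas in the comments above concrete, here is a small plain-Java worked example on a single column. It is a sketch independent of Smile: the helper names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption of this example, not a statement about the library's estimator.

```java
import java.util.Arrays;

/** Worked example of the z-score and min-max formulas; helper names are hypothetical. */
final class ScalingFormulasSketch {
    /** (x - mean) / stdev, using the sample standard deviation (n - 1); needs at least 2 values. */
    static double[] standardize(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double sumSq = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum();
        double sd = Math.sqrt(sumSq / (x.length - 1));
        return Arrays.stream(x).map(v -> (v - mean) / sd).toArray();
    }

    /** (x - min) / (max - min); a constant column would divide by zero and is not handled here. */
    static double[] minMax(double[] x) {
        double min = Arrays.stream(x).min().orElse(0.0);
        double max = Arrays.stream(x).max().orElse(1.0);
        return Arrays.stream(x).map(v -> (v - min) / (max - min)).toArray();
    }

    public static void main(String[] args) {
        double[] column = {2, 4, 6};
        System.out.println(Arrays.toString(standardize(column))); // prints [-1.0, 0.0, 1.0]
        System.out.println(Arrays.toString(minMax(column)));      // prints [0.0, 0.5, 1.0]
    }
}
```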
Inputs and Outputs
| Method | Input | Output | Notes |
|---|---|---|---|
| Transform.fit(data, trainers) | DataFrame + trainer functions | Composed Transform | Each trainer sees data from prior stages |
| Transform.pipeline(transforms) | Pre-built Transform objects | Composed Transform | No fitting; just composition |
| transform.apply(data) | DataFrame | Transformed DataFrame | ColumnTransform operates column-wise in parallel |
| invertible.invert(data) | Transformed DataFrame | Original-scale DataFrame | Only for InvertibleTransform |
| Standardizer.fit(data, cols) | DataFrame + column names | InvertibleColumnTransform | Computes mean/stdev per column |
| Scaler.fit(data, cols) | DataFrame + column names | InvertibleColumnTransform | Computes min/max per column |
Usage Examples
Example 1: Standardize all numeric columns
import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;
DataFrame iris = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Fit standardizer on all numeric columns
InvertibleColumnTransform standardizer =
Standardizer.fit(iris, "sepal_length", "sepal_width",
"petal_length", "petal_width");
// Apply to training data
DataFrame standardized = standardizer.apply(iris);
System.out.println(standardized.head(5));
// Invert back to original scale
DataFrame original = standardizer.invert(standardized);
System.out.println(original.head(5));
Example 2: Min-max scaling to [0, 1]
import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Scaler;
import smile.data.transform.InvertibleColumnTransform;
DataFrame data = Read.csv("data/housing.csv",
"delimiter=,,header=true");
// Fit scaler; passing no column names selects all numeric columns
InvertibleColumnTransform scaler = Scaler.fit(data);
DataFrame scaled = scaler.apply(data);
// All numeric values are now in [0, 1]
System.out.println(scaled.describe());
Example 3: Building a transform pipeline with fit()
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;
import smile.feature.transform.Scaler;
DataFrame data = Read.csv("data/features.csv",
"delimiter=,,header=true");
// Build a pipeline: first standardize, then scale to [0,1]
Transform pipeline = Transform.fit(data,
df -> Standardizer.fit(df),
df -> Scaler.fit(df)
);
// Apply the composed pipeline
DataFrame transformed = pipeline.apply(data);
// Apply to new data (e.g., test set) with same parameters
DataFrame testData = Read.csv("data/test_features.csv",
"delimiter=,,header=true");
DataFrame transformedTest = pipeline.apply(testData);
Example 4: Composing transforms with andThen()
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;
import smile.feature.transform.Normalizer;
DataFrame data = Read.csv("data/text_features.csv",
"delimiter=,,header=true");
String[] numericCols = {"tf_idf_1", "tf_idf_2", "tf_idf_3"};
// Column-wise standardization
Transform standardize = Standardizer.fit(data, numericCols);
// Row-wise L2 normalization
Transform normalize = new Normalizer(Normalizer.Norm.L2, numericCols);
// Compose: standardize first, then normalize each row
Transform composed = standardize.andThen(normalize);
DataFrame result = composed.apply(data);
System.out.println(result.head(5));
Example 5: Using Transform.pipeline() with pre-built transforms
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;
DataFrame trainData = Read.csv("data/train.csv",
"delimiter=,,header=true");
// Fit individual transforms on training data
Transform step1 = Standardizer.fit(trainData, "feature_a", "feature_b");
Transform step2 = Standardizer.fit(
step1.apply(trainData), "feature_c", "feature_d");
// Combine into a single pipeline object
Transform pipeline = Transform.pipeline(step1, step2);
// Apply to production data (Transform is Serializable, so the pipeline could be persisted first)
DataFrame production = Read.csv("data/production.csv",
"delimiter=,,header=true");
DataFrame result = pipeline.apply(production);
Example 6: Standardize only specific columns
import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;
DataFrame data = Read.csv("data/mixed.csv",
"delimiter=,,header=true");
// Standardize only the age and income columns
InvertibleColumnTransform transform =
Standardizer.fit(data, "age", "income");
DataFrame result = transform.apply(data);
// The "age" and "income" columns are standardized;
// all other columns are passed through unchanged
System.out.println(result.head(5));
System.out.println(transform); // prints formula per column
Implementation Details
Transform.fit() -- Sequential Pipeline Fitting
The fit() static method implements a sequential fit-apply loop:
// From Transform.java
static Transform fit(DataFrame data,
                     Function<DataFrame, Transform>... trainers) {
    Transform pipeline = trainers[0].apply(data);
    for (int i = 1; i < trainers.length; i++) {
        data = pipeline.apply(data);
        pipeline = pipeline.andThen(trainers[i].apply(data));
    }
    return pipeline;
}
Each trainer function receives the data as transformed by all preceding stages, ensuring that statistics (mean, std, min, max) are computed on the correct scale.
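The fit-apply idea can be illustrated on a single numeric column in plain Java. The sketch below is independent of Smile and simplified: `fitPipeline`, `fitStandardizer`, and `fitMinMax` are hypothetical helpers, a "trainer" is modeled as a function from a column to a fitted `DoubleUnaryOperator`, and the working copy is advanced by one stage per iteration so that the next trainer fits on the already-transformed scale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

/** Simplified fit-apply loop over one column; all names here are hypothetical. */
final class SequentialFitSketch {
    /** Fits a z-score step (sample standard deviation, an assumption of this sketch). */
    static DoubleUnaryOperator fitStandardizer(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double sumSq = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum();
        double sd = Math.sqrt(sumSq / (x.length - 1));
        return v -> (v - mean) / sd;
    }

    /** Fits a min-max step on the values it is given. */
    static DoubleUnaryOperator fitMinMax(double[] x) {
        double min = Arrays.stream(x).min().orElse(0.0);
        double max = Arrays.stream(x).max().orElse(1.0);
        return v -> (v - min) / (max - min);
    }

    /** Fits each trainer on the output of the stages fitted before it, then chains the stages. */
    static DoubleUnaryOperator fitPipeline(double[] column,
                                           List<Function<double[], DoubleUnaryOperator>> trainers) {
        double[] working = column.clone();
        DoubleUnaryOperator pipeline = DoubleUnaryOperator.identity();
        for (Function<double[], DoubleUnaryOperator> trainer : trainers) {
            DoubleUnaryOperator stage = trainer.apply(working);     // fit on the current scale
            working = Arrays.stream(working).map(stage).toArray();  // advance the working copy
            pipeline = pipeline.andThen(stage);                     // append to the chain
        }
        return pipeline;
    }

    public static void main(String[] args) {
        List<Function<double[], DoubleUnaryOperator>> trainers =
                List.of(SequentialFitSketch::fitStandardizer, SequentialFitSketch::fitMinMax);
        DoubleUnaryOperator pipeline = fitPipeline(new double[] {2, 4, 6}, trainers);
        System.out.println(pipeline.applyAsDouble(4.0)); // prints 0.5
        System.out.println(pipeline.applyAsDouble(8.0)); // prints 1.5 (outside the fitted range)
    }
}
```

The second call shows why the fitted parameters matter: new data is mapped with the training statistics, so values outside the training range can fall outside [0, 1].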
ColumnTransform.apply(DataFrame) -- Optimized Batch Transform
The ColumnTransform overrides the default row-wise apply(DataFrame) with an optimized column-wise implementation that processes columns in parallel using IntStream.parallel(). Only columns with registered transform functions are modified; all others are passed through unchanged.
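A minimal plain-Java sketch of this column-wise strategy follows. It assumes column-major storage as `double[][]` and a hypothetical `apply` helper; it is not Smile's implementation, only an illustration of looking up a per-column function by name, processing columns in a parallel `IntStream`, and passing unregistered columns through.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

/** Column-wise batch transform over column-major data; a sketch, not Smile's code. */
final class ColumnWiseSketch {
    static double[][] apply(String[] names, double[][] columns,
                            Map<String, DoubleUnaryOperator> transforms) {
        double[][] out = new double[columns.length][];
        // Each index j writes only out[j], so the parallel loop has no write conflicts.
        IntStream.range(0, columns.length).parallel().forEach(j -> {
            DoubleUnaryOperator f = transforms.get(names[j]);
            out[j] = (f == null)
                    ? columns[j]                                   // pass through (same array reference)
                    : Arrays.stream(columns[j]).map(f).toArray();  // transformed copy
        });
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"age", "city_code"};
        double[][] columns = {{10, 20}, {1, 2}};
        Map<String, DoubleUnaryOperator> transforms = Map.of("age", v -> v / 10.0);
        double[][] out = apply(names, columns, transforms);
        System.out.println(Arrays.toString(out[0])); // prints [1.0, 2.0]
        System.out.println(Arrays.toString(out[1])); // prints [1.0, 2.0] (unchanged input column)
    }
}
```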
Invertibility
The InvertibleColumnTransform stores a parallel Map<String, Function> of inverse functions. For standardization with mean mu and standard deviation scale:
- Forward: (double x) -> (x - mu) / scale
- Inverse: (double x) -> x * scale + mu
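The forward and inverse maps above can be checked with a short round trip. The sketch below is plain Java with hypothetical helper names and illustrative values for mu and scale; it does not call Smile.

```java
import java.util.function.DoubleUnaryOperator;

/** Round trip through a forward z-score map and its inverse; names and values are illustrative. */
final class InverseRoundTripSketch {
    static DoubleUnaryOperator forward(double mu, double scale) {
        return x -> (x - mu) / scale;
    }

    static DoubleUnaryOperator inverse(double mu, double scale) {
        return z -> z * scale + mu;
    }

    public static void main(String[] args) {
        double mu = 4.0, scale = 2.0;                          // assumed column statistics
        double z = forward(mu, scale).applyAsDouble(7.0);      // (7 - 4) / 2
        double x = inverse(mu, scale).applyAsDouble(z);        // 1.5 * 2 + 4
        System.out.println(z + " -> " + x);                    // prints 1.5 -> 7.0
    }
}
```

With arbitrary floating-point inputs the round trip is exact only up to rounding error, so comparisons against the original values should use a tolerance.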
The Normalizer class is not invertible: row-wise normalization divides each row by that row's own norm, so the scaling factor differs from row to row and no fixed per-column inverse function exists.
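A small plain-Java example shows why row-wise normalization cannot be undone: two different rows that point in the same direction collapse to the same unit vector. The `l2Normalize` helper is hypothetical and stands in for the L2 case only.

```java
import java.util.Arrays;

/** Demonstrates that L2 row normalization maps distinct rows to the same output. */
final class L2NormalizeSketch {
    static double[] l2Normalize(double[] row) {
        double norm = Math.sqrt(Arrays.stream(row).map(v -> v * v).sum());
        return Arrays.stream(row).map(v -> v / norm).toArray();
    }

    public static void main(String[] args) {
        double[] a = l2Normalize(new double[] {3, 4});   // norm 5
        double[] b = l2Normalize(new double[] {6, 8});   // norm 10
        System.out.println(Arrays.toString(a));          // prints [0.6, 0.8]
        System.out.println(Arrays.equals(a, b));         // prints true
    }
}
```

Since both inputs produce the same output, the original magnitudes cannot be recovered from the normalized rows alone.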
Metadata
| Property | Value |
|---|---|
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |