Implementation:Haifengl Smile Transform Pipeline

From Leeroopedia


Overview

The Transform Pipeline implementation in Smile is built from three layers of interfaces and classes in the smile.data.transform package:

  1. Transform -- The core functional interface. It extends Function<Tuple, Tuple> and adds pipeline composition, batch application to a DataFrame, and static fit and pipeline methods.
  2. ColumnTransform -- A concrete class that applies per-column transformations held in a Map<String, Function>.
  3. InvertibleColumnTransform -- Extends ColumnTransform and implements InvertibleTransform, adding invert methods that undo the transformation.

The concrete transformers live in the smile.feature.transform package. Standardizer, Scaler, RobustStandardizer, MaxAbsScaler, and WinsorScaler are factory interfaces whose static fit methods return InvertibleColumnTransform instances. Normalizer is a class that implements Transform directly.

API Summary

Class/Interface | Method | Return Type | Description
Transform | apply(Tuple) | Tuple | Transform a single row
Transform | apply(DataFrame) | DataFrame | Transform an entire DataFrame
Transform | andThen(Transform) | Transform | Compose with another transform (this, then after)
Transform | compose(Transform) | Transform | Compose with another transform (before, then this)
Transform | fit(DataFrame, Function...) | Transform | Fit a pipeline of trainers sequentially
Transform | pipeline(Transform...) | Transform | Compose pre-built transforms into a pipeline
InvertibleTransform | invert(Tuple) | Tuple | Inverse transform a single row
InvertibleTransform | invert(DataFrame) | DataFrame | Inverse transform an entire DataFrame
Standardizer | fit(DataFrame, String...) | InvertibleColumnTransform | Fit z-score normalization
Scaler | fit(DataFrame, String...) | InvertibleColumnTransform | Fit min-max scaling to [0, 1]
RobustStandardizer | fit(DataFrame, String...) | InvertibleColumnTransform | Fit median/IQR standardization
MaxAbsScaler | fit(DataFrame, String...) | InvertibleColumnTransform | Fit max-absolute scaling to [-1, 1]
WinsorScaler | fit(DataFrame, double, double, String...) | InvertibleColumnTransform | Fit Winsorized scaling
Normalizer | constructor | Transform | Row-wise L1/L2/L-inf normalization
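
As a rough illustration of two of the less familiar entries in the table, the following plain-Java sketch applies max-absolute scaling and median/IQR standardization to a single double[] column. It does not call Smile. The class and method names are hypothetical, the quantile uses linear interpolation (Smile may compute quartiles differently), and the zero-scale guards are assumptions.

```java
import java.util.Arrays;

final class ScalingFormulasSketch {
    // Max-absolute scaling: divide by the largest |x| so values land in [-1, 1].
    static double[] maxAbsScale(double[] column) {
        double maxAbs = 0.0;
        for (double v : column) maxAbs = Math.max(maxAbs, Math.abs(v));
        final double scale = (maxAbs == 0.0) ? 1.0 : maxAbs; // guard for an all-zero column (assumption)
        return Arrays.stream(column).map(v -> v / scale).toArray();
    }

    // Quantile by linear interpolation on a sorted copy (one of several conventions).
    static double quantile(double[] column, double p) {
        double[] sorted = column.clone();
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Median/IQR standardization: (x - median) / (Q3 - Q1).
    static double[] robustStandardize(double[] column) {
        final double median = quantile(column, 0.5);
        double iqr = quantile(column, 0.75) - quantile(column, 0.25);
        final double scale = (iqr == 0.0) ? 1.0 : iqr; // guard for a constant column (assumption)
        return Arrays.stream(column).map(v -> (v - median) / scale).toArray();
    }

    public static void main(String[] args) {
        double[] column = {-4.0, 1.0, 2.0, 3.0, 100.0}; // 100.0 acts as an outlier
        System.out.println(Arrays.toString(maxAbsScale(column)));
        System.out.println(Arrays.toString(robustStandardize(column)));
    }
}
```

In this toy column the outlier dominates the max-absolute scale, while the median and IQR are unaffected by it, which is the usual motivation for the robust variant.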

Source Locations

Class | File | Lines
Transform | base/src/main/java/smile/data/transform/Transform.java | L31-97
InvertibleTransform | base/src/main/java/smile/data/transform/InvertibleTransform.java | L27-41
ColumnTransform | base/src/main/java/smile/data/transform/ColumnTransform.java | L41-106
InvertibleColumnTransform | base/src/main/java/smile/data/transform/InvertibleColumnTransform.java | L35-81
Standardizer | core/src/main/java/smile/feature/transform/Standardizer.java | L37-92
Scaler | core/src/main/java/smile/feature/transform/Scaler.java | L40-99
Normalizer | core/src/main/java/smile/feature/transform/Normalizer.java | L36-106

Repository: github.com/haifengl/smile

Import

// Core transform interfaces
import smile.data.transform.Transform;
import smile.data.transform.InvertibleTransform;
import smile.data.transform.ColumnTransform;
import smile.data.transform.InvertibleColumnTransform;

// Concrete transformers
import smile.feature.transform.Standardizer;
import smile.feature.transform.Scaler;
import smile.feature.transform.RobustStandardizer;
import smile.feature.transform.MaxAbsScaler;
import smile.feature.transform.WinsorScaler;
import smile.feature.transform.Normalizer;

Type: API Doc

Signature

Transform Interface

public interface Transform extends Function<Tuple, Tuple>, Serializable {

    // Fit a pipeline of transforms from trainer functions
    @SafeVarargs
    static Transform fit(DataFrame data,
        Function<DataFrame, Transform>... trainers)

    // Compose pre-built transforms into a pipeline
    static Transform pipeline(Transform... transforms)

    // Apply transform to entire DataFrame (default: maps over rows)
    default DataFrame apply(DataFrame data)

    // Compose: this transform, then 'after'
    default Transform andThen(Transform after)

    // Compose: 'before' transform, then this
    default Transform compose(Transform before)
}
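
The ordering of andThen and compose described in the comments above follows the same convention as java.util.function.Function. The sketch below demonstrates that convention with two toy Function<Double, Double> steps. It uses only the JDK, and the class and constant names are hypothetical stand-ins, not Smile types.

```java
import java.util.function.Function;

final class CompositionOrderSketch {
    // Two toy steps standing in for transforms (hypothetical, not Smile classes).
    static final Function<Double, Double> ADD_ONE = x -> x + 1.0;
    static final Function<Double, Double> TIMES_TEN = x -> x * 10.0;

    // this.andThen(after): apply 'this' first, then 'after'.
    static double addThenMultiply(double x) {
        return ADD_ONE.andThen(TIMES_TEN).apply(x);
    }

    // this.compose(before): apply 'before' first, then 'this'.
    static double multiplyThenAdd(double x) {
        return ADD_ONE.compose(TIMES_TEN).apply(x);
    }

    public static void main(String[] args) {
        System.out.println(addThenMultiply(2.0)); // (2 + 1) * 10
        System.out.println(multiplyThenAdd(2.0)); // (2 * 10) + 1
    }
}
```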

InvertibleTransform Interface

public interface InvertibleTransform extends Transform {
    Tuple invert(Tuple x)
    DataFrame invert(DataFrame data)
}

ColumnTransform Class

public class ColumnTransform implements Transform {
    // Function is smile.util.function.Function (double -> double),
    // not java.util.function.Function
    public ColumnTransform(String name,
        Map<String, Function> transforms)

    @Override
    public Tuple apply(Tuple x)

    @Override
    public DataFrame apply(DataFrame data)  // optimized column-wise
}

InvertibleColumnTransform Class

public class InvertibleColumnTransform
        extends ColumnTransform
        implements InvertibleTransform {

    public InvertibleColumnTransform(String name,
        Map<String, Function> transforms,
        Map<String, Function> inverses)

    @Override
    public Tuple invert(Tuple x)

    @Override
    public DataFrame invert(DataFrame data)
}

Concrete Transformer Factories

// Z-score normalization: (x - mean) / stdev
public interface Standardizer {
    static InvertibleColumnTransform fit(DataFrame data,
        String... columns)
}

// Min-max scaling to [0, 1]: (x - min) / (max - min)
public interface Scaler {
    static InvertibleColumnTransform fit(DataFrame data,
        String... columns)
}

// Row-wise L1/L2/L-inf normalization
public class Normalizer implements Transform {
    public enum Norm { L1, L2, L_INF }
    public Normalizer(Norm norm, String... columns)
}
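
To make the three Norm options concrete, here is a minimal plain-Java sketch of row-wise normalization on a double[] row. It is not Smile's Normalizer: the class name is hypothetical, and leaving an all-zero row unchanged is an assumption made for the example.

```java
import java.util.Arrays;

final class RowNormSketch {
    enum Norm { L1, L2, L_INF }

    // Divide every entry of a row by the chosen norm of that same row.
    static double[] normalize(double[] row, Norm norm) {
        double scale = 0.0;
        for (double v : row) {
            switch (norm) {
                case L1:    scale += Math.abs(v); break;                  // sum of |x|
                case L2:    scale += v * v; break;                        // sum of squares
                case L_INF: scale = Math.max(scale, Math.abs(v)); break;  // largest |x|
            }
        }
        if (norm == Norm.L2) scale = Math.sqrt(scale);
        if (scale == 0.0) return row.clone(); // all-zero row left as is (assumption)
        final double s = scale;
        return Arrays.stream(row).map(v -> v / s).toArray();
    }

    public static void main(String[] args) {
        double[] row = {3.0, -4.0};
        System.out.println(Arrays.toString(normalize(row, Norm.L1)));    // divides by 7
        System.out.println(Arrays.toString(normalize(row, Norm.L2)));    // divides by 5
        System.out.println(Arrays.toString(normalize(row, Norm.L_INF))); // divides by 4
    }
}
```

Unlike the column transformers, nothing here is learned from training data; each row carries everything needed to normalize it.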

Inputs and Outputs

Method | Input | Output | Notes
Transform.fit(data, trainers) | DataFrame + trainer functions | Composed Transform | Each trainer sees data from prior stages
Transform.pipeline(transforms) | Pre-built Transform objects | Composed Transform | No fitting; just composition
transform.apply(data) | DataFrame | Transformed DataFrame | ColumnTransform operates column-wise in parallel
invertible.invert(data) | Transformed DataFrame | Original-scale DataFrame | Only for InvertibleTransform
Standardizer.fit(data, cols) | DataFrame + column names | InvertibleColumnTransform | Computes mean/stdev per column
Scaler.fit(data, cols) | DataFrame + column names | InvertibleColumnTransform | Computes min/max per column

Usage Examples

Example 1: Standardize all numeric columns

import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Fit standardizer on all numeric columns
InvertibleColumnTransform standardizer =
    Standardizer.fit(iris, "sepal_length", "sepal_width",
        "petal_length", "petal_width");

// Apply to training data
DataFrame standardized = standardizer.apply(iris);
System.out.println(standardized.head(5));

// Invert back to original scale
DataFrame original = standardizer.invert(standardized);
System.out.println(original.head(5));

Example 2: Min-max scaling to [0, 1]

import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Scaler;
import smile.data.transform.InvertibleColumnTransform;

DataFrame data = Read.csv("data/housing.csv",
    "delimiter=,,header=true");

// Fit scaler -- with no column names given, all numeric columns are used
InvertibleColumnTransform scaler = Scaler.fit(data);

DataFrame scaled = scaler.apply(data);
// All numeric values are now in [0, 1]
System.out.println(scaled.describe());

Example 3: Building a transform pipeline with fit()

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;
import smile.feature.transform.Scaler;

DataFrame data = Read.csv("data/features.csv",
    "delimiter=,,header=true");

// Build a pipeline: first standardize, then scale to [0,1]
Transform pipeline = Transform.fit(data,
    df -> Standardizer.fit(df),
    df -> Scaler.fit(df)
);

// Apply the composed pipeline
DataFrame transformed = pipeline.apply(data);

// Apply to new data (e.g., test set) with same parameters
DataFrame testData = Read.csv("data/test_features.csv",
    "delimiter=,,header=true");
DataFrame transformedTest = pipeline.apply(testData);

Example 4: Composing transforms with andThen()

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;
import smile.feature.transform.Normalizer;

DataFrame data = Read.csv("data/text_features.csv",
    "delimiter=,,header=true");

String[] numericCols = {"tf_idf_1", "tf_idf_2", "tf_idf_3"};

// Column-wise standardization
Transform standardize = Standardizer.fit(data, numericCols);

// Row-wise L2 normalization
Transform normalize = new Normalizer(Normalizer.Norm.L2, numericCols);

// Compose: standardize first, then normalize each row
Transform composed = standardize.andThen(normalize);

DataFrame result = composed.apply(data);
System.out.println(result.head(5));

Example 5: Using Transform.pipeline() with pre-built transforms

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.transform.Transform;
import smile.feature.transform.Standardizer;

DataFrame trainData = Read.csv("data/train.csv",
    "delimiter=,,header=true");

// Fit individual transforms on training data
Transform step1 = Standardizer.fit(trainData, "feature_a", "feature_b");
Transform step2 = Standardizer.fit(
    step1.apply(trainData), "feature_c", "feature_d");

// Combine into a single pipeline object
Transform pipeline = Transform.pipeline(step1, step2);

// Apply the combined pipeline to production data
DataFrame production = Read.csv("data/production.csv",
    "delimiter=,,header=true");
DataFrame result = pipeline.apply(production);

Example 6: Standardize only specific columns

import smile.io.Read;
import smile.data.DataFrame;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;

DataFrame data = Read.csv("data/mixed.csv",
    "delimiter=,,header=true");

// Standardize only the age and income columns
InvertibleColumnTransform transform =
    Standardizer.fit(data, "age", "income");

DataFrame result = transform.apply(data);

// The "age" and "income" columns are standardized;
// all other columns are passed through unchanged
System.out.println(result.head(5));
System.out.println(transform);  // prints formula per column

Implementation Details

Transform.fit() -- Sequential Pipeline Fitting

The fit() static method implements a sequential fit-apply loop:

// From Transform.java
static Transform fit(DataFrame data,
        Function<DataFrame, Transform>... trainers) {
    Transform pipeline = trainers[0].apply(data);
    for (int i = 1; i < trainers.length; i++) {
        data = pipeline.apply(data);
        pipeline = pipeline.andThen(trainers[i].apply(data));
    }
    return pipeline;
}

Each trainer function receives the data as transformed by all preceding stages, ensuring that statistics (mean, std, min, max) are computed on the correct scale.
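
The effect of the loop can be reproduced on a single double[] column without Smile. In the sketch below, DoubleUnaryOperator stands in for Transform and the two trainer methods are hypothetical. The use of the sample standard deviation is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

final class SequentialFitSketch {
    // Trainer: learns mean and standard deviation, returns (x - mean) / sd.
    static DoubleUnaryOperator fitStandardizer(double[] column) {
        final double mean = Arrays.stream(column).average().orElse(0.0);
        double ss = Arrays.stream(column).map(v -> (v - mean) * (v - mean)).sum();
        final double sd = Math.sqrt(ss / (column.length - 1)); // sample sd (assumption)
        return x -> (x - mean) / sd;
    }

    // Trainer: learns min and max, returns (x - min) / (max - min).
    static DoubleUnaryOperator fitMinMax(double[] column) {
        final double min = Arrays.stream(column).min().orElse(0.0);
        final double max = Arrays.stream(column).max().orElse(1.0);
        return x -> (x - min) / (max - min);
    }

    // Sequential fit-apply loop on one column instead of a DataFrame.
    static DoubleUnaryOperator fit(double[] column,
            List<Function<double[], DoubleUnaryOperator>> trainers) {
        DoubleUnaryOperator pipeline = trainers.get(0).apply(column);
        for (int i = 1; i < trainers.size(); i++) {
            // Transform the original data with everything fitted so far ...
            double[] transformed = Arrays.stream(column).map(pipeline).toArray();
            // ... so the next trainer computes its statistics on that scale.
            pipeline = pipeline.andThen(trainers.get(i).apply(transformed));
        }
        return pipeline;
    }

    public static void main(String[] args) {
        double[] column = {10.0, 20.0, 30.0, 40.0};
        List<Function<double[], DoubleUnaryOperator>> trainers =
            List.of(SequentialFitSketch::fitStandardizer, SequentialFitSketch::fitMinMax);
        DoubleUnaryOperator pipeline = fit(column, trainers);
        System.out.println(Arrays.toString(Arrays.stream(column).map(pipeline).toArray()));
    }
}
```

The sketch re-applies the cumulative pipeline to the original column on each pass. The listing above instead reassigns data; the two agree for two trainers, while with three or more trainers the listing as written would feed already-transformed data through the earlier stages again.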

ColumnTransform.apply(DataFrame) -- Optimized Batch Transform

The ColumnTransform overrides the default row-wise apply(DataFrame) with an optimized column-wise implementation that processes columns in parallel using IntStream.parallel(). Only columns with registered transform functions are modified; all others are passed through unchanged.
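
The idea can be sketched with a column-major double[][] table and IntStream.range(...).parallel(). This is only an illustration of the pattern: Smile's actual implementation works on its own DataFrame column types, and the class name, the array layout, and the use of DoubleUnaryOperator are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

final class ColumnWiseApplySketch {
    // Apply per-column functions to a column-major table. Columns with no
    // entry in 'transforms' are passed through unchanged (same array reference).
    static double[][] apply(String[] names, double[][] columns,
            Map<String, DoubleUnaryOperator> transforms) {
        double[][] out = new double[columns.length][];
        // Columns are independent, so each index can be handled in parallel;
        // every task writes only to its own slot of 'out'.
        IntStream.range(0, columns.length).parallel().forEach(j -> {
            DoubleUnaryOperator f = transforms.get(names[j]);
            out[j] = (f == null)
                ? columns[j]
                : Arrays.stream(columns[j]).map(f).toArray();
        });
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"age", "id"};
        double[][] columns = {{1.0, 2.0, 3.0}, {7.0, 8.0, 9.0}};
        Map<String, DoubleUnaryOperator> transforms = Map.of("age", x -> x * 2.0);
        double[][] result = apply(names, columns, transforms);
        System.out.println(Arrays.toString(result[0])); // transformed column
        System.out.println(Arrays.toString(result[1])); // untouched column
    }
}
```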

Invertibility

The InvertibleColumnTransform stores a second Map<String, Function> of inverse functions, keyed by the same column names as the forward map. For standardization, with mu the column mean (μ) and scale the column standard deviation (σ):

  • Forward: (double x) -> (x - mu) / scale
  • Inverse: (double x) -> x * scale + mu
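
The forward and inverse pair above can be checked with a small round trip. The sketch keeps two maps keyed by column name, using DoubleUnaryOperator as a stand-in for Smile's function type. The class and its methods are hypothetical, not the InvertibleColumnTransform API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

final class InvertibleColumnsSketch {
    final Map<String, DoubleUnaryOperator> transforms = new HashMap<>();
    final Map<String, DoubleUnaryOperator> inverses = new HashMap<>();

    // Register a z-score pair for one column; both lambdas share mu and scale.
    void addStandardizer(String column, double mu, double scale) {
        transforms.put(column, x -> (x - mu) / scale);
        inverses.put(column, x -> x * scale + mu);
    }

    double apply(String column, double x) {
        return transforms.get(column).applyAsDouble(x);
    }

    double invert(String column, double z) {
        return inverses.get(column).applyAsDouble(z);
    }

    public static void main(String[] args) {
        InvertibleColumnsSketch t = new InvertibleColumnsSketch();
        t.addStandardizer("age", 50.0, 10.0);
        double z = t.apply("age", 65.0);        // (65 - 50) / 10
        System.out.println(z);
        System.out.println(t.invert("age", z)); // back to the original scale
    }
}
```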

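The Normalizer class is not invertible: row-wise normalization divides each row by its own norm and does not keep that norm, so the original magnitudes cannot be recovered. It is also not a column-local operation, so it does not fit the per-column inverse map.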
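
A small counterexample shows why no inverse can exist for row-wise normalization: two different rows can map to the same output. The sketch uses a hypothetical plain-Java L2 normalization on double[] rows rather than Smile's Normalizer.

```java
import java.util.Arrays;

final class NormalizerInverseSketch {
    // Divide a row by its Euclidean (L2) norm; assumes the row is not all zeros.
    static double[] l2Normalize(double[] row) {
        double sumSquares = 0.0;
        for (double v : row) sumSquares += v * v;
        final double norm = Math.sqrt(sumSquares);
        return Arrays.stream(row).map(v -> v / norm).toArray();
    }

    public static void main(String[] args) {
        double[] a = l2Normalize(new double[] {3.0, 4.0}); // norm 5
        double[] b = l2Normalize(new double[] {6.0, 8.0}); // norm 10
        // Both rows point in the same direction, so they normalize identically
        // and the original lengths (5 and 10) are lost.
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
    }
}
```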

Metadata

Property | Value
Type | API Doc
Language | Java
Library Version | 5.2.0
Last Updated | 2026-02-08 22:00 GMT