
Principle:Scikit-learn Feature Transformation

From Leeroopedia



Overview

A data conditioning process that converts raw features into a normalized, encoded, or imputed form suitable for learning algorithms.

Description

Raw data collected from real-world sources is rarely in a form that machine learning algorithms can consume directly. Feature transformation is the process of converting raw feature values into a representation that satisfies the assumptions and requirements of downstream estimators. There are three core categories of feature transformation:

  • Imputation (filling missing values): Most scikit-learn estimators do not accept NaN values. Imputation replaces missing entries with a computed statistic -- typically the column mean, median, or most frequent value. The choice of imputation strategy depends on the data distribution and the nature of the missingness. Mean imputation preserves the overall mean of the feature but can underestimate variance. Median imputation is more robust to outliers. The SimpleImputer class provides these strategies.
  • Scaling (z-score normalization): Many algorithms -- particularly those based on distance metrics (k-NN, SVM) or gradient descent (logistic regression, neural networks) -- assume that features are on a comparable scale. A feature with a range of [0, 1000000] will dominate one with a range of [0, 1] in any distance or gradient computation. Standardization (z-score scaling) transforms each feature to have zero mean and unit variance, placing all features on an equal footing. The StandardScaler class implements this.
  • Encoding (one-hot encoding): Categorical features stored as strings or integers cannot be fed directly into most numeric algorithms. One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category. This preserves the nominal nature of the data without imposing a false ordinal relationship. The OneHotEncoder class provides this transformation.

Each of these transformations follows the fit/transform pattern: the transformer learns parameters from the training data during fit (e.g., the mean and standard deviation for scaling) and applies those learned parameters during transform. This separation is critical for preventing data leakage -- statistics must be computed only on training data and then applied to both training and test data.
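The fit/transform separation can be made concrete with a small sketch (the data values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # learns mean=5, std=5 from training data only
X_train_t = scaler.transform(X_train)
X_test_t = scaler.transform(X_test)  # reuses the training statistics: 5.0 -> 0.0
```

Calling `fit` (or `fit_transform`) on the test set instead would leak test-set statistics into the preprocessing, which is exactly the failure mode the pattern is designed to prevent.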

Usage

Feature transformation is applied after column selection and before model fitting. In a typical preprocessing pipeline:

  1. Inspect the data to identify feature types and missing values
  2. Select columns by type using make_column_selector
  3. Apply appropriate transformations: imputation, scaling for numeric columns; imputation, encoding for categorical columns
  4. Combine the transformed columns using ColumnTransformer

Theoretical Basis

Z-score standardization:

The standard score (z-score) of a sample x is defined as:

z = (x - u) / s

where u is the mean of the training samples and s is the standard deviation. After transformation, the feature has mean 0 and standard deviation 1. This is computed independently for each feature.
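A quick numeric check of the formula (note that `StandardScaler` uses the population standard deviation, matching NumPy's default `std`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [4.0], [6.0]])   # u = 4, s = sqrt(8/3)
z = StandardScaler().fit_transform(x)
# manual z-score: (x - u) / s, computed per feature
z_manual = (x - x.mean()) / x.std()
```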

One-hot encoding:

Given a categorical feature with k distinct categories, one-hot encoding maps each value to a binary vector of length k, where exactly one element is 1 and all others are 0. Formally, for category c_i, the encoded vector e is:

e_j = 1 if j == i, else 0

This representation avoids imposing ordinal relationships between categories.

Mean/median imputation:

For a feature vector x with missing values, mean imputation replaces each missing entry with the arithmetic mean of the observed values:

x_missing = (1/n_observed) * sum(x_observed)

Median imputation replaces missing entries with the median of the observed values, which is more robust when the feature distribution is skewed or contains outliers.
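The robustness difference is easy to see on a small example with one outlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# observed values: [1, 2, 100]; the outlier 100 pulls the mean far from the bulk
x = np.array([[1.0], [2.0], [np.nan], [100.0]])
mean_imp = SimpleImputer(strategy="mean").fit_transform(x)     # fills with 103/3
median_imp = SimpleImputer(strategy="median").fit_transform(x)  # fills with 2.0
```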
