Implementation: scikit-learn StandardScaler __init__
Overview
Concrete tool, provided by scikit-learn, for standardizing features by removing the mean and scaling to unit variance.
Code Reference
Class: StandardScaler
Module: sklearn/preprocessing/_data.py (lines 740-884)
Inheritance: OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Constructor signature:
class StandardScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
def __init__(self, *, copy=True, with_mean=True, with_std=True):
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation (or one if with_std=False).
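The formula above can be checked directly: a minimal sketch comparing the transformer's output against a manual computation of z = (x - u) / s on a small toy array (the data values here are illustrative, not from the library's docs).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler().fit(X)
z_scaler = scaler.transform(X)

# Manual standardization: z = (x - u) / s, using the biased std (ddof=0),
# which is what StandardScaler computes internally.
u = X.mean(axis=0)
s = X.std(axis=0)
z_manual = (X - u) / s

assert np.allclose(z_scaler, z_manual)
```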
I/O Contract
Constructor parameters:
- copy : bool, default=True -- If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g., if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
- with_mean : bool, default=True -- If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix.
- with_std : bool, default=True -- If True, scale the data to unit variance (or equivalently, unit standard deviation).
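The with_mean restriction on sparse input can be demonstrated with a small sketch (toy data; the exact exception type is an assumption, so both TypeError and ValueError are caught):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X_sparse = sparse.csr_matrix([[0.0, 1.0], [0.0, 3.0], [2.0, 0.0]])

# Centering a sparse matrix would densify it, so with_mean=True raises.
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    raised = False
except (TypeError, ValueError):
    raised = True
assert raised

# with_mean=False scales to unit variance and keeps the matrix sparse.
Xt = StandardScaler(with_mean=False).fit_transform(X_sparse)
assert sparse.issparse(Xt)
```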
Fitted attributes:
- scale_ : ndarray of shape (n_features,) or None -- Per-feature relative scaling factor, generally np.sqrt(var_). Equal to None when with_std=False.
- mean_ : ndarray of shape (n_features,) or None -- The mean value for each feature in the training set. Equal to None when with_mean=False and with_std=False.
- var_ : ndarray of shape (n_features,) or None -- The variance for each feature in the training set.
- n_features_in_ : int -- Number of features seen during fit.
- feature_names_in_ : ndarray of shape (n_features_in_,) -- Names of features seen during fit. Defined only when X has feature names that are all strings.
- n_samples_seen_ : int or ndarray of shape (n_features,) -- The number of samples processed by the estimator for each feature.
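A short sketch inspecting these fitted attributes on the toy data used later in the usage examples:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
scaler = StandardScaler().fit(X)

assert np.allclose(scaler.mean_, [0.5, 0.5])
assert np.allclose(scaler.var_, [0.25, 0.25])
# scale_ is sqrt(var_) for features with nonzero variance.
assert np.allclose(scaler.scale_, np.sqrt(scaler.var_))
assert scaler.n_features_in_ == 2
assert scaler.n_samples_seen_ == 4  # int, since no values are missing
```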
Companion Transformers
StandardScaler is typically used alongside two other transformers in a preprocessing pipeline:
- SimpleImputer (sklearn.impute.SimpleImputer): Fills missing values before scaling. Common strategies include "mean", "median", and "most_frequent". In a numeric pipeline, SimpleImputer is placed before StandardScaler to ensure the scaler receives complete data.
- OneHotEncoder (sklearn.preprocessing.OneHotEncoder): Converts categorical features into binary indicator columns. While not applied to the same columns as StandardScaler, these two transformers are the canonical pair in a ColumnTransformer -- one handling the numeric branch and the other the categorical branch.
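The two-branch arrangement described above can be sketched with a ColumnTransformer; the DataFrame and its column names ("age", "color") are hypothetical and chosen only for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: "age" is numeric (with a missing value),
# "color" is categorical.
df = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "color": ["red", "blue", "red", "green"],
})

preprocess = ColumnTransformer([
    # Numeric branch: impute, then scale.
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
     ["age"]),
    # Categorical branch: one-hot encode.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

Xt = preprocess.fit_transform(df)
# One scaled numeric column plus three indicator columns (blue/green/red).
assert Xt.shape == (4, 4)
```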
Implementation Details
StandardScaler computes centering and scaling independently on each feature during fit. Mean and standard deviation are stored and reused during transform. Key implementation notes:
- NaNs are treated as missing values: they are disregarded in fit and maintained (passed through) in transform.
- The standard deviation uses a biased estimator, equivalent to numpy.std(x, ddof=0).
- If a feature has zero variance, the scaling factor is set to 1 (the data is left as-is).
- The scaler supports partial_fit for incremental learning on large datasets.
- Sparse CSR and CSC matrices are supported when with_mean=False.
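A sketch exercising three of the behaviors listed above -- the zero-variance fallback, NaN passthrough, and batched fitting via partial_fit -- on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Zero-variance feature: scale_ falls back to 1 for the constant column.
X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
scaler = StandardScaler().fit(X)
assert scaler.scale_[1] == 1.0

# NaNs are disregarded in fit and passed through in transform.
X_nan = np.array([[1.0], [np.nan], [3.0]])
s = StandardScaler().fit(X_nan)
assert np.isclose(s.mean_[0], 2.0)        # mean of [1, 3], NaN ignored
assert np.isnan(s.transform(X_nan)[1, 0])  # NaN preserved in the output

# partial_fit accumulates statistics incrementally over batches.
inc = StandardScaler()
inc.partial_fit(X[:2])
inc.partial_fit(X[2:])
assert np.allclose(inc.mean_, scaler.mean_)
```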
Usage Examples
from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_) # [0.5 0.5]
print(scaler.transform(data))
# [[-1. -1.]
# [-1. -1.]
# [ 1. 1.]
# [ 1. 1.]]
print(scaler.transform([[2, 2]]))
# [[3. 3.]]
Combined with SimpleImputer in a pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
numeric_pipeline = make_pipeline(
SimpleImputer(strategy="mean"),
StandardScaler()
)
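Fitting this pipeline on data with a missing value shows the handoff: the imputer fills the NaN with the column mean, and the scaler then standardizes the completed column (toy data for illustration).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numeric_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
)

# The NaN is imputed to 2.0 (mean of [1, 3]); the completed column
# [1, 2, 3] is then standardized with mean 2 and std sqrt(2/3).
X = np.array([[1.0], [np.nan], [3.0]])
Xt = numeric_pipeline.fit_transform(X)
assert np.allclose(Xt.ravel(), [-np.sqrt(1.5), 0.0, np.sqrt(1.5)])
```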