Heuristic: Scikit-learn Feature Scaling Numerical Stability
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Numerical_Stability |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Numerical stability techniques used internally by scikit-learn: StandardScaler's handling of near-constant features, two-pass centering for precision, and proper penalty scaling in LogisticRegression solvers.
Description
StandardScaler and related preprocessing transformers handle several numerical edge cases internally. Near-constant features (where scale approaches machine epsilon) are automatically set to scale=1.0 to avoid division by near-zero values. When centering data, a two-pass algorithm is used to correct floating-point precision errors in the mean. Additionally, LogisticRegression solvers must properly scale the penalty term with `n_samples` because the loss function uses a sum (not mean) of pointwise losses.
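The near-constant guard can be sketched in pure NumPy. This mirrors the behavior described above rather than calling scikit-learn directly; `safe_standardize` is an illustrative name, not a library function:

```python
import numpy as np

def safe_standardize(X):
    """Standardize columns, guarding near-constant features (NumPy sketch)."""
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    # Near-zero scales are replaced with 1.0 so the feature passes through
    # centered but not amplified (mirrors the guard described above).
    eps = np.finfo(X.dtype).eps
    scale = np.where(scale < 10 * eps, 1.0, scale)
    return (X - mean) / scale

X = np.array([[1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])  # first column is constant
Xs = safe_standardize(X)    # first column becomes all zeros, not NaN/inf
```

Without the `np.where` guard, the constant first column would be divided by a scale of 0.0, producing NaNs.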
Usage
Apply this heuristic when encountering UserWarnings about numerical issues during StandardScaler fitting, or when LogisticRegression produces unexpected results with different sample sizes. Relevant to StandardScaler_Init, LogisticRegression_Fit, and Pipeline_Fit_Predict.
The Insight (Rule of Thumb)
- Action: Always scale features before using gradient-based solvers (lbfgs, sag, saga). StandardScaler handles edge cases automatically.
- Value: The near-constant threshold is `10 * eps`, where eps is the machine epsilon for the dtype (eps ≈ 2.2e-16 for float64, giving a threshold of ≈ 2.2e-15).
- Trade-off: StandardScaler replaces near-zero scales with 1.0 (feature unchanged), which preserves constant features rather than amplifying noise.
- Solver note: The regularization strength `C` in LogisticRegression does not scale the way many users expect: the objective is `C * sum(loss) + penalty`, not `mean(loss) + 1/C * penalty`, so the effective regularization strength depends on `n_samples`.
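A quick numeric illustration of the solver note, using hypothetical per-sample loss values rather than a real fit: because the objective is `C * sum(loss) + penalty`, the penalty's relative weight shrinks as the dataset grows unless `C` is rescaled.

```python
import numpy as np

# Hypothetical identical per-sample losses at two dataset sizes
loss_small = np.full(100, 0.3)
loss_big = np.full(10_000, 0.3)
C, penalty = 1.0, 5.0

# Ratio of penalty to data term: drops 100x with 100x more samples
r_small = penalty / (C * loss_small.sum())
r_big = penalty / (C * loss_big.sum())

# Dividing C by the sample-size factor restores the original balance
r_big_rescaled = penalty / ((C / 100) * loss_big.sum())
```

This is why the same `C` can give noticeably different amounts of regularization on subsamples versus the full dataset.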
Reasoning
Without the near-constant feature guard, dividing by a very small scale value would amplify floating-point noise into large feature values, causing optimizer instability. The two-pass centering corrects for the loss of precision that occurs when subtracting two nearly equal floating-point numbers (catastrophic cancellation). The penalty scaling note is critical: users often expect `C` to be sample-size invariant, but because sklearn uses `sum` rather than `mean`, the effective regularization strength changes with dataset size.
Code Evidence
Near-constant feature handling from `sklearn/preprocessing/_data.py:99-131`:
```python
# Features with scale close to machine epsilon are set to 1.0
constant_mask = scale < 10 * xp.finfo(scale.dtype).eps
```
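The same mask can be reproduced with plain NumPy (`xp` in the sklearn source is the array-API namespace; here it is simply `np`):

```python
import numpy as np

scale = np.array([1.0, 3e-16, 0.0, 2.5])
eps = np.finfo(scale.dtype).eps            # ≈ 2.2e-16 for float64
constant_mask = scale < 10 * eps           # [False, True, True, False]
scale = np.where(constant_mask, 1.0, scale)  # [1.0, 1.0, 1.0, 2.5]
```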
Two-pass centering from `sklearn/preprocessing/_data.py:269-295`:
```python
# If mean centering has precision issues, subtract mean again
# after initial centering to correct floating-point errors
```
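A minimal demonstration of why the second pass helps, using a large offset to force catastrophic cancellation (a NumPy sketch, not sklearn's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Offset is huge relative to the spread, so X - mean(X) subtracts two
# nearly equal floats and loses precision in the low-order bits.
X = rng.normal(loc=1e8, scale=1.0, size=(10_000, 1))

Xc = X - X.mean(axis=0)      # first pass: mean of Xc is not exactly 0
residual = Xc.mean(axis=0)   # leftover error from lost precision
Xc -= residual               # second pass subtracts the residual mean
```

After the second pass, the column mean of `Xc` is zero to near machine precision.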
Penalty scaling from `sklearn/linear_model/_logistic.py:309-320`:
```python
# All solvers relying on LinearModelLoss need to scale penalty
# with n_samples because the objective is:
#     C * sum(pointwise_loss) + penalty
# NOT:
#     mean(pointwise_loss) + 1/C * penalty
sw_sum = n_samples  # if sample_weight is None
```
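The two forms in the comment differ only by a positive factor of `C * sw_sum`, which is exactly why the penalty must be rescaled with the sample count. A small algebraic check (function names are illustrative, not sklearn internals):

```python
import numpy as np

def objective_sum_form(C, losses, penalty_val):
    # sklearn-style form: C * sum(loss) + penalty
    return C * np.sum(losses) + penalty_val

def objective_mean_form(C, losses, penalty_val, sw_sum):
    # mean form: mean(loss) + penalty / (C * sw_sum)
    return np.mean(losses) + penalty_val / (C * sw_sum)

losses = np.array([0.2, 0.5, 0.1])  # hypothetical pointwise losses
C, pen = 0.7, 3.0
sw_sum = len(losses)

# Multiplying the mean form by C * sw_sum recovers the sum form exactly,
# so both have the same minimizer -- provided the penalty is scaled.
lhs = objective_sum_form(C, losses, pen)
rhs = (C * sw_sum) * objective_mean_form(C, losses, pen, sw_sum)
```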
QuantileTransformer auto-adjustment from `sklearn/preprocessing/_data.py:2884-2888`:
```python
# If n_quantiles > n_samples, automatically set n_quantiles = n_samples
# and issue a warning informing user of this adjustment
```
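The adjustment amounts to a clamp-and-warn, which can be sketched as follows (`resolve_n_quantiles` is an illustrative name, not the sklearn function):

```python
import warnings

def resolve_n_quantiles(n_quantiles, n_samples):
    """Sketch of the adjustment: clamp n_quantiles to n_samples, warn once."""
    if n_quantiles > n_samples:
        warnings.warn(
            f"n_quantiles ({n_quantiles}) is greater than n_samples "
            f"({n_samples}); n_quantiles is set to n_samples."
        )
        return n_samples
    return n_quantiles
```

More quantiles than samples cannot add resolution, since the empirical quantile function is determined by at most `n_samples` distinct points.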