Principle:Online ml River Online Preprocessing
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Feature_Engineering |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Online preprocessing transforms raw input features into representations suitable for machine learning models, and it does so incrementally as each observation arrives. Unlike batch preprocessing, where statistics (mean, variance, category sets) are computed over the full dataset, online preprocessors maintain running estimates that evolve with the stream.
This incremental approach is essential in streaming environments, where the full feature distribution is unknown at the start and may change over time (concept drift, feature drift, or schema evolution).
Theoretical Basis
Online Scaling and Normalization
Many models assume features are on comparable scales. Online scalers maintain running statistics:
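- Standard scaling: Centers each feature to zero mean and scales it to unit variance, using Welford's online algorithm to track the running mean and variance.
- Min-max scaling: Tracks the running minimum and maximum and rescales each value to [0, 1] relative to them. Sensitive to outliers in the stream, since a single extreme value widens the tracked range for every later observation.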
- Target scaling: Normalizes the target variable rather than features, useful when the target distribution shifts.
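As an illustration of the standard-scaling case, the following is a minimal sketch of Welford's update applied per feature. The class name `RunningStandardScaler` and its `learn_one` / `transform_one` methods are hypothetical names chosen for this example, and the code is not River's implementation. It assumes each observation is a dict of numeric features and uses the population variance.

```python
import math


class RunningStandardScaler:
    """Hypothetical per-feature standard scaler based on Welford's algorithm."""

    def __init__(self):
        self.count = {}  # observations seen per feature
        self.mean = {}   # running mean per feature
        self.m2 = {}     # running sum of squared deviations per feature

    def learn_one(self, x):
        """Update the running statistics with one observation (dict of floats)."""
        for name, value in x.items():
            n = self.count.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            delta = value - mean
            mean += delta / n
            # Welford: combine the deviation from the old and the new mean.
            self.m2[name] = self.m2.get(name, 0.0) + delta * (value - mean)
            self.count[name] = n
            self.mean[name] = mean

    def transform_one(self, x):
        """Scale one observation using the statistics seen so far."""
        scaled = {}
        for name, value in x.items():
            n = self.count.get(name, 0)
            # Assumption: population variance (divide by n, not n - 1).
            variance = self.m2.get(name, 0.0) / n if n > 0 else 0.0
            std = math.sqrt(variance)
            # Assumption: return 0.0 while the variance is still zero.
            scaled[name] = (value - self.mean.get(name, 0.0)) / std if std > 0 else 0.0
        return scaled


scaler = RunningStandardScaler()
for value in (1.0, 2.0, 3.0):
    scaler.learn_one({"x": value})
scaled = scaler.transform_one({"x": 3.0})
```

Only three numbers per feature are stored, so memory does not grow with the length of the stream. Early outputs are unreliable because the estimates are based on very few observations.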
Categorical Encoding
Categorical features must be converted to numerical representations:
- One-hot encoding: Creates a binary column per category. In the online setting, new categories can appear at any time, requiring dynamic expansion of the encoding.
- Ordinal encoding: Assigns an integer to each category, maintaining a mapping that grows as new categories appear.
- Feature hashing (hashing trick): Maps feature names to a fixed-size vector via a hash function. Avoids maintaining an explicit vocabulary but introduces collisions.
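The contrast between a growing vocabulary and a fixed-size hashed representation can be sketched as follows. Both `GrowingOneHotEncoder` and `hash_features` are hypothetical names for this example and do not reproduce River's encoders. The sketch assumes observations are dicts, that string values are categorical, and it uses MD5 only as a convenient deterministic hash.

```python
import hashlib


class GrowingOneHotEncoder:
    """Hypothetical one-hot encoder whose output widens as categories appear."""

    def __init__(self):
        self.seen = {}  # feature name -> set of categories observed so far

    def learn_one(self, x):
        for name, value in x.items():
            self.seen.setdefault(name, set()).add(value)

    def transform_one(self, x):
        encoded = {}
        for name, value in x.items():
            # One binary column per category seen so far for this feature.
            for category in self.seen.get(name, ()):
                encoded[f"{name}_{category}"] = 1 if category == value else 0
        return encoded


def hash_features(x, n_buckets=8):
    """Hypothetical hashing trick: map named features to a fixed-size vector."""
    vector = [0.0] * n_buckets
    for name, value in x.items():
        if isinstance(value, str):
            # Categorical value: hash "name=value" and add a weight of 1.
            key, weight = f"{name}={value}", 1.0
        else:
            # Numeric value: hash the name and add the value itself.
            key, weight = name, float(value)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += weight  # colliding features share a bucket
    return vector


encoder = GrowingOneHotEncoder()
encoder.learn_one({"color": "red"})
encoder.learn_one({"color": "blue"})  # a new category adds a new column
one_hot = encoder.transform_one({"color": "red"})
hashed = hash_features({"color": "red", "size": 2.5})
```

The one-hot output grows with the number of distinct categories, while the hashed vector always has `n_buckets` entries. The price of the fixed size is that unrelated features may land in the same bucket.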
Dimensionality Reduction
- Online LDA (Latent Dirichlet Allocation): Incrementally learns topic distributions from streaming text data.
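- Random projection: Projects high-dimensional data to a lower-dimensional space using a random matrix, approximately preserving pairwise distances with high probability (Johnson-Lindenstrauss lemma). The projection matrix is drawn once and then fixed, so each observation can be projected independently, which makes the method naturally online.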
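A rough sketch of a Gaussian random projection is shown below. The function names `make_projection` and `project` are hypothetical, and the sketch assumes dense inputs of a known, fixed length, which is a simplification for streams whose feature set can change.

```python
import math
import random


def make_projection(n_inputs, n_outputs, seed=0):
    """Draw a fixed Gaussian projection matrix (n_outputs rows, n_inputs columns)."""
    rng = random.Random(seed)
    # Scaling by 1/sqrt(n_outputs) keeps squared norms unchanged in expectation.
    scale = 1.0 / math.sqrt(n_outputs)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_inputs)]
        for _ in range(n_outputs)
    ]


def project(matrix, x):
    """Project one observation; no state is updated, so this is trivially online."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


matrix = make_projection(n_inputs=100, n_outputs=10, seed=42)
observation = [float(i) for i in range(100)]
reduced = project(matrix, observation)  # list of 10 floats
```

Because the matrix never changes after it is drawn, there is nothing to learn from the stream. The quality of the distance approximation depends on the output dimension, not on how many observations have been seen.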
Missing Value Imputation
Online imputers replace missing values using running statistics (mean, median, mode) that update with each non-missing observation.
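The running-mean variant can be sketched as follows. `RunningMeanImputer` is a hypothetical name, missing values are assumed to be represented by `None`, and the fallback of `0.0` before any value has been observed is an arbitrary choice for this example.

```python
class RunningMeanImputer:
    """Hypothetical imputer that fills missing values with a running mean."""

    def __init__(self, default=0.0):
        self.default = default  # assumption: used until a feature has been observed
        self.count = {}
        self.mean = {}

    def learn_one(self, x):
        for name, value in x.items():
            if value is None:
                continue  # missing values do not update the statistics
            n = self.count.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            self.count[name] = n
            self.mean[name] = mean + (value - mean) / n

    def transform_one(self, x):
        return {
            name: self.mean.get(name, self.default) if value is None else value
            for name, value in x.items()
        }


imputer = RunningMeanImputer()
for row in ({"temp": 10.0}, {"temp": None}, {"temp": 20.0}):
    imputer.learn_one(row)
filled = imputer.transform_one({"temp": None})  # {"temp": 15.0}
```

A running median or mode would need more state than a single number per feature, for example a quantile sketch or a category counter.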
Prediction Clipping
Clips model predictions to a specified range, ensuring outputs remain within valid bounds (e.g., probabilities in [0, 1]).
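In its simplest form, clipping is a bounded minimum and maximum applied to each prediction. The helper `clip_prediction` below is a hypothetical name and assumes scalar numeric predictions.

```python
def clip_prediction(y_pred, y_min=0.0, y_max=1.0):
    """Return y_pred limited to the closed interval [y_min, y_max]."""
    if y_min > y_max:
        raise ValueError("y_min must not exceed y_max")
    return max(y_min, min(y_max, y_pred))


clipped_high = clip_prediction(1.3)   # 1.0
clipped_low = clip_prediction(-0.2)   # 0.0
unchanged = clip_prediction(0.4)      # 0.4
```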
Related Pages
- Implementation:Online_ml_River_Preprocessing_FeatureHasher
- Implementation:Online_ml_River_Preprocessing_Imputers
- Implementation:Online_ml_River_Preprocessing_LDA
- Implementation:Online_ml_River_Preprocessing_OneHotEncoder
- Implementation:Online_ml_River_Preprocessing_OrdinalEncoder
- Implementation:Online_ml_River_Preprocessing_PredClipper
- Implementation:Online_ml_River_Preprocessing_RandomProjection
- Implementation:Online_ml_River_Preprocessing_TargetScalers