Principle:Online ml River Online Preprocessing

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Feature_Engineering
Last Updated: 2026-02-08 18:00 GMT

Overview

Online preprocessing transforms raw input features into representations suitable for machine learning models, and it does so incrementally as each observation arrives. Unlike batch preprocessing, where statistics (mean, variance, category sets) are computed over the full dataset, online preprocessors maintain running estimates that evolve with the stream.

Incremental estimation is essential in streaming environments, where the full feature distribution is unknown at the start and may change over time (concept drift, feature drift, or schema evolution).

Theoretical Basis

Online Scaling and Normalization

Many models assume features are on comparable scales. Online scalers maintain running statistics:

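  • Standard scaling: Centers each feature to zero mean and scales it to unit variance, using Welford's online algorithm to maintain the running mean and variance.
  • Min-max scaling: Tracks the running minimum and maximum, scaling to [0, 1]. Sensitive to outliers in the stream, since a single extreme value widens the tracked range for all later observations.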
  • Target scaling: Normalizes the target variable rather than features, useful when the target distribution shifts.
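
As an illustration of the standard-scaling case, the following is a minimal sketch of a per-feature scaler built on Welford's update. The class name `RunningStandardScaler` and its `learn_one` / `transform_one` methods are hypothetical names chosen for this example, and the use of the population variance (dividing by n) is an assumption; this is not presented as River's implementation.

```python
import math


class RunningStandardScaler:
    """Hypothetical per-feature standard scaler using Welford's online update."""

    def __init__(self):
        self.n = {}     # observations seen per feature
        self.mean = {}  # running mean per feature
        self.m2 = {}    # running sum of squared deviations per feature

    def learn_one(self, x):
        """Update the running statistics with one observation (a dict)."""
        for name, value in x.items():
            n = self.n.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            delta = value - mean
            mean += delta / n
            self.n[name] = n
            self.mean[name] = mean
            # Welford: multiply the old deviation by the new deviation.
            self.m2[name] = self.m2.get(name, 0.0) + delta * (value - mean)

    def transform_one(self, x):
        """Scale one observation with the statistics seen so far."""
        out = {}
        for name, value in x.items():
            n = self.n.get(name, 0)
            # Population variance (assumption of this sketch).
            var = self.m2.get(name, 0.0) / n if n > 0 else 0.0
            std = math.sqrt(var)
            # Return 0.0 while no spread has been observed, to avoid dividing by zero.
            out[name] = (value - self.mean.get(name, 0.0)) / std if std > 0 else 0.0
        return out
```

In a stream, each observation would typically be transformed with the current statistics and then passed to `learn_one`, so that the scaler never uses information from observations it has not yet seen.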

Categorical Encoding

Categorical features must be converted to numerical representations:

  • One-hot encoding: Creates a binary column per category. In the online setting, new categories can appear at any time, requiring dynamic expansion of the encoding.
  • Ordinal encoding: Assigns an integer to each category, maintaining a mapping that grows as new categories appear.
  • Feature hashing (hashing trick): Maps feature names to a fixed-size vector via a hash function. Avoids maintaining an explicit vocabulary but introduces collisions.
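
The three encodings above can be sketched in a few lines each. All names here (`GrowingOrdinalEncoder`, `one_hot`, `hash_features`, `n_buckets`) are hypothetical, and the choices of -1 for unseen categories and MD5 as the hash function are assumptions made for illustration only, not a description of any particular library.

```python
import hashlib


class GrowingOrdinalEncoder:
    """Hypothetical ordinal encoder whose mapping grows with the stream."""

    def __init__(self):
        self.mapping = {}

    def learn_one(self, category):
        # Assign the next unused integer the first time a category is seen.
        if category not in self.mapping:
            self.mapping[category] = len(self.mapping)

    def transform_one(self, category):
        # Unseen categories map to -1 (a convention chosen for this sketch).
        return self.mapping.get(category, -1)


def one_hot(x):
    """Sparse one-hot encoding: one key per (feature, category) pair.

    Because the output is a dict, a new category simply introduces a new key,
    which is the dynamic expansion described above.
    """
    return {f"{name}_{value}": 1 for name, value in x.items()}


def hash_features(x, n_buckets=16):
    """Hashing trick: add each feature value into one of n_buckets slots."""
    vector = [0.0] * n_buckets
    for name, value in x.items():
        # A stable hash, so the same name always lands in the same bucket.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += value  # colliding features are summed together
    return vector
```

The trade-off is visible in the code: the ordinal and one-hot encoders keep state or output that grows with the number of categories, while the hashed vector has a fixed length at the cost of summing features that collide.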

Dimensionality Reduction

  • Online LDA (Latent Dirichlet Allocation): Incrementally learns topic distributions from streaming text data.
  • Random projection: Projects high-dimensional data to a lower-dimensional space using a random matrix, approximately preserving pairwise distances with high probability (Johnson-Lindenstrauss lemma). The projection matrix is drawn once and then fixed, so no statistics need to be learned from the stream, making this naturally online.
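
A rough sketch of the random projection case follows, assuming a known, fixed input dimension and a dense Gaussian matrix scaled by 1/sqrt(n_out). The function names `make_projection` and `project` and the `seed` parameter are hypothetical and chosen only for this example.

```python
import math
import random


def make_projection(n_in, n_out, seed=0):
    """Draw a fixed n_out x n_in Gaussian matrix, scaled by 1/sqrt(n_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


def project(matrix, x):
    """Project one observation (a list of n_in floats) to n_out dimensions."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]
```

Since the matrix is drawn once and never updated, each observation can be projected independently as it arrives, with no learning step.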

Missing Value Imputation

Online imputers replace missing values using running statistics (mean, median, mode) that update with each non-missing observation.
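
As a minimal sketch of the running-mean case, the imputer below treats `None` as the missing-value marker and falls back to a caller-supplied default for features it has never observed. Both conventions, along with the class name `RunningMeanImputer`, are assumptions made for this example.

```python
class RunningMeanImputer:
    """Hypothetical imputer that fills None with the running mean per feature."""

    def __init__(self):
        self.count = {}  # non-missing observations seen per feature
        self.mean = {}   # running mean per feature

    def learn_one(self, x):
        for name, value in x.items():
            if value is None:
                continue  # missing values do not update the statistic
            n = self.count.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            self.count[name] = n
            self.mean[name] = mean + (value - mean) / n

    def transform_one(self, x, default=0.0):
        # Replace None with the running mean, or with default if none exists yet.
        return {
            name: self.mean.get(name, default) if value is None else value
            for name, value in x.items()
        }
```

A median or mode imputer would follow the same learn/transform pattern but needs a different running statistic, for example a quantile estimator or a category counter.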

Prediction Clipping

Prediction clipping constrains model predictions to a specified range, ensuring outputs remain within valid bounds (e.g., probabilities in [0, 1]).
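
In its simplest form, clipping is a single bounded min/max applied after the model predicts. The helper name `clip_prediction` and its default bounds are hypothetical and shown only to make the operation concrete.

```python
def clip_prediction(prediction, lower=0.0, upper=1.0):
    """Constrain a scalar prediction to the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(lower, min(upper, prediction))
```

For example, a regressor that outputs 1.3 for a probability-like target would be clipped to 1.0, while a value already inside the range passes through unchanged.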

Related Pages
