Principle: Online ML (River) - Online Logistic Regression
| Knowledge Sources | River, River Docs |
|---|---|
| Domains | Online Learning, Classification, Optimization |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Online logistic regression is a binary classification algorithm that learns a linear decision boundary by performing stochastic gradient descent on the log-loss function, updating weights one observation at a time.
Description
Logistic regression is one of the most widely used algorithms for binary classification. It models the probability of the positive class as the sigmoid (logistic) function of a linear combination of features. In the online setting, the model processes one observation at a time, computing the gradient of the loss function for that single sample and updating the weight vector accordingly. This makes it a stochastic gradient descent (SGD) approach to logistic regression.
River's implementation builds on a Generalized Linear Model (GLM) base class that handles the core SGD mechanics: computing the raw dot product, evaluating loss gradients, clipping gradients, and applying weight updates via a pluggable optimizer. The LogisticRegression class specializes this by using the log-loss (binary cross-entropy) as the loss function and the sigmoid function as the mean function to map raw scores to probabilities.
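The per-observation update loop described above can be sketched in plain Python. This is a minimal illustration, not River's actual GLM implementation; the method names `learn_one` and `predict_proba_one` only mirror River's conventions, and the class name `OnlineLogReg` is made up for this sketch:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogReg:
    """Minimal online logistic regression trained by SGD on the log-loss."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}   # feature name -> weight
        self.bias = 0.0

    def _raw(self, x):
        # Raw score: w . x + b over the features present in this observation.
        return sum(self.weights.get(f, 0.0) * v for f, v in x.items()) + self.bias

    def learn_one(self, x, y):
        # The gradient of the log-loss w.r.t. the raw score is (p - y).
        err = sigmoid(self._raw(x)) - y
        for f, v in x.items():
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * err * v
        self.bias -= self.lr * err

    def predict_proba_one(self, x):
        p = sigmoid(self._raw(x))
        return {False: 1.0 - p, True: p}
```

Processing one `(x, y)` pair at a time is what makes this an SGD approach: each observation contributes one noisy gradient step instead of being batched.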
The algorithm supports:
- L1 regularization: Encourages sparse weight vectors by penalizing the absolute value of weights (uses a cumulative penalty approach for online L1).
- L2 regularization: Encourages small weight vectors by penalizing the squared magnitude of weights.
- Pluggable optimizers: Any optimizer from River's optim module (SGD, Adam, AdaGrad, etc.) can be used.
- Gradient clipping: Prevents exploding gradients by clamping gradient values to a maximum absolute value.
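Gradient clipping is the simplest of these mechanisms to show. A one-line sketch of elementwise clamping (the parameter name `clip_max` is illustrative, not River's exact argument name):

```python
def clip_gradient(g, clip_max=1e12):
    # Clamp a single gradient value to the interval [-clip_max, clip_max]
    # so one extreme observation cannot blow up the weights.
    return max(-clip_max, min(clip_max, g))
```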
Usage
Use online logistic regression when:
- You need a binary classifier that can learn incrementally from streaming data.
- You want an interpretable model with a linear decision boundary.
- You need probabilistic predictions (class probabilities, not just labels).
- You want to combine it with feature scaling in a pipeline for best results.
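Scaling matters because SGD's effective step size interacts with feature magnitudes. A minimal online standardizer using Welford's algorithm, which could sit in front of the classifier the way River's StandardScaler does in a pipeline (this class is an illustrative sketch, not River's implementation):

```python
import math

class OnlineStandardScaler:
    """Per-feature running standardization via Welford's algorithm."""

    def __init__(self):
        self.counts = {}
        self.means = {}
        self.m2 = {}    # running sum of squared deviations per feature

    def learn_one(self, x):
        for f, v in x.items():
            n = self.counts.get(f, 0) + 1
            mean = self.means.get(f, 0.0)
            delta = v - mean
            mean += delta / n
            self.counts[f] = n
            self.means[f] = mean
            self.m2[f] = self.m2.get(f, 0.0) + delta * (v - mean)

    def transform_one(self, x):
        out = {}
        for f, v in x.items():
            n = self.counts.get(f, 0)
            var = self.m2.get(f, 0.0) / n if n > 0 else 0.0
            std = math.sqrt(var)
            out[f] = (v - self.means.get(f, 0.0)) / std if std > 0 else 0.0
        return out
```

In a streaming loop, each observation would be passed through `learn_one`/`transform_one` here before reaching the classifier's `learn_one`.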
Theoretical Basis
Model: The probability of the positive class given features is:
p(y=1 | x) = sigmoid(w . x + b)
where x is the feature vector, w is the weight vector, and b is the intercept (bias).
Loss function (log-loss / binary cross-entropy):
L(y, p) = -y * log(p) - (1 - y) * log(1 - p)
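The loss formula above translates directly to code; the only practical wrinkle is clipping p away from 0 and 1 so the logarithm stays finite (the `eps` value here is a common convention, not River's):

```python
import math

def log_loss(y, p, eps=1e-15):
    # Binary cross-entropy for a single observation with label y in {0, 1}
    # and predicted positive-class probability p.
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)
```

An uninformative prediction of p = 0.5 costs log(2) ≈ 0.693 regardless of the label.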
Gradient computation: For a single observation (x, y):
gradient_w = (sigmoid(w . x + b) - y) * x
gradient_b = sigmoid(w . x + b) - y
Weight update (SGD):
w_new = w_old - learning_rate * gradient_w
b_new = b_old - intercept_lr * gradient_b
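A worked numeric instance of one SGD step, following the four equations above with a single feature and both learning rates set to 0.1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Start from zero weights and take one step on the observation x = {"a": 1.0}, y = 1.
w = {"a": 0.0}
b = 0.0
x = {"a": 1.0}
y = 1
lr = 0.1

p = sigmoid(sum(w[f] * x[f] for f in x) + b)   # sigmoid(0) = 0.5
err = p - y                                    # -0.5
for f in x:
    w[f] -= lr * err * x[f]                    # 0 - 0.1 * (-0.5) * 1.0 = 0.05
b -= lr * err                                  # 0 - 0.1 * (-0.5)       = 0.05
```

Both the weight and the intercept move toward the positive class, as expected when the model under-predicts y = 1.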
With L2 regularization:
gradient_w = (sigmoid(w . x + b) - y) * x + l2 * w
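The L2 term is just an additive penalty on each weight's gradient. A per-feature helper (the function name is illustrative):

```python
def grad_weight_l2(p, y, x_j, w_j, l2):
    # Gradient of the log-loss w.r.t. weight j, plus the L2 penalty term l2 * w_j.
    # p is the predicted probability sigmoid(w . x + b).
    return (p - y) * x_j + l2 * w_j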
With L1 regularization (cumulative penalty):
The online L1 penalty uses a cumulative approach where a running maximum cumulative L1 penalty is maintained. After each weight update, the penalty is applied:
if w_j > 0:
    w_j = max(0, w_j - (max_cum_l1 + cum_l1_j))
elif w_j < 0:
    w_j = min(0, w_j + (max_cum_l1 - cum_l1_j))
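The clipping rule above can be written as a small function. In the full cumulative-penalty scheme the caller also tracks cum_l1_j (the penalty actually applied to weight j so far); this sketch shows only the per-weight clipping step:

```python
def apply_cumulative_l1(w_j, max_cum_l1, cum_l1_j):
    # Shrink weight j toward zero by the outstanding cumulative L1 penalty,
    # clipping at zero so the penalty never flips the weight's sign.
    if w_j > 0:
        return max(0.0, w_j - (max_cum_l1 + cum_l1_j))
    elif w_j < 0:
        return min(0.0, w_j + (max_cum_l1 - cum_l1_j))
    return w_j
```

Clipping at zero is what produces sparsity: small weights are driven exactly to 0 rather than oscillating around it.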
Prediction: The predict_proba_one method returns a dictionary mapping each class label to its predicted probability: {False: 1-p, True: p}. The predict_one method returns the class with the highest probability.
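The prediction step, sketched as two standalone functions that take the already-computed raw score w . x + b (the function names echo River's method names but are free functions here for illustration):

```python
import math

def predict_proba_one(raw_score):
    # Map the raw score to the {False: 1 - p, True: p} probability dictionary.
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return {False: 1.0 - p, True: p}

def predict_one(raw_score):
    # Return the class label with the highest predicted probability.
    proba = predict_proba_one(raw_score)
    return max(proba, key=proba.get)
```

A raw score of 0 sits exactly on the decision boundary, giving each class probability 0.5.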