Principle: Online ML (River) - Online Logistic Regression
| Knowledge Sources | River, River Docs |
|---|---|
| Domains | Online Learning, Classification, Optimization |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Online logistic regression is a binary classification algorithm that learns a linear decision boundary by performing stochastic gradient descent on the log-loss function, updating weights one observation at a time.
Description
Logistic regression is one of the most widely used algorithms for binary classification. It models the probability of the positive class as the sigmoid (logistic) function of a linear combination of features. In the online setting, the model processes one observation at a time, computing the gradient of the loss function for that single sample and updating the weight vector accordingly. This makes it a stochastic gradient descent (SGD) approach to logistic regression.
River's implementation builds on a Generalized Linear Model (GLM) base class that handles the core SGD mechanics: computing the raw dot product, evaluating loss gradients, clipping gradients, and applying weight updates via a pluggable optimizer. The LogisticRegression class specializes this by using the log-loss (binary cross-entropy) as the loss function and the sigmoid function as the mean function to map raw scores to probabilities.
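The per-observation update loop described above can be sketched in plain Python. This is a minimal illustration, not River's actual GLM implementation; the method names `learn_one` and `predict_proba_one` only mirror River's conventions, and the class name `OnlineLogReg` is made up for this sketch:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogReg:
    """Minimal online logistic regression trained by SGD on the log-loss."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}   # feature name -> weight
        self.bias = 0.0

    def _raw(self, x):
        # Raw score: w . x + b over the features present in this observation.
        return sum(self.weights.get(f, 0.0) * v for f, v in x.items()) + self.bias

    def learn_one(self, x, y):
        # The gradient of the log-loss w.r.t. the raw score is (p - y).
        err = sigmoid(self._raw(x)) - y
        for f, v in x.items():
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * err * v
        self.bias -= self.lr * err

    def predict_proba_one(self, x):
        p = sigmoid(self._raw(x))
        return {False: 1.0 - p, True: p}
```

Processing one `(x, y)` pair at a time is what makes this an SGD approach: each observation contributes one noisy gradient step instead of being batched.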
The algorithm supports:
- L1 regularization: Encourages sparse weight vectors by penalizing the absolute value of weights (uses a cumulative penalty approach for online L1).
- L2 regularization: Encourages small weight vectors by penalizing the squared magnitude of weights.
- Pluggable optimizers: Any optimizer from River's optim module (SGD, Adam, AdaGrad, etc.) can be used.
- Gradient clipping: Prevents exploding gradients by clamping gradient values to a maximum absolute value.
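Gradient clipping is the simplest of these mechanisms to show. A one-line sketch of elementwise clamping (the parameter name `clip_max` is illustrative, not River's exact argument name):

```python
def clip_gradient(g, clip_max=1e12):
    # Clamp a single gradient value to the interval [-clip_max, clip_max]
    # so one extreme observation cannot blow up the weights.
    return max(-clip_max, min(clip_max, g))
```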
Usage
Use online logistic regression when:
- You need a binary classifier that can learn incrementally from streaming data.
- You want an interpretable model with a linear decision boundary.
- You need probabilistic predictions (class probabilities, not just labels).
- You want to combine it with feature scaling in a pipeline for best results.
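Scaling matters because SGD's effective step size interacts with feature magnitudes. A minimal online standardizer using Welford's algorithm, which could sit in front of the classifier the way River's StandardScaler does in a pipeline (this class is an illustrative sketch, not River's implementation):

```python
import math

class OnlineStandardScaler:
    """Per-feature running standardization via Welford's algorithm."""

    def __init__(self):
        self.counts = {}
        self.means = {}
        self.m2 = {}    # running sum of squared deviations per feature

    def learn_one(self, x):
        for f, v in x.items():
            n = self.counts.get(f, 0) + 1
            mean = self.means.get(f, 0.0)
            delta = v - mean
            mean += delta / n
            self.counts[f] = n
            self.means[f] = mean
            self.m2[f] = self.m2.get(f, 0.0) + delta * (v - mean)

    def transform_one(self, x):
        out = {}
        for f, v in x.items():
            n = self.counts.get(f, 0)
            var = self.m2.get(f, 0.0) / n if n > 0 else 0.0
            std = math.sqrt(var)
            out[f] = (v - self.means.get(f, 0.0)) / std if std > 0 else 0.0
        return out
```

In a streaming loop, each observation would be passed through `learn_one`/`transform_one` here before reaching the classifier's `learn_one`.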
Theoretical Basis
Model: The probability of the positive class given features is:
p(y=1 | x) = sigmoid(w . x + b)
where x is the feature vector, w is the weight vector, and b is the intercept (bias).
Loss function (log-loss / binary cross-entropy):
L(y, p) = -y * log(p) - (1 - y) * log(1 - p)
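The loss formula above translates directly to code; the only practical wrinkle is clipping p away from 0 and 1 so the logarithm stays finite (the `eps` value here is a common convention, not River's):

```python
import math

def log_loss(y, p, eps=1e-15):
    # Binary cross-entropy for a single observation with label y in {0, 1}
    # and predicted positive-class probability p.
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)
```

An uninformative prediction of p = 0.5 costs log(2) ≈ 0.693 regardless of the label.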
Gradient computation: For a single observation (x, y):
gradient_w = (sigmoid(w . x + b) - y) * x
gradient_b = sigmoid(w . x + b) - y
Weight update (SGD):
w_new = w_old - learning_rate * gradient_w
b_new = b_old - intercept_lr * gradient_b
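A worked numeric instance of one SGD step, following the four equations above with a single feature and both learning rates set to 0.1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Start from zero weights and take one step on the observation x = {"a": 1.0}, y = 1.
w = {"a": 0.0}
b = 0.0
x = {"a": 1.0}
y = 1
lr = 0.1

p = sigmoid(sum(w[f] * x[f] for f in x) + b)   # sigmoid(0) = 0.5
err = p - y                                    # -0.5
for f in x:
    w[f] -= lr * err * x[f]                    # 0 - 0.1 * (-0.5) * 1.0 = 0.05
b -= lr * err                                  # 0 - 0.1 * (-0.5)       = 0.05
```

Both the weight and the intercept move toward the positive class, as expected when the model under-predicts y = 1.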
With L2 regularization:
gradient_w = (sigmoid(w . x + b) - y) * x + l2 * w
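The L2 term is just an additive penalty on each weight's gradient. A per-feature helper (the function name is illustrative):

```python
def grad_weight_l2(p, y, x_j, w_j, l2):
    # Gradient of the log-loss w.r.t. weight j, plus the L2 penalty term l2 * w_j.
    # p is the predicted probability sigmoid(w . x + b).
    return (p - y) * x_j + l2 * w_j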
With L1 regularization (cumulative penalty):
The online L1 penalty uses a cumulative approach where a running maximum cumulative L1 penalty is maintained. After each weight update, the penalty is applied:
if w_j > 0:
    w_j = max(0, w_j - (max_cum_l1 + cum_l1_j))
elif w_j < 0:
    w_j = min(0, w_j + (max_cum_l1 - cum_l1_j))
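The clipping rule above can be written as a small function. In the full cumulative-penalty scheme the caller also tracks cum_l1_j (the penalty actually applied to weight j so far); this sketch shows only the per-weight clipping step:

```python
def apply_cumulative_l1(w_j, max_cum_l1, cum_l1_j):
    # Shrink weight j toward zero by the outstanding cumulative L1 penalty,
    # clipping at zero so the penalty never flips the weight's sign.
    if w_j > 0:
        return max(0.0, w_j - (max_cum_l1 + cum_l1_j))
    elif w_j < 0:
        return min(0.0, w_j + (max_cum_l1 - cum_l1_j))
    return w_j
```

Clipping at zero is what produces sparsity: small weights are driven exactly to 0 rather than oscillating around it.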
Prediction: The predict_proba_one method returns a dictionary mapping each class label to its predicted probability: {False: 1-p, True: p}. The predict_one method returns the class with the highest probability.
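The prediction step, sketched as two standalone functions that take the already-computed raw score w . x + b (the function names echo River's method names but are free functions here for illustration):

```python
import math

def predict_proba_one(raw_score):
    # Map the raw score to the {False: 1 - p, True: p} probability dictionary.
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return {False: 1.0 - p, True: p}

def predict_one(raw_score):
    # Return the class label with the highest predicted probability.
    proba = predict_proba_one(raw_score)
    return max(proba, key=proba.get)
```

A raw score of 0 sits exactly on the decision boundary, giving each class probability 0.5.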