Principle:Rapidsai Cuml Linear Model Fitting

Knowledge Sources	Friedman et al. 2010 - Regularization Paths for Generalized Linear Models via Coordinate Descent; Byrd et al. 1995 - A Limited Memory Algorithm for Bound Constrained Optimization (L-BFGS-B); Bottou 2010 - Large-Scale Machine Learning with Stochastic Gradient Descent
Domains	Machine_Learning, Linear_Models, Optimization
Last Updated	2026-02-08 12:00 GMT

Overview

Linear model fitting is the process of estimating the weights of a linear model by solving a regularized optimization problem -- an L1, L2, or ElasticNet penalty combined with squared-error or logistic loss -- using an iterative solver such as coordinate descent, a quasi-Newton method, or stochastic gradient descent.

Description

Linear models form the backbone of many supervised learning tasks. They express the predicted output as a linear combination of input features, optionally transformed through a link function (e.g., the logistic sigmoid for classification). The model parameters (weights) are estimated by minimizing a loss function, typically augmented with a regularization term to prevent overfitting.

Loss Functions:

Squared error loss (used by Lasso, ElasticNet, Ridge, SGD regressors): Measures the mean squared difference between predicted and actual values. Suitable for regression tasks.
Logistic loss (used by Logistic Regression, SGD classifiers): The negative log-likelihood of the Bernoulli model, suitable for binary classification; multiclass problems use its multinomial (softmax) generalization or a one-vs-rest reduction. The logistic function maps the linear predictor to a probability in (0, 1).

Regularization Penalties:

L2 (Ridge): Adds the squared L2 norm of the weight vector to the loss. Shrinks all coefficients toward zero but does not produce exact zeros. Controlled by a regularization strength parameter.
L1 (Lasso): Adds the L1 norm of the weight vector. Encourages sparsity by driving some coefficients exactly to zero, effectively performing feature selection.
ElasticNet: A convex combination of L1 and L2 penalties, controlled by a mixing parameter (l1_ratio). Balances sparsity induction with coefficient stability.
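The three penalties above can be compared numerically with a small sketch. The function name `elastic_net_penalty` is hypothetical, and the scaling (a factor of 1/2 on the squared L2 norm) is an assumption chosen to match the objective given under Theoretical Basis; other libraries may scale the terms differently.

```python
def elastic_net_penalty(w, alpha, l1_ratio):
    """Return alpha * [(1 - l1_ratio) / 2 * ||w||_2^2 + l1_ratio * ||w||_1].

    Hypothetical helper for illustration; the 1/2 factor on the L2 term
    is an assumed convention.
    """
    l1 = sum(abs(wj) for wj in w)
    l2_sq = sum(wj * wj for wj in w)
    return alpha * (0.5 * (1.0 - l1_ratio) * l2_sq + l1_ratio * l1)


w = [3.0, -4.0, 0.0]
ridge_term = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0)  # pure L2: 0.5 * 25
lasso_term = elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0)  # pure L1: 3 + 4
mixed_term = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5)  # halfway mix
```

Setting `l1_ratio` to 0 or 1 recovers the Ridge and Lasso penalties, so a single function covers all three cases.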

Solvers:

Coordinate Descent (CD): Optimizes one coefficient at a time while holding others fixed. Especially efficient for L1-penalized problems because the soft-thresholding update has a closed-form solution. Iterates until convergence.
Quasi-Newton (L-BFGS / OWL-QN): Builds a low-memory approximation of second-order curvature from recent gradients, which typically converges in far fewer iterations than plain gradient descent. L-BFGS is well-suited for smooth (L2-penalized) problems; the OWL-QN variant handles the non-smooth L1 penalty.
Stochastic Gradient Descent (SGD): Updates weights using the gradient computed on a single mini-batch of data at each iteration. Scales well to very large datasets because each iteration touches only a small fraction of the data.

Usage

Linear model fitting is the right choice when:

The relationship between features and the target is approximately linear (or can be made so with feature engineering).
Interpretability of coefficients is important, as each weight directly quantifies the marginal effect of the corresponding feature.
Feature selection is desired (use L1 or ElasticNet regularization).
The dataset is large enough that scalability matters, in which case SGD or GPU-accelerated solvers can provide substantial speedups.
A baseline model is needed before exploring more complex nonlinear methods.

For classification tasks, Logistic Regression with L2 or ElasticNet regularization is a strong default. For regression tasks with many correlated features, ElasticNet combines the sparsity of Lasso with the stability of Ridge.

Theoretical Basis

The general regularized linear model objective is:

$\min_{w} \frac{1}{n} \sum_{i = 1}^{n} ℒ (y_{i}, w^{T} x_{i}) + α [\frac{1 - ρ}{2} ‖ w ‖_{2}^{2} + ρ ‖ w ‖_{1}]$

where $ℒ$ is the loss function, $α$ is the regularization strength, and $ρ \in [0, 1]$ is the L1 ratio (l1_ratio): $ρ = 1$ gives Lasso, $ρ = 0$ gives Ridge, and values in between give ElasticNet.

Logistic loss:

$ℒ (y, \hat{y}) = - [y \log (σ (\hat{y})) + (1 - y) \log (1 - σ (\hat{y}))]$

where $σ (z) = \frac{1}{1 + e^{- z}}$ is the logistic sigmoid.

Squared error loss:

$ℒ (y, \hat{y}) = \frac{1}{2} (y - \hat{y})^{2}$
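The two loss formulas can be written directly in code. The sketch below uses only the Python standard library, and the function names are hypothetical. The logistic loss uses the algebraically equivalent form log(1 + exp(ŷ)) - y·ŷ, rearranged so that large predictor values do not overflow.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_loss(y, y_hat):
    """Negative Bernoulli log-likelihood for a label y in {0, 1}.

    y_hat is the linear predictor w^T x (before the sigmoid).
    Equivalent to -[y*log(sigmoid(y_hat)) + (1-y)*log(1-sigmoid(y_hat))].
    """
    return max(y_hat, 0.0) - y * y_hat + math.log1p(math.exp(-abs(y_hat)))


def squared_error_loss(y, y_hat):
    """Half the squared difference between target and prediction."""
    return 0.5 * (y - y_hat) ** 2
```

At ŷ = 0 the model is maximally uncertain (σ(0) = 0.5), so the logistic loss equals log 2 for either label.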

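Coordinate Descent Update (ElasticNet, squared error loss):

For each feature j:
    partial_residual = y - X * w + X_j * w_j
    z_j = X_j^T * partial_residual / n
    w_j = soft_threshold(z_j, alpha * l1_ratio) / (1 + alpha * (1 - l1_ratio))

where soft_threshold(z, gamma) = sign(z) * max(|z| - gamma, 0), alpha and l1_ratio correspond to $α$ and $ρ$ in the objective, and the denominator assumes each feature is standardized so that X_j^T * X_j / n = 1 (in general the leading 1 is replaced by X_j^T * X_j / n).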
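The update above can be turned into a small runnable sketch. This is a pure-Python illustration on lists, not the GPU implementation. The function names, the stopping rule, and the default `n_iter` and `tol` values are assumptions. The denominator uses the mean squared norm of each column, which equals 1 for standardized features and then reduces to the form shown above.

```python
def soft_threshold(z, gamma):
    """sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=100, tol=1e-8):
    """Cyclic coordinate descent for squared-error ElasticNet (illustrative).

    X is a list of n rows with p features each; y is a list of n targets.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - X w, since w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # all-zero column: its coefficient stays at 0
            # Correlation of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            z_j = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            w_new = soft_threshold(z_j, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta  # keep residual in sync with w
                w[j] = w_new
            max_change = max(max_change, abs(delta))
        if max_change < tol:  # assumed stopping rule: largest coefficient change
            break
    return w


# Orthogonal, standardized toy design with y = 2*x1 + 0.1*x2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]
w_lasso = elastic_net_cd(X, y, alpha=0.5, l1_ratio=1.0)
w_enet = elastic_net_cd(X, y, alpha=0.5, l1_ratio=0.5)
```

On this toy design the weak second feature is thresholded to exactly zero in both fits, while the strong first coefficient is only shrunk. This is the sparsity behaviour the L1 term is meant to produce.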

SGD Update:

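For each mini-batch B:
    g = (1/|B|) * sum_{i in B} grad_L(y_i, w^T x_i) * x_i
    g += alpha * ((1 - l1_ratio) * w + l1_ratio * sign(w))
    w = w - eta * g

where grad_L is the derivative of the loss with respect to the linear predictor, sign(w) is applied elementwise as a subgradient of the L1 norm, and the learning rate eta is decayed according to a learning rate schedule.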
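A minimal runnable sketch of the mini-batch update for the squared error loss follows. The function name, the inverse-time decay schedule, and the default hyperparameters are assumptions made for illustration. The source says only that eta follows a learning rate schedule.

```python
import random


def sgd_elastic_net(X, y, alpha, l1_ratio, eta0=0.1, n_epochs=200,
                    batch_size=2, seed=0):
    """Mini-batch SGD for squared-error loss with an ElasticNet penalty."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    w = [0.0] * p
    indices = list(range(n))
    step = 0
    for _ in range(n_epochs):
        rng.shuffle(indices)  # new random batches each epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            g = [0.0] * p
            for i in batch:
                # Derivative of 0.5 * (y - y_hat)^2 with respect to y_hat.
                err = sum(w[j] * X[i][j] for j in range(p)) - y[i]
                for j in range(p):
                    g[j] += err * X[i][j] / len(batch)
            for j in range(p):
                sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
                g[j] += alpha * ((1.0 - l1_ratio) * w[j] + l1_ratio * sign)
            step += 1
            eta = eta0 / (1.0 + 0.01 * step)  # assumed inverse-time decay
            for j in range(p):
                w[j] -= eta * g[j]
    return w


# Toy data with y = 2*x1 + 0.1*x2 and no noise.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]
w_plain = sgd_elastic_net(X, y, alpha=0.0, l1_ratio=0.5)
w_pen = sgd_elastic_net(X, y, alpha=0.5, l1_ratio=0.5)
```

Unlike the soft-thresholding step in coordinate descent, the sign-based subgradient step generally leaves small coefficients hovering near zero rather than exactly at zero. Getting exact sparsity from SGD usually needs an extra truncation or proximal step, which this sketch omits.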

Related Pages

Implemented By
