Principle:Rapidsai Cuml Linear Model Fitting

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Linear_Models, Optimization
Last Updated: 2026-02-08 12:00 GMT

Overview

Linear model fitting is the process of solving regularized linear optimization problems -- including L1, L2, and ElasticNet penalties combined with squared-error or logistic loss -- using iterative solvers such as coordinate descent, quasi-Newton methods, or stochastic gradient descent.

Description

Linear models form the backbone of many supervised learning tasks. They express the predicted output as a linear combination of input features, optionally transformed through a link function (e.g., the logistic sigmoid for classification). The model parameters (weights) are estimated by minimizing a loss function, typically augmented with a regularization term to prevent overfitting.

Loss Functions:

  • Squared error loss (used by Lasso, ElasticNet, Ridge, SGD regressors): Measures the mean squared difference between predicted and actual values. Suitable for regression tasks.
  • Logistic loss (used by Logistic Regression, SGD classifiers): The negative log-likelihood of the Bernoulli model, suitable for binary classification; multiclass problems use its multinomial (softmax) generalization. The logistic function maps the linear predictor to a probability in (0, 1).

Regularization Penalties:

  • L2 (Ridge): Adds the squared L2 norm of the weight vector to the loss. Shrinks all coefficients toward zero but does not produce exact zeros. Controlled by a regularization strength parameter.
  • L1 (Lasso): Adds the L1 norm of the weight vector. Encourages sparsity by driving some coefficients exactly to zero, effectively performing feature selection.
  • ElasticNet: A convex combination of L1 and L2 penalties, controlled by a mixing parameter (l1_ratio). Balances sparsity induction with coefficient stability.
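The three penalties differ only in how the L1 and L2 norms are weighted. A minimal sketch in plain Python, assuming the convention of the objective under Theoretical Basis (a factor of 1/2 on the squared L2 term). The function name elastic_net_penalty is illustrative and not part of any library API:

```python
def elastic_net_penalty(w, alpha, l1_ratio):
    """Return alpha * [(1 - l1_ratio)/2 * ||w||_2^2 + l1_ratio * ||w||_1]."""
    l1 = sum(abs(wj) for wj in w)        # L1 norm
    l2_sq = sum(wj * wj for wj in w)     # squared L2 norm
    return alpha * (0.5 * (1.0 - l1_ratio) * l2_sq + l1_ratio * l1)


w = [0.5, -2.0, 0.0]
lasso_term = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # pure L1
ridge_term = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # pure L2
mixed_term = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)  # ElasticNet
```

Setting l1_ratio to 1 or 0 recovers the Lasso and Ridge penalties respectively, so a single function covers all three cases.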

Solvers:

  • Coordinate Descent (CD): Optimizes one coefficient at a time while holding the others fixed. Especially efficient for L1-penalized squared-error problems because each one-dimensional subproblem has a closed-form solution given by soft-thresholding. Sweeps over the coefficients are repeated until convergence.
  • Quasi-Newton (L-BFGS / OWL-QN): Uses approximate second-order curvature information, built from recent gradient differences, to converge in far fewer iterations than plain gradient descent. L-BFGS is well-suited for smooth (L2-penalized) problems; the OWL-QN variant handles the non-smooth L1 penalty.
  • Stochastic Gradient Descent (SGD): Updates weights using the gradient computed on a single mini-batch of data at each iteration. Scales well to very large datasets because each iteration touches only a small fraction of the data.

Usage

Linear model fitting is the right choice when:

  • The relationship between features and the target is approximately linear (or can be made so with feature engineering).
  • Interpretability of coefficients is important, as each weight quantifies the marginal effect of the corresponding feature with the other features held fixed.
  • Feature selection is desired (use L1 or ElasticNet regularization).
  • The dataset is large enough that scalability matters, in which case SGD or GPU-accelerated solvers provide significant speedup.
  • A baseline model is needed before exploring more complex nonlinear methods.

For classification tasks, Logistic Regression with L2 or ElasticNet regularization is a strong default. For regression tasks with many correlated features, ElasticNet combines the sparsity of Lasso with the stability of Ridge.

Theoretical Basis

The general regularized linear model objective is:

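$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^T x_i) + \alpha \left[ \frac{1 - \rho}{2} \|w\|_2^2 + \rho \|w\|_1 \right]$$

where $\ell$ is the loss function, $\alpha$ is the regularization strength, and $\rho \in [0, 1]$ is the L1 ratio ($\rho = 1$ gives Lasso, $\rho = 0$ gives Ridge, values in between give ElasticNet).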

Logistic loss:

$$\ell(y, \hat{y}) = -\left[ y \log(\sigma(\hat{y})) + (1 - y) \log(1 - \sigma(\hat{y})) \right]$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic sigmoid.
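Evaluating the logistic loss literally can overflow or return log(0) for large scores. The sketch below is one common way to compute it stably in plain Python, assuming labels in {0, 1} and a raw (pre-sigmoid) score. The names sigmoid and logistic_loss are illustrative:

```python
import math


def sigmoid(z):
    """Logistic sigmoid, branching on the sign of z so exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_loss(y, y_hat):
    """Negative Bernoulli log-likelihood for a label y in {0, 1} and raw score y_hat.

    Algebraically equal to -[y*log(sigmoid(y_hat)) + (1-y)*log(1-sigmoid(y_hat))],
    rewritten as max(y_hat, 0) - y*y_hat + log(1 + exp(-|y_hat|)).
    """
    return max(y_hat, 0.0) - y * y_hat + math.log1p(math.exp(-abs(y_hat)))


loss_confident_right = logistic_loss(1, 4.0)   # small loss
loss_confident_wrong = logistic_loss(0, 4.0)   # large loss
```

The rewritten form only ever exponentiates a non-positive number, which is why it stays finite for scores of any magnitude.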

Squared error loss:

$$\ell(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2$$
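Gradient-based solvers need the derivative of each loss with respect to the linear predictor. As a short worked step (standard calculus, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$), both losses give a simple "prediction minus target" form:

```latex
\frac{\partial}{\partial \hat{y}} \, \tfrac{1}{2} (y - \hat{y})^2 = \hat{y} - y,
\qquad
\frac{\partial}{\partial \hat{y}} \Big( -\big[ y \log \sigma(\hat{y}) + (1 - y) \log(1 - \sigma(\hat{y})) \big] \Big) = \sigma(\hat{y}) - y
```

By the chain rule with $\hat{y} = w^T x_i$, the gradient with respect to $w$ is this scalar times $x_i$, which is the grad_L(y_i, w^T x_i) * x_i term in the SGD update.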

Coordinate Descent Update (ElasticNet):

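For each feature j (squared-error loss):
    partial_residual = y - X * w + X_j * w_j
    rho_j = X_j^T * partial_residual / n
    w_j = soft_threshold(rho_j, alpha * l1_ratio) / (X_j^T * X_j / n + alpha * (1 - l1_ratio))

where soft_threshold(z, gamma) = sign(z) * max(|z| - gamma, 0). When the columns are standardized so that X_j^T * X_j / n = 1, the denominator reduces to 1 + alpha * (1 - l1_ratio). Here rho_j is the correlation of feature j with the partial residual, not the L1 ratio ρ.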
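The update above can be turned into a small runnable sketch. This is an illustrative plain-Python version, not the cuML implementation: it assumes squared-error loss, no intercept, and unscaled columns (so the denominator uses X_j^T X_j / n rather than 1), and the names elastic_net_cd, n_sweeps, and tol are hypothetical.

```python
def soft_threshold(z, gamma):
    """sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, alpha, l1_ratio, n_sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for squared-error loss with an ElasticNet penalty.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - X w, since w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot affect the fit
            # Correlation of feature j with the partial residual
            rho_j = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho_j, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio))
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta  # keep residual in sync with w
                w[j] = new_wj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w


# Tiny design with orthogonal columns; targets generated from w = [2.0, 0.1]
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * r[0] + 0.1 * r[1] for r in X]
w_ols = elastic_net_cd(X, y, alpha=0.0, l1_ratio=0.5)
w_lasso = elastic_net_cd(X, y, alpha=0.5, l1_ratio=1.0)
w_enet = elastic_net_cd(X, y, alpha=0.5, l1_ratio=0.5)
```

In this toy problem the weak second coefficient is set exactly to zero once the L1 threshold exceeds its correlation with the residual, while the strong first coefficient is only shrunk. Maintaining the residual incrementally avoids recomputing X * w for every coordinate.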

SGD Update:

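For each mini-batch B:
    g = (1/|B|) * sum_{i in B} grad_L(y_i, w^T x_i) * x_i
    g += alpha * ((1 - l1_ratio) * w + l1_ratio * sign(w))
    w = w - eta * g
    decay eta according to a learning rate schedule

where grad_L is the derivative of the loss with respect to the prediction, and sign(w) is taken elementwise as a subgradient of the L1 norm (with sign(0) = 0).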
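A runnable sketch of the mini-batch update, again in plain Python and not taken from cuML. The step-size schedule eta0 / sqrt(t) is an assumption chosen for illustration (the text does not specify one), and sgd_elastic_net, grad_loss, eta0, and seed are hypothetical names.

```python
import random


def sgd_elastic_net(X, y, grad_loss, alpha, l1_ratio,
                    eta0=0.5, n_epochs=200, batch_size=2, seed=0):
    """Mini-batch SGD with an ElasticNet (sub)gradient and eta0 / sqrt(t) decay.

    grad_loss(y_i, y_hat_i) is the derivative of the loss w.r.t. the prediction.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    w = [0.0] * p
    indices = list(range(n))
    t = 0
    for _ in range(n_epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            t += 1
            eta = eta0 / (t ** 0.5)  # assumed decay schedule
            g = [0.0] * p
            for i in batch:
                y_hat = sum(w[j] * X[i][j] for j in range(p))
                d = grad_loss(y[i], y_hat)
                for j in range(p):
                    g[j] += d * X[i][j] / len(batch)
            for j in range(p):
                sign_wj = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|, 0 at 0
                g[j] += alpha * ((1.0 - l1_ratio) * w[j] + l1_ratio * sign_wj)
                w[j] -= eta * g[j]
    return w


def squared_error_grad(y_true, y_hat):
    """Derivative of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    return y_hat - y_true


# Noise-free targets generated from w = [2.0, 0.1]; no regularization
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * r[0] + 0.1 * r[1] for r in X]
w_sgd = sgd_elastic_net(X, y, squared_error_grad, alpha=0.0, l1_ratio=0.5)
```

Passing the loss derivative as a function lets the same loop serve both regression and classification. Note that a plain sign(w) subgradient step shrinks coefficients but generally does not land them exactly on zero, unlike the soft-thresholding step in coordinate descent.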
