Principle:Scikit learn Scikit learn Generalized Linear Models

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Statistical Modeling
Last Updated	2026-02-08 15:00 GMT

Overview

Generalized linear models extend ordinary linear regression by allowing the response variable to follow distributions from the exponential family and relating the mean to the linear predictor through a link function.

Description

Generalized Linear Models (GLMs) provide a unified framework for regression when the response variable does not follow a Gaussian distribution. They accommodate count data (Poisson), positive continuous data (Gamma), binary data (Bernoulli/Binomial), and other exponential family distributions. This lets linear modeling be applied to responses that violate the normality assumption of ordinary least squares. Because the model remains linear in the coefficients and only the mean is transformed by the link function, GLMs sit between classical linear regression and more flexible non-linear approaches.

Usage

Use GLMs when the response variable is not Gaussian but can be modeled by a distribution from the exponential family. Use PoissonRegressor for count data (e.g., number of events, insurance claims). Use GammaRegressor for strictly positive, right-skewed continuous data (e.g., insurance claim amounts, durations). Use TweedieRegressor when the response follows a Tweedie distribution, a family that includes Poisson and Gamma as special cases; with a power parameter between 1 and 2 it suits data that mix exact zeros with a continuous positive component. GLMs are widely used in actuarial science, healthcare, and ecology.
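The choice of estimator can be illustrated with a minimal sketch on synthetic data. The data-generating setup (`true_coef`, the intercept of 0.2, the Gamma shape of 2, and the `alpha` values) is assumed purely for illustration; only the three estimator classes come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor

rng = np.random.default_rng(0)
n_samples = 500
X = rng.normal(size=(n_samples, 2))
true_coef = np.array([0.5, -0.3])       # hypothetical coefficients
mu = np.exp(0.2 + X @ true_coef)        # log link: mean = exp(linear predictor)

# Count response -> PoissonRegressor
y_counts = rng.poisson(mu)
poisson_model = PoissonRegressor(alpha=1e-3).fit(X, y_counts)

# Strictly positive, right-skewed response -> GammaRegressor
y_positive = rng.gamma(shape=2.0, scale=mu / 2.0)
gamma_model = GammaRegressor(alpha=1e-3).fit(X, y_positive)

# Exact zeros plus a continuous positive part -> TweedieRegressor with 1 < power < 2.
# A sum of N Gamma(2, 0.5) draws with N ~ Poisson(mu) is compound Poisson-Gamma.
n_events = rng.poisson(mu)
y_mixed = np.where(
    n_events > 0,
    rng.gamma(shape=2.0 * np.maximum(n_events, 1), scale=0.5),
    0.0,
)
tweedie_model = TweedieRegressor(power=1.5, alpha=1e-3, link="log").fit(X, y_mixed)

print(poisson_model.coef_, gamma_model.coef_, tweedie_model.coef_)
```

All three models are fitted with a log link here, so their predictions are always positive and the coefficients are comparable across models.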

Theoretical Basis

A GLM consists of three components:

Random component: The response variable $y$ follows a distribution from the exponential family:
$p (y | θ, ϕ) = \exp (\frac{y θ - b (θ)}{a (ϕ)} + c (y, ϕ))$

where $θ$ is the natural parameter, $ϕ$ is the dispersion parameter, and $b (θ)$ is the cumulant function.

Systematic component: A linear predictor $η = X β$ .

Link function: A monotonic function $g$ relating the conditional mean $μ = E [y | X]$ to the linear predictor: $g (μ) = η$ .
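
As a worked instance of the three components (a standard textbook identity, not taken from the sources listed above), the Poisson distribution with mean $μ$ can be written in the exponential-family form:

$p (y | μ) = \frac{μ^{y} e^{- μ}}{y !} = \exp (y \log μ - μ - \log y !)$

Matching terms gives $θ = \log μ$, $b (θ) = e^{θ}$, $a (ϕ) = 1$, and $c (y, ϕ) = - \log y !$. The mean is then $b' (θ) = e^{θ} = μ$, and choosing $g (μ) = θ$ yields the log link, $\log (μ) = X β$.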

Common GLM families and their canonical link functions:

Distribution	Link Function	$g (μ)$
Gaussian	Identity	$μ$
Poisson	Log	$\log (μ)$
Gamma	Reciprocal	$1 / μ$
Bernoulli	Logit	$\log (μ / (1 - μ))$
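
The table can be read as a lookup from distribution to a pair of functions: the link $g$ and its inverse, which maps the linear predictor back to the mean. The sketch below is a plain-Python illustration; the names `LINKS` and `predict_mean` are hypothetical and not part of scikit-learn.

```python
import math

# Canonical link g(mu) and its inverse g^{-1}(eta) for each family in the table.
LINKS = {
    "gaussian": (lambda mu: mu, lambda eta: eta),
    "poisson": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "gamma": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "bernoulli": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def predict_mean(family, x, beta):
    """Return mu = g^{-1}(x . beta) for a single feature vector x."""
    eta = sum(x_j * b_j for x_j, b_j in zip(x, beta))  # linear predictor
    _, inverse_link = LINKS[family]
    return inverse_link(eta)


print(predict_mean("poisson", [1.0, 2.0], [0.1, 0.3]))
print(predict_mean("bernoulli", [1.0, 2.0], [0.1, 0.3]))
```

Note that scikit-learn's PoissonRegressor and GammaRegressor both use the log link, so for Gamma the library departs from the canonical reciprocal link in the table; the log link keeps predictions positive for any value of the linear predictor.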

Parameter estimation is performed by maximizing the log-likelihood, typically via Iteratively Reweighted Least Squares (IRLS) or Newton's method. With an $ℓ_{2}$ penalty of strength $α ≥ 0$, the estimate minimizes the penalized average negative log-likelihood over the $n$ samples:

$\hat{β} = \arg \min_{β} \left[ - \frac{1}{n} \sum_{i = 1}^{n} \log p (y_{i} | x_{i}, β) + \frac{α}{2} ‖ β ‖_{2}^{2} \right]$
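
To make the estimation step concrete, here is a minimal NumPy sketch of IRLS for the Poisson case with a log link, where each IRLS update coincides with a Newton step on the penalized objective. The function name `fit_poisson_irls` and the synthetic data are assumptions for illustration, and this is not a description of scikit-learn's own solver. As a simplification, every coefficient is penalized, including the intercept column.

```python
import numpy as np


def fit_poisson_irls(X, y, alpha=0.0, max_iter=50, tol=1e-8):
    """Minimize -(1/n) * sum(log p(y_i | x_i, beta)) + (alpha / 2) * ||beta||^2
    for a Poisson response with log link, using IRLS.
    """
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    penalty = n_samples * alpha * np.eye(n_features)
    for _ in range(max_iter):
        eta = X @ beta                 # linear predictor
        mu = np.exp(eta)               # inverse log link
        z = eta + (y - mu) / mu        # working response
        XtW = X.T * mu                 # X^T W with weights W = diag(mu)
        beta_new = np.linalg.solve(XtW @ X + penalty, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Synthetic data with hypothetical coefficients; the column of ones is the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_beta = np.array([0.2, 0.5, -0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_irls(X, y, alpha=0.0)     # unpenalized maximum likelihood
beta_ridge = fit_poisson_irls(X, y, alpha=0.1)   # l2-penalized, shrunk toward zero
print(beta_hat, beta_ridge)
```

At convergence the gradient of the objective, $\frac{1}{n} X^{T} (μ - y) + α β$, is numerically zero, which gives a simple way to check the result.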

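The Tweedie distributions form a subfamily of the exponential dispersion family indexed by a power parameter $p$, which sets the mean-variance relationship $\mathrm{Var} (y) = ϕ μ^{p}$. Notable values of $p$ :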

$p = 0$ : Gaussian
$p = 1$ : Poisson
$1 < p < 2$ : compound Poisson-Gamma (continuous on the positive reals with a point mass at zero)
$p = 2$ : Gamma
$p = 3$ : Inverse Gaussian
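
The power parameter controls how the variance grows with the mean, following the standard Tweedie relationship of variance equal to dispersion times the mean raised to the power $p$. The short sketch below illustrates this; the helper names `tweedie_variance` and `tweedie_family_name` are hypothetical and not scikit-learn functions.

```python
def tweedie_variance(mu, power, dispersion=1.0):
    """Variance implied by a Tweedie model: dispersion * mu ** power."""
    return dispersion * mu ** power


def tweedie_family_name(power):
    """Map a power value to the distribution named in the list above."""
    if power == 0:
        return "Gaussian"
    if power == 1:
        return "Poisson"
    if 1 < power < 2:
        return "compound Poisson-Gamma"
    if power == 2:
        return "Gamma"
    if power == 3:
        return "Inverse Gaussian"
    if 0 < power < 1:
        # No Tweedie distribution exists in this range.
        raise ValueError("no Tweedie distribution exists for 0 < power < 1")
    return "other Tweedie distribution"


for p in (0, 1, 1.5, 2, 3):
    print(p, tweedie_family_name(p), tweedie_variance(2.0, p))
```

With a mean of 2, the variance is constant for the Gaussian case, equal to the mean for Poisson, and grows as the square and cube of the mean for Gamma and Inverse Gaussian respectively.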
