Principle:Scikit learn Scikit learn Generalized Linear Models
| Knowledge Sources | |
|---|---|
| Domains | Supervised Learning, Statistical Modeling |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Generalized linear models extend ordinary linear regression by allowing the response variable to follow distributions from the exponential family and relating the mean to the linear predictor through a link function.
Description
Generalized Linear Models (GLMs) provide a unified framework for regression when the response variable does not follow a Gaussian distribution. They accommodate count data (Poisson), positive continuous data (Gamma), binary data (Bernoulli/Binomial), and other exponential family distributions. GLMs solve the problem of applying linear modeling principles to response variables that violate the normality assumption of ordinary least squares. They occupy a central role in statistical modeling, bridging classical linear regression with more flexible non-linear approaches.
Usage
Use GLMs when the response variable is not Gaussian but can be modeled with an exponential family distribution. Use PoissonRegressor for count data (e.g., number of events, insurance claims). Use GammaRegressor for positive, right-skewed continuous data (e.g., insurance claim amounts, durations). Use TweedieRegressor when the response follows a Tweedie distribution; this family includes Poisson and Gamma as special cases and is particularly useful for data that mix exact zeros with a continuous positive component. GLMs are widely used in actuarial science, healthcare, and ecology.
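A minimal sketch of how the three estimators might be applied. The synthetic data, the coefficients used to generate it, and the values of `alpha` and `power` are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Assumed ground truth: a log-linear mean (hypothetical coefficients).
mu = np.exp(0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1])

# Count response, e.g. number of claims.
y_counts = rng.poisson(mu)
# Strictly positive, right-skewed response with mean mu.
y_positive = rng.gamma(shape=2.0, scale=mu / 2.0)
# Compound Poisson-Gamma response: a Poisson number of Gamma-distributed
# amounts, summed; rows with zero events are exactly 0.
n_events = rng.poisson(mu)
y_mixed = np.array([rng.gamma(2.0, 1.0, size=n).sum() for n in n_events])

poisson = PoissonRegressor(alpha=1e-3).fit(X, y_counts)
gamma = GammaRegressor(alpha=1e-3).fit(X, y_positive)
# 1 < power < 2 selects the compound Poisson-Gamma member of the family.
tweedie = TweedieRegressor(power=1.5, alpha=1e-3, link="log").fit(X, y_mixed)

print(poisson.coef_, gamma.coef_, tweedie.coef_)
```

All three models here predict through a log link, so their predictions are strictly positive even when the observed response contains zeros.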
Theoretical Basis
A GLM consists of three components:
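- Random component: The response variable $y$ follows a distribution from the exponential family:

  $$f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right)$$

  where $\theta$ is the natural parameter, $\phi$ is the dispersion parameter, and $b(\theta)$ is the cumulant function.
- Systematic component: A linear predictor $\eta = Xw$.
- Link function: A monotonic function $g$ relating the conditional mean $\mu = \mathbb{E}[y \mid X]$ to the linear predictor: $g(\mu) = \eta$.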
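As a worked instance of these three components, consider a Poisson response with mean $\mu$. The symbols $\theta$ (natural parameter), $\phi$ (dispersion parameter), $b$ (cumulant function), and $\eta$ (linear predictor) follow the usual textbook convention and are assumed here. The probability mass function can be rearranged as:

$$
P(y \mid \mu) = \frac{\mu^{y} e^{-\mu}}{y!} = \exp\bigl(y \log\mu - \mu - \log y!\bigr)
$$

Reading off the terms gives $\theta = \log\mu$, $b(\theta) = e^{\theta}$, and $\phi = 1$. Since $b'(\theta) = e^{\theta} = \mu$, the mean is recovered from the cumulant function, and setting $\theta = \eta$ yields the log link $\log\mu = \eta$.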
Common GLM families and their canonical link functions:
| Distribution | Link Function | $g(\mu)$ |
|---|---|---|
| Gaussian | Identity | $\mu$ |
| Poisson | Log | $\log \mu$ |
| Gamma | Reciprocal | $1/\mu$ |
| Bernoulli | Logit | $\log\frac{\mu}{1-\mu}$ |
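The link functions in the table can be sketched as plain NumPy pairs of a link and its inverse. The `LINKS` dictionary and its keys are hypothetical names chosen for this illustration, not part of any library API.

```python
import numpy as np

# Each entry maps a link name to (g, g_inverse):
# g takes the mean mu to the linear predictor eta, g_inverse maps it back.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (np.log, np.exp),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: np.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + np.exp(-eta)),
    ),
}

# Means in (0, 1) are valid inputs for every link above.
mu = np.array([0.1, 0.5, 0.9])

# Largest absolute error after mapping mu -> eta -> mu for each link.
round_trip_error = {
    name: float(np.max(np.abs(g_inv(g(mu)) - mu)))
    for name, (g, g_inv) in LINKS.items()
}
print(round_trip_error)
```

The inverse link is what a fitted model applies at prediction time: it turns the unbounded linear predictor into a mean that respects the support of the distribution.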
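Parameter estimation is performed by maximizing the log-likelihood, typically via Iteratively Reweighted Least Squares (IRLS) or Newton's method. With an $\ell_2$ penalty, the objective becomes:

$$\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} d(y_i, \hat{\mu}_i) + \frac{\alpha}{2} \lVert w \rVert_2^2$$

where $d$ is the unit deviance of the chosen distribution (minimizing it is equivalent to maximizing the log-likelihood), $\hat{\mu}_i = g^{-1}(x_i^\top w)$ is the predicted mean, and $\alpha \ge 0$ is the regularization strength.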
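To make the estimation step concrete, here is a minimal sketch of penalized Newton iterations for a Poisson GLM with a log link; for a canonical link, Newton steps coincide with IRLS. It assumes an objective of mean half-deviance plus an $\ell_2$ penalty of strength `alpha` on the coefficients, with the intercept left unpenalized. The function name `fit_poisson_newton` and the synthetic data are hypothetical, and this is not scikit-learn's solver (it has no line search or other safeguards).

```python
import numpy as np

def fit_poisson_newton(X, y, alpha=0.0, n_iter=25, tol=1e-10):
    """Newton iterations for an L2-penalized Poisson GLM with log link."""
    n, d = X.shape
    Xb = np.column_stack([np.ones(n), X])  # prepend an intercept column
    penalty = alpha * np.eye(d + 1)
    penalty[0, 0] = 0.0                    # do not penalize the intercept
    w = np.zeros(d + 1)
    w[0] = np.log(y.mean())                # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(Xb @ w)                # inverse log link
        grad = Xb.T @ (mu - y) / n + penalty @ w
        hess = (Xb.T * mu) @ Xb / n + penalty  # X^T diag(mu) X / n + penalty
        step = np.linalg.solve(hess, grad)
        w -= step
        if np.max(np.abs(step)) < tol:
            break
    return w

# Illustrative data with assumed coefficients (0.2, 0.5, -0.3).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = rng.poisson(np.exp(0.2 + 0.5 * X[:, 0] - 0.3 * X[:, 1]))

w_unpenalized = fit_poisson_newton(X, y, alpha=0.0)
w_penalized = fit_poisson_newton(X, y, alpha=1.0)
print(w_unpenalized, w_penalized)
```

Increasing `alpha` shrinks the non-intercept coefficients toward zero, trading some fit for stability.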
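The Tweedie distributions form a subfamily of the exponential dispersion models, characterized by the variance function $V(\mu) = \mu^{p}$ and indexed by a power parameter $p$:
- $p = 0$: Gaussian
- $p = 1$: Poisson
- $1 < p < 2$: compound Poisson-Gamma (continuous positive values with a point mass at zero)
- $p = 2$: Gamma
- $p = 3$: Inverse Gaussian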
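The special cases listed above can be checked numerically with scikit-learn's deviance metrics. This is a small sketch on made-up values: the Tweedie deviance with power 0, 1, and 2 should match the squared error, Poisson deviance, and Gamma deviance respectively.

```python
import numpy as np
from sklearn.metrics import (
    mean_gamma_deviance,
    mean_poisson_deviance,
    mean_squared_error,
    mean_tweedie_deviance,
)

# Illustrative targets and predictions (strictly positive so that
# every power used below is valid).
y_true = np.array([1.0, 2.0, 4.0, 3.0])
y_pred = np.array([1.5, 1.8, 3.0, 3.5])

d_gaussian = mean_tweedie_deviance(y_true, y_pred, power=0)
d_poisson = mean_tweedie_deviance(y_true, y_pred, power=1)
d_gamma = mean_tweedie_deviance(y_true, y_pred, power=2)

print(d_gaussian, mean_squared_error(y_true, y_pred))
print(d_poisson, mean_poisson_deviance(y_true, y_pred))
print(d_gamma, mean_gamma_deviance(y_true, y_pred))
```

An intermediate value such as `power=1.5` has no dedicated metric; it is evaluated through `mean_tweedie_deviance` directly.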