Principle:Scikit learn Scikit learn Density Estimation

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Unsupervised Learning, Probability Theory
Last Updated	2026-02-08 15:00 GMT

Overview

Density estimation infers the underlying probability distribution of a dataset, enabling assessment of how likely new observations are under the learned distribution.

Description

Density estimation methods construct an approximation of the probability density function from observed data. They characterize the distribution of the data either without assuming a specific parametric form (non-parametric methods such as kernel density estimation) or by fitting a flexible mixture of parametric components (semi-parametric methods such as Gaussian mixture models). Density estimation underpins anomaly detection (low-density observations are flagged as anomalous), generative modeling (new samples are drawn from the estimated density), clustering (mixture model components correspond to clusters), and statistical testing. It sits at the core of probabilistic machine learning.
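The anomaly-detection and generative uses mentioned above can be illustrated with a minimal sketch. The synthetic data, the bandwidth of 0.3, and the 1st-percentile threshold are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical training data: 500 draws from a standard normal.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 1))

# Fit a Gaussian KDE; the bandwidth is an illustrative choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X_train)

# score_samples returns the log-density of each query point.
X_new = np.array([[0.1], [8.0]])
log_density = kde.score_samples(X_new)

# Flag points whose log-density is below the 1st percentile
# of the training log-densities (an assumed, ad hoc threshold).
threshold = np.percentile(kde.score_samples(X_train), 1)
is_anomaly = log_density < threshold

# Generative use: draw new points from the estimated density.
samples = kde.sample(n_samples=5, random_state=0)
```

Here the point near the bulk of the data receives a high log-density, while the point far in the tail falls below the threshold and is flagged.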

Usage

Use Kernel Density Estimation (KDE) when a non-parametric estimate of the density is needed and the data is low-to-moderate dimensional. Use Gaussian Mixture Models (GMMs) when the data is believed to arise from a mixture of several Gaussian components, and when both cluster assignments and density estimates are desired. Use Bayesian Gaussian Mixture Models when the number of mixture components is uncertain and should be inferred from the data, or when a Bayesian treatment of uncertainty is preferred. KDE is well-suited for visualization and one-dimensional density estimation; GMMs scale better to moderate dimensions and naturally integrate with clustering workflows.

Theoretical Basis

Kernel Density Estimation (KDE) estimates the density at point $x$ as:

$\hat{f} (x) = \frac{1}{n h^{d}} \sum_{i = 1}^{n} K (\frac{x - x_{i}}{h})$

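where $K$ is a kernel function (typically Gaussian), $h > 0$ is the bandwidth, $n$ is the number of samples, and $d$ is the dimensionality. The bandwidth $h$ controls the smoothness of the estimate: a bandwidth that is too small produces a noisy, spiky estimate that follows individual samples, while one that is too large oversmooths and hides real structure such as multiple modes.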
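As a sanity check on the formula above, the sketch below evaluates the Gaussian-kernel estimate directly in NumPy and compares it with `KernelDensity`. The helper `gaussian_kde_manual`, the synthetic data, and the bandwidth value are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # n = 200 samples, d = 2 dimensions
h = 0.5                            # illustrative bandwidth
x_query = np.array([[0.0, 0.0], [1.0, -1.0]])


def gaussian_kde_manual(x_query, X, h):
    """Evaluate f_hat(x) = 1/(n h^d) * sum_i K((x - x_i)/h) with a Gaussian K."""
    n, d = X.shape
    # u has shape (m, n, d): scaled difference between each query and each sample.
    u = (x_query[:, None, :] - X[None, :, :]) / h
    kernel_values = np.exp(-0.5 * np.sum(u**2, axis=2)) / (2 * np.pi) ** (d / 2)
    return kernel_values.sum(axis=1) / (n * h**d)


manual = gaussian_kde_manual(x_query, X, h)

# scikit-learn returns log-densities, so exponentiate before comparing.
kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X)
sk = np.exp(kde.score_samples(x_query))
```

The two arrays should agree up to floating-point error, which suggests that the library estimate matches the formula for the Gaussian kernel.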

Common kernels include:

Gaussian: $K (u) = \frac{1}{(2 π)^{d / 2}} \exp (- \frac{1}{2} ‖ u ‖^{2})$
Tophat: $K (u) \propto 𝟏 (‖ u ‖ \leq 1)$
Epanechnikov: $K (u) \propto (1 - ‖ u ‖^{2}) 𝟏 (‖ u ‖ \leq 1)$, which in one dimension normalizes to $K (u) = \frac{3}{4} (1 - u^{2}) 𝟏 (| u | \leq 1)$
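The kernels listed above can be compared with a small one-dimensional sketch. It fits `KernelDensity` to a single point at the origin with bandwidth 1, so the estimated density is the normalized kernel itself. Note that scikit-learn normalizes every kernel internally so that it integrates to one; the grid and the Riemann-sum integration are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# A single sample at 0 with h = 1: the density equals the normalized kernel K(u).
X = np.array([[0.0]])
grid = np.linspace(-6.0, 6.0, 12001).reshape(-1, 1)
dx = grid[1, 0] - grid[0, 0]

integrals = {}
peaks = {}
for kernel in ("gaussian", "tophat", "epanechnikov"):
    kde = KernelDensity(kernel=kernel, bandwidth=1.0).fit(X)
    density = np.exp(kde.score_samples(grid))
    # Riemann-sum approximation of the integral over the grid.
    integrals[kernel] = float(density.sum() * dx)
    # Kernel height at the centre, K(0).
    peaks[kernel] = float(np.exp(kde.score_samples(X))[0])
```

Each integral should be close to one, and the Epanechnikov peak should match the $\frac{3}{4}$ factor of the one-dimensional formula.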

A Gaussian Mixture Model (GMM) models the density as a weighted sum of Gaussians:

$p (x) = \sum_{k = 1}^{K} π_{k} 𝒩 (x | μ_{k}, Σ_{k})$

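where $K$ here denotes the number of mixture components (not the kernel function used in the KDE formula), $π_{k} \geq 0$ are mixing weights ( $\sum_{k} π_{k} = 1$ ), and $μ_{k}, Σ_{k}$ are the mean and covariance of component $k$.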
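The mixture density can be checked numerically against a fitted model. The sketch below fits `GaussianMixture` to synthetic two-cluster data, then evaluates the weighted sum of Gaussians by hand using SciPy. The helper `mixture_density` and the data are assumptions introduced for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Hypothetical data: two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[2.0, 1.0], scale=0.8, size=(150, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)


def mixture_density(x, weights, means, covariances):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(
        w * multivariate_normal(mean=m, cov=c).pdf(x)
        for w, m, c in zip(weights, means, covariances)
    )


x_query = np.array([[-2.0, 0.0], [0.0, 0.5], [2.0, 1.0]])
manual = mixture_density(x_query, gmm.weights_, gmm.means_, gmm.covariances_)

# score_samples returns log p(x), so exponentiate before comparing.
sk = np.exp(gmm.score_samples(x_query))
```

The hand-computed values should agree with the library output, and the fitted `weights_` should sum to one.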

Parameters are estimated via the Expectation-Maximization (EM) algorithm:

E-step: Compute responsibilities $γ_{i k} = \frac{π_{k} 𝒩 (x_{i} | μ_{k}, Σ_{k})}{\sum_{j} π_{j} 𝒩 (x_{i} | μ_{j}, Σ_{j})}$
M-step: Update parameters:
$μ_{k} = \frac{\sum_{i} γ_{i k} x_{i}}{\sum_{i} γ_{i k}}$

$Σ_{k} = \frac{\sum_{i} γ_{i k} (x_{i} - μ_{k}) (x_{i} - μ_{k})^{T}}{\sum_{i} γ_{i k}}$

$π_{k} = \frac{1}{n} \sum_{i} γ_{i k}$
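The E-step and M-step updates above can be sketched directly in NumPy. This is a bare-bones illustration under simplifying assumptions (fixed iteration count, crude hand-picked initialization, no covariance regularization or log-space arithmetic), not a description of scikit-learn's implementation. The function names `e_step` and `m_step` are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal


def e_step(X, weights, means, covs):
    """Responsibilities gamma[i, k], normalized over components k."""
    weighted = np.column_stack([
        w * multivariate_normal(mean=m, cov=c).pdf(X)
        for w, m, c in zip(weights, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)


def m_step(X, gamma):
    """Update weights, means and covariances from the responsibilities."""
    n_k = gamma.sum(axis=0)                      # sum_i gamma_ik
    means = (gamma.T @ X) / n_k[:, None]         # weighted means
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / n_k[k])
    weights = n_k / X.shape[0]                   # pi_k = (1/n) sum_i gamma_ik
    return weights, means, np.array(covs)


# Hypothetical data: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(200, 2)),
])

# Crude initialization, chosen by hand for this example.
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, -1.0], [1.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])

for _ in range(50):
    gamma = e_step(X, weights, means, covs)
    weights, means, covs = m_step(X, gamma)
```

On data this well separated, the estimated means should move close to the true cluster centres. A practical implementation would also monitor the log-likelihood for convergence and guard against singular covariances.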

A Bayesian Gaussian Mixture Model places priors on the mixture parameters (a Dirichlet prior on the weights and a Gaussian-Wishart prior on the means and precision matrices). Fitted with variational inference, it can determine the effective number of components automatically: the number of components specified acts as an upper bound, and the weights of unnecessary components are driven toward zero.
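The pruning behavior can be explored with a short sketch. It deliberately requests more components than the synthetic data contains and inspects the fitted weights. The data, the concentration prior of 0.01, and the 0.01 weight cutoff are assumptions chosen for illustration; how many components survive depends on the prior and on initialization.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical data: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-4.0, scale=0.7, size=(300, 2)),
    rng.normal(loc=4.0, scale=0.7, size=(300, 2)),
])

# n_components is an upper bound; a small concentration prior
# encourages the model to leave surplus components nearly empty.
bgmm = BayesianGaussianMixture(
    n_components=8,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.01,
    max_iter=500,
    random_state=0,
).fit(X)

weights = np.round(bgmm.weights_, 3)

# Count components with non-negligible weight (the cutoff is an ad hoc choice).
effective = int(np.sum(bgmm.weights_ > 0.01))
```

Printing `weights` shows which components carry mass; components with weights near zero can be treated as unused.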
