Principle:Scikit learn Scikit learn Gaussian Process

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Bayesian Inference
Last Updated	2026-02-08 15:00 GMT

Overview

Gaussian processes define a distribution over functions, providing a non-parametric Bayesian approach to regression and classification with built-in uncertainty quantification.

Description

A Gaussian Process (GP) is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by a mean function and a covariance (kernel) function; the kernel encodes assumptions about the function being modeled, such as smoothness, periodicity, and length scale. GPs yield predictions together with uncertainty estimates without committing to a fixed parametric form, and those estimates are well calibrated when the kernel and noise model suit the data. GPs are particularly valuable where uncertainty quantification is critical, such as Bayesian optimization, active learning, and safety-critical applications.

Usage

Use Gaussian Process Regression (GPR) when you need both predictions and uncertainty estimates, when the dataset is small to moderate in size (exact inference costs $O(n^3)$ time and $O(n^2)$ memory in the number of training points $n$), and when the function is expected to be smooth. Use Gaussian Process Classification (GPC) for probabilistic classification with uncertainty estimates. The choice of kernel is critical: use the RBF kernel for very smooth functions, the Matern kernel when the function is rougher and its degree of smoothness should be controlled explicitly, periodic kernels for periodic patterns, and composite kernels (sums and products) for more complex structure. GPs are not suitable for very large datasets without approximation methods.
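As a minimal sketch of the regression use case above, the following fits scikit-learn's `GaussianProcessRegressor` with a composite kernel and requests a standard deviation alongside each prediction. The synthetic sine data, the kernel starting values, and the variable names are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Synthetic data (an assumption for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 5.0, size=(25, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0.0, 0.1, size=25)

# Composite kernel: (signal variance * RBF) + learned observation noise.
# The numeric values are only starting points; fit() tunes them.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(X_train, y_train)

# Predictive mean and standard deviation at new inputs.
X_test = np.linspace(0.0, 5.0, 50).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_test, return_std=True)

print(gpr.kernel_)  # kernel with optimized hyperparameters
```

The fitted kernel is available as `gpr.kernel_`, and `y_std` can be used to draw a confidence band around `y_mean`.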

Theoretical Basis

A Gaussian process is written as:

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$

where $m(x)$ is the mean function (often taken to be zero) and $k(x, x')$ is the covariance (kernel) function.
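To make the definition concrete, here is a small NumPy sketch that draws sample functions from a zero-mean GP prior on a finite grid, where the prior reduces to an ordinary multivariate Gaussian. The helper `rbf_kernel`, the grid, the length scale, and the jitter term are assumptions made for this illustration.

```python
import numpy as np

def rbf_kernel(x, length_scale=1.0):
    """Squared-exponential covariance matrix for 1-D inputs (unit signal variance)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2 / length_scale**2)

# Finite set of inputs: any finite collection of f(x) values is jointly Gaussian.
x = np.linspace(0.0, 5.0, 50)

# Small jitter on the diagonal keeps the covariance numerically positive definite.
K = rbf_kernel(x, length_scale=1.0) + 1e-8 * np.eye(len(x))

# Each row is one function drawn from the prior, evaluated on the grid.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
```

A shorter length scale in this sketch would produce samples that vary more rapidly, which is how the kernel encodes assumptions about the function.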

Gaussian Process Regression: Given training data $(X, y)$ with noise model $y = f(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, and a zero prior mean, the predictive distribution at test points $X_*$ is:

$f_* \mid X, y, X_* \sim \mathcal{N}(\bar{f}_*, \operatorname{cov}(f_*))$

where:

$\bar{f}_* = K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y$

$\operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)$
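The predictive equations can be sketched directly in NumPy. The version below assumes a zero prior mean and an RBF kernel, and uses a Cholesky factorization in place of an explicit matrix inverse, which is a common numerically stable choice rather than something the equations require. The function names `rbf_kernel` and `gp_posterior` and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """RBF covariance between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2, length_scale=1.0, signal_var=1.0):
    """Predictive mean and covariance of f_* for a zero-mean GP."""
    K = rbf_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, length_scale, signal_var)        # K(X_*, X)
    K_ss = rbf_kernel(X_star, X_star, length_scale, signal_var)  # K(X_*, X_*)

    L = np.linalg.cholesky(K)                            # K + noise = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + noise)^{-1} y
    mean = K_s @ alpha

    V = np.linalg.solve(L, K_s.T)                        # L^{-1} K(X, X_*)
    cov = K_ss - V.T @ V
    return mean, cov

# Toy data (an assumption for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X).ravel()
X_star = np.array([[0.5], [1.5], [2.5]])

mean, cov = gp_posterior(X, y, X_star, noise_var=1e-4)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # pointwise predictive std
```

The diagonal of `cov` gives the pointwise predictive variance of the latent function; adding `noise_var` to it would give the variance of a noisy observation.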

Gaussian Process Classification: For binary classification, the latent function is passed through a sigmoid (or probit) link function:

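$p(y = 1 \mid x) = \sigma(f(x))$

Here $\sigma(\cdot)$ denotes the link function, not the noise standard deviation $\sigma_n$. Because the resulting likelihood is non-Gaussian, the posterior over $f$ is no longer Gaussian, and approximate inference is needed (Laplace approximation or Expectation Propagation). scikit-learn's GaussianProcessClassifier uses the logistic link with the Laplace approximation.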
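A short sketch of binary GP classification with scikit-learn's `GaussianProcessClassifier` follows. The one-dimensional threshold dataset and the kernel starting values are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy data (an assumption for illustration): the label is 1 when x > 0.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = (X.ravel() > 0.0).astype(int)

# 1.0 * RBF(...) builds a ConstantKernel * RBF product; fit() tunes both factors.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
gpc.fit(X, y)

# Class probabilities: column 0 is p(y = 0 | x), column 1 is p(y = 1 | x).
X_test = np.array([[-2.0], [0.0], [2.0]])
proba = gpc.predict_proba(X_test)
```

Probabilities near 0.5, expected here close to the decision boundary at `x = 0`, indicate inputs where the classifier is uncertain.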

Common Kernel Functions:

RBF (Squared Exponential): $k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2 \ell^2}\right)$

Matern kernel: $k(x, x') = \sigma_f^2 \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2 \nu} \|x - x'\|}{\ell}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2 \nu} \|x - x'\|}{\ell}\right)$

where $\Gamma$ is the gamma function and $K_{\nu}$ is the modified Bessel function of the second kind. The parameter $\nu$ controls smoothness: $\nu = 1/2$ gives the exponential kernel, and the kernel approaches the RBF kernel as $\nu \to \infty$.

Rational Quadratic: $k(x, x') = \sigma_f^2 \left(1 + \frac{\|x - x'\|^2}{2 \alpha \ell^2}\right)^{-\alpha}$
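The three kernel formulas above can be written out in NumPy and SciPy as a rough sketch. The function names are hypothetical, `signal_var` stands for the signal variance, and the Matern function sets zero-distance entries to their limiting value because the Bessel term is singular at zero.

```python
import numpy as np
from scipy.special import gamma, kv

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def rbf(A, B, length_scale=1.0, signal_var=1.0):
    r = pairwise_dist(A, B)
    return signal_var * np.exp(-r**2 / (2.0 * length_scale**2))

def matern(A, B, length_scale=1.0, nu=1.5, signal_var=1.0):
    r = pairwise_dist(A, B)
    z = np.sqrt(2.0 * nu) * r / length_scale
    # The kernel tends to signal_var as r -> 0, where kv(nu, 0) is infinite.
    K = np.full_like(z, signal_var)
    nz = z > 0
    K[nz] = signal_var * (2.0 ** (1.0 - nu) / gamma(nu)) * z[nz] ** nu * kv(nu, z[nz])
    return K

def rational_quadratic(A, B, length_scale=1.0, alpha=1.0, signal_var=1.0):
    r = pairwise_dist(A, B)
    return signal_var * (1.0 + r**2 / (2.0 * alpha * length_scale**2)) ** (-alpha)

X = np.array([[0.0], [0.5], [2.0]])
K_rbf = rbf(X, X)
K_mat = matern(X, X, nu=1.5)
K_rq = rational_quadratic(X, X, alpha=0.5)
```

In practice scikit-learn's own kernel classes in `sklearn.gaussian_process.kernels` (`RBF`, `Matern`, `RationalQuadratic`, combined with `ConstantKernel` for the signal variance) would normally be used instead of hand-written functions like these.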

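The hyperparameters $\theta$ (kernel parameters such as $\sigma_f$ and $\ell$, together with the noise variance $\sigma_n^2$) are chosen by maximizing the log marginal likelihood:

$\log p(y \mid X, \theta) = -\frac{1}{2} y^{T} (K + \sigma_n^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log 2\pi$

The first term measures data fit, the second penalizes model complexity through the log-determinant, and the third is a normalization constant. Maximizing this quantity (type-II maximum likelihood) provides a principled approach to model selection that automatically balances data fit and model complexity (Occam's razor).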
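As an illustrative sketch, the log marginal likelihood can be evaluated term by term with a Cholesky factorization, using the identity $\log |A| = 2 \sum_i \log L_{ii}$ for $A = L L^{T}$. The RBF kernel, the fixed noise variance, the synthetic data, and the coarse grid of candidate length scales are assumptions; a real optimizer would search over all hyperparameters with gradients rather than a grid.

```python
import numpy as np

def rbf_kernel(X, length_scale, signal_var=1.0):
    """RBF covariance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return signal_var * np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def log_marginal_likelihood(X, y, length_scale, noise_var, signal_var=1.0):
    n = len(y)
    K = rbf_kernel(X, length_scale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + noise)^{-1} y

    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 * log|K + noise|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant

# Synthetic data (an assumption for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=20)

# Score a few candidate length scales and keep the one with the highest value.
candidates = [0.1, 0.5, 1.0, 2.0, 5.0]
scores = {ell: log_marginal_likelihood(X, y, ell, noise_var=0.01) for ell in candidates}
best_length_scale = max(scores, key=scores.get)
```

A fitted `GaussianProcessRegressor` exposes the optimized value as `log_marginal_likelihood_value_`, which can serve as a cross-check for a hand-written version like this one.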
