Principle: Scikit-learn Kernel Methods
| Knowledge Sources | |
|---|---|
| Domains | Supervised Learning, Non-Parametric Methods |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Kernel methods map data into high-dimensional feature spaces via kernel functions and perform linear algorithms in those spaces, enabling non-linear models with strong theoretical foundations.
Description
Kernel methods extend linear algorithms to handle non-linear relationships by implicitly operating in a high-dimensional (or even infinite-dimensional) feature space defined by a kernel function. The kernel trick avoids explicit computation of the high-dimensional feature vectors, instead computing inner products directly via the kernel function. This enables linear methods like ridge regression, PCA, and SVMs to learn non-linear patterns while retaining their theoretical properties. Kernel approximation and random projection methods provide scalable alternatives that explicitly construct approximate low-dimensional feature maps, enabling kernel-like performance with linear algorithms at reduced computational cost.
Usage
Use Kernel Ridge Regression when you want a non-linear regression model that combines the kernel trick with ridge regularization, particularly when exact kernel evaluation is feasible (small to moderate datasets). Use kernel approximation methods (Nystroem, Random Fourier Features) when the dataset is too large for exact kernel methods but you want to approximate kernel-based learning using explicit feature maps with linear models. Use random projection methods (Gaussian random projection, sparse random projection) when you need fast, data-independent dimensionality reduction that approximately preserves pairwise distances, as guaranteed by the Johnson-Lindenstrauss lemma.
Theoretical Basis
Kernel Trick: A kernel function computes the inner product in a feature space without explicit transformation:

$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$$

where $\phi: \mathcal{X} \to \mathcal{H}$ maps inputs to a (potentially infinite-dimensional) Hilbert space $\mathcal{H}$. Mercer's theorem ensures that any positive semi-definite kernel corresponds to such a mapping.
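As a concrete sketch of the trick, the degree-2 polynomial kernel $(x^\top y + 1)^2$ on 2-D inputs has an explicit 6-dimensional feature map; the Gram matrix computed through that map matches the one computed directly by the kernel (toy data chosen here for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # five toy 2-D inputs

# Explicit feature map for k(x, y) = (x . y + 1)^2 in two dimensions:
# phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2]
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

Phi = np.array([phi(x) for x in X])
G_explicit = Phi @ Phi.T  # Gram matrix via explicit features

# Kernel trick: the same Gram matrix without ever forming phi(x)
G_kernel = polynomial_kernel(X, degree=2, gamma=1.0, coef0=1.0)
assert np.allclose(G_explicit, G_kernel)
```

For an RBF kernel the feature space is infinite-dimensional, so only the kernel-trick side of this comparison remains computable.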
Kernel Ridge Regression: Combines the kernel trick with ridge regression. The dual coefficients and prediction function are:

$$\hat{\alpha} = (K + \lambda I)^{-1} y, \qquad \hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i \, k(x, x_i)$$

where $K_{ij} = k(x_i, x_j)$ is the kernel matrix and $\lambda$ is the regularization parameter. The dual formulation has complexity $O(n^3)$ for $n$ samples, independent of the (possibly infinite) feature space dimensionality.
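A minimal sketch with scikit-learn's `KernelRidge` (synthetic sine data and hyperparameters chosen purely for illustration; `alpha` plays the role of $\lambda$ above):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # noisy sine

model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y)

# One dual coefficient per training sample, as in the dual solution above
print(model.dual_coef_.shape)  # (200,)
print(model.score(X, y))       # training R^2
```

Because the fitted model stores one dual coefficient per sample, prediction cost grows with the training set size, which is why exact kernel ridge is best suited to small or moderate datasets.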
Kernel Approximation: Explicit feature maps $\hat{\phi}: \mathbb{R}^d \to \mathbb{R}^D$ approximate the kernel:

$$k(x, x') \approx \hat{\phi}(x)^\top \hat{\phi}(x')$$
Random Fourier Features (RBF Sampler): For shift-invariant kernels $k(x, x') = k(x - x')$, Bochner's theorem guarantees:

$$k(x - x') = \int p(\omega)\, e^{i \omega^\top (x - x')}\, d\omega \approx \hat{\phi}(x)^\top \hat{\phi}(x')$$

where $\hat{\phi}(x) = \sqrt{2/D}\,\big[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\big]^\top$ with $\omega_j \sim p(\omega)$ and $b_j \sim \mathrm{Uniform}(0, 2\pi)$. For the RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, $p(\omega) = \mathcal{N}(0, 2\gamma I)$.
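A sketch of the approximation quality with `RBFSampler` (toy data; the Monte Carlo error shrinks roughly as $1/\sqrt{D}$ as `n_components` grows):

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
gamma = 0.5

# Monte Carlo feature map with D = 2000 random Fourier features
Z = RBFSampler(gamma=gamma, n_components=2000, random_state=0).fit_transform(X)

K_approx = Z @ Z.T                    # phi_hat(x)^T phi_hat(x')
K_exact = rbf_kernel(X, gamma=gamma)  # exp(-gamma * ||x - x'||^2)
print(np.abs(K_approx - K_exact).max())  # small; decreases with larger D
```

Since `Z` is an ordinary feature matrix, any linear estimator (e.g. `SGDClassifier` or `Ridge`) can be trained on it to mimic kernel-based learning at linear cost.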
Nystroem Approximation: Approximates the kernel matrix using a subset of $m$ landmark points:

$$K \approx K_{n,m} K_{m,m}^{+} K_{m,n}$$

where $K_{n,m}$ is the kernel matrix between all $n$ points and the landmarks, $K_{m,m}$ is the kernel matrix among landmarks, and $K_{m,m}^{+}$ is its pseudo-inverse. The corresponding explicit feature map is $\hat{\phi}(x) = K_{m,m}^{-1/2}\,[k(x, z_1), \dots, k(x, z_m)]^\top$ for landmarks $z_1, \dots, z_m$.
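A sketch with scikit-learn's `Nystroem` transformer (toy data; landmark count and `gamma` are illustrative choices):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))

# 100 landmark points sampled from the 300 training points
ny = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
Z = ny.fit_transform(X)  # explicit features, shape (300, 100)

K_approx = Z @ Z.T       # realizes K_{n,m} K_{m,m}^+ K_{m,n}
K_exact = rbf_kernel(X, gamma=0.1)
print(Z.shape, np.abs(K_approx - K_exact).max())
```

Unlike random Fourier features, Nystroem is data-dependent (the landmarks come from the training set) and works with any kernel, not only shift-invariant ones.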
Random Projection: Projects data from $\mathbb{R}^d$ to $\mathbb{R}^k$ (with $k \ll d$) using a random matrix $R \in \mathbb{R}^{k \times d}$:

$$x' = R x$$

The Johnson-Lindenstrauss Lemma guarantees that for $k = O(\varepsilon^{-2} \log n)$, pairwise distances among $n$ points are preserved within a factor of $(1 \pm \varepsilon)$ with high probability:

$$(1 - \varepsilon)\,\|u - v\|^2 \;\leq\; \|Ru - Rv\|^2 \;\leq\; (1 + \varepsilon)\,\|u - v\|^2$$

Gaussian random projection uses $R_{ij} \sim \mathcal{N}(0, 1/k)$. Sparse random projection uses a sparse matrix with entries in $\{-\sqrt{s/k},\, 0,\, +\sqrt{s/k}\}$ with probabilities $\{1/(2s),\, 1 - 1/s,\, 1/(2s)\}$, providing computational savings while maintaining the JL guarantee.
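A sketch of both projection types and the JL dimension bound (toy high-dimensional data; the point and dimension counts are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.random_projection import (GaussianRandomProjection,
                                       SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# Minimum target dimension the JL lemma requires for 1000 points at eps = 0.3
k_min = johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.3)
print(k_min)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))  # 100 points in d = 10000 dimensions

for proj in (GaussianRandomProjection(n_components=1000, random_state=0),
             SparseRandomProjection(n_components=1000, dense_output=True,
                                    random_state=0)):
    Xp = proj.fit_transform(X)
    mask = ~np.eye(len(X), dtype=bool)  # ignore zero self-distances
    ratios = euclidean_distances(Xp)[mask] / euclidean_distances(X)[mask]
    # pairwise-distance ratios concentrate around 1, as the JL lemma predicts
    print(type(proj).__name__, ratios.min(), ratios.max())
```

Both transformers are data-independent: `fit` only draws the random matrix $R$, so the projection can be reused on new data without retraining.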