Principle:Pyro ppl Pyro Deep Kernel Learning

Knowledge Sources	Deep Kernel Learning, Stochastic Variational Deep Kernel Learning, Scalable Gaussian Processes with Inducing Points
Domains	Gaussian Processes, Deep Learning, Uncertainty Quantification
Last Updated	2026-02-09 09:00 GMT

Overview

Stochastic variational deep kernel learning combines deep neural networks with Gaussian processes by using a neural network as a learned feature extractor whose outputs are passed to a GP, enabling both the representation power of deep learning and the calibrated uncertainty of GPs.

Description

Standard Gaussian processes use a fixed kernel function (e.g., RBF, Matern) that operates directly on the input space. This limits their ability to capture complex, high-dimensional patterns. Conversely, deep neural networks are powerful feature extractors but often produce poorly calibrated uncertainty estimates.

Deep Kernel Learning (DKL) combines the best of both approaches:

A deep neural network g_phi(x) transforms raw inputs x into a learned feature representation.
A Gaussian process with kernel k(g_phi(x), g_phi(x')) operates on the learned features.
Both the neural network parameters phi and the GP hyperparameters are jointly optimized.

The neural network learns a feature space where the GP's simple kernel (e.g., RBF) is effective, while the GP provides principled uncertainty quantification in this learned space.

Stochastic Variational DKL scales this approach to large datasets using:

Inducing points: A small set of pseudo-inputs that summarize the GP, reducing the computational cost of inference from O(n^3) to O(n * m^2), where m << n is the number of inducing points.
Stochastic optimization: The variational lower bound (ELBO) is optimized using mini-batches of data.
Variational distribution: An approximate posterior q(u) over the GP function values at the inducing points.

This combination makes GP-based uncertainty estimates feasible on datasets with millions of points, where exact GP inference is out of reach. That makes it practical for real-world applications such as medical diagnosis, autonomous driving, and active learning.
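To make the gap between O(n^3) and O(n * m^2) concrete, here is a rough operation count. The sizes n and m are illustrative assumptions, not values from the source, and constant factors are ignored.

```python
def exact_gp_ops(n):
    """Order-of-magnitude cost of exact GP inference: O(n^3)."""
    return n ** 3

def sparse_gp_ops(n, m):
    """Order-of-magnitude cost with m inducing points: O(n * m^2)."""
    return n * m ** 2

# Illustrative sizes (assumed): one million data points, 500 inducing points.
n, m = 1_000_000, 500
speedup = exact_gp_ops(n) / sparse_gp_ops(n, m)
print(f"{speedup:.0e}")  # prints 4e+06
```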

Usage

Use stochastic variational deep kernel learning when:

You need both the representation power of deep learning and calibrated uncertainty estimates.
You are working with high-dimensional inputs (images, text) where standard GP kernels are inadequate.
You are building active learning systems that need reliable uncertainty for acquisition functions.
The dataset is too large for exact GP inference but you still need uncertainty quantification.
Standard deep learning models are overconfident on out-of-distribution inputs.

Theoretical Basis

Deep kernel:

# Standard GP kernel: k(x, x') = sigma^2 * exp(-|x - x'|^2 / (2*l^2))
# Deep kernel: k_deep(x, x') = k_base(g_phi(x), g_phi(x'))

# where g_phi: R^D -> R^d is a neural network
# D: input dimension (potentially high)
# d: feature dimension (typically low, e.g., 2-10)
# k_base: standard kernel (RBF, Matern, etc.)

# The neural network learns a representation in which a
# simple base kernel (e.g., RBF) models the data well
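The composition k_base(g_phi(x), g_phi(x')) can be sketched in NumPy as follows. This is a minimal illustration, not Pyro's API: the two-layer tanh network, the random (untrained) weights, and the names `feature_map`, `rbf_kernel`, and `deep_kernel` are all assumptions made for the example.

```python
import numpy as np

def feature_map(x, W1, b1, W2, b2):
    """Hypothetical two-layer network g_phi: R^D -> R^d with a tanh hidden layer."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ W2 + b2

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """k_base(a, b) = variance * exp(-|a - b|^2 / (2 * lengthscale^2))."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def deep_kernel(x1, x2, params):
    """k_deep(x, x') = k_base(g_phi(x), g_phi(x'))."""
    return rbf_kernel(feature_map(x1, *params), feature_map(x2, *params))

rng = np.random.default_rng(0)
D, hidden_dim, d = 20, 16, 2      # input, hidden, and feature dimensions (assumed)
params = (rng.normal(size=(D, hidden_dim)), np.zeros(hidden_dim),
          rng.normal(size=(hidden_dim, d)), np.zeros(d))

X = rng.normal(size=(5, D))       # five inputs in the high-dimensional space
K = deep_kernel(X, X, params)     # 5 x 5 Gram matrix computed in feature space
```

Because the base kernel is applied to the network outputs, the Gram matrix stays symmetric and positive semi-definite for any choice of weights; training only changes which inputs the kernel treats as similar.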

Sparse GP with inducing points:

# Full GP prior: p(f | X) = Normal(0, K_XX); exact inference costs O(n^3)
# Sparse GP: introduce inducing points Z = {z_1, ..., z_m}
# u = f(Z): function values at inducing points

# Variational approximation:
# q(f, u) = p(f | u, X, Z) * q(u)
# q(u) = Normal(m, S)  (variational parameters)

# ELBO:
# L = sum_i E_{q(f_i)}[log p(y_i | f_i)] - KL(q(u) || p(u))

# where q(f_i) is obtained from:
# q(f_i) = integral p(f_i | u) q(u) du
#         = Normal(K_{x_i,Z} K_{ZZ}^{-1} m,
#                  K_{x_i,x_i} - K_{x_i,Z} K_{ZZ}^{-1} (K_{ZZ} - S) K_{ZZ}^{-1} K_{Z,x_i})

# Cost: O(n * m^2) for a full pass over the data instead of O(n^3);
# with mini-batches of size b, each step costs O(b * m^2 + m^3)
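The marginal q(f_i) above can be evaluated directly. The following NumPy sketch assumes a unit-variance RBF base kernel, inputs that have already been mapped to feature space, and hand-picked values for Z, m, and S; the function names are hypothetical and this is not how Pyro structures the computation internally.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def q_f_marginals(H, Z, m, S, jitter=1e-6):
    """Mean and variance of q(f_i) for features H (n x d) and inducing points Z (M x d)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))   # jitter for numerical stability
    Kxz = rbf(H, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T           # A = K_xZ K_ZZ^{-1}, shape (n, M)
    mean = A @ m
    kxx = np.ones(len(H))                       # diag(K_xx) for a unit-variance RBF
    var = kxx - np.sum((A @ (Kzz - S)) * A, axis=1)
    return mean, var

# Assumed toy values: three inducing points in a 2-D feature space.
Z = np.array([[-2.0, 0.0], [0.0, 0.0], [2.0, 0.0]])
m = np.array([0.5, -1.0, 0.3])
S = 0.1 * np.eye(3)
# First row sits on an inducing point; last row is far from all of them.
H = np.array([[0.0, 0.0], [1.0, 0.5], [10.0, 10.0]])
mean, var = q_f_marginals(H, Z, m, S)
```

At an inducing point the marginal recovers the variational mean and variance (here about -1.0 and 0.1), while far from every inducing point it falls back to the prior (mean 0, variance 1).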

Joint optimization:

# Parameters to optimize:
# phi: neural network weights
# theta: GP hyperparameters (kernel lengthscale, variance, noise)
# Z: inducing point locations
# m, S: variational parameters for q(u)

# Objective: maximize ELBO
# L(phi, theta, Z, m, S) = sum_i E_{q(f_i)}[log p(y_i | f_i)] - KL(q(u) || p(u))

# Stochastic optimization:
# At each step, sample a mini-batch {(x_j, y_j)} and compute:
# L_batch approx (n/|batch|) * sum_{j in batch} E_{q(f_j)}[log p(y_j | f_j)] - KL(q(u) || p(u))

# Gradients flow backward through:
# loss -> GP -> features -> neural network weights
# enabling end-to-end training
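The mini-batch estimate can be checked numerically. The sketch below assumes a Gaussian likelihood (so the expected log-likelihood has a closed form), treats `noise` as the noise variance, and takes the q(f_i) means and variances as given arrays; all values and function names are made up for illustration.

```python
import numpy as np

def expected_log_lik(y, mu, var, noise):
    """E_{q(f_i)}[log Normal(y_i | f_i, noise)] for q(f_i) = Normal(mu_i, var_i)."""
    return -0.5 * np.log(2 * np.pi * noise) - 0.5 * ((y - mu) ** 2 + var) / noise

def kl_gaussians(m, S, K):
    """KL(Normal(m, S) || Normal(0, K)) between q(u) and the prior p(u)."""
    trace_term = np.trace(np.linalg.solve(K, S))
    maha = m @ np.linalg.solve(K, m)
    logdet = np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (trace_term + maha - len(m) + logdet)

def minibatch_elbo(y, mu, var, noise, kl, batch_idx):
    """(n / |batch|) * sum over the batch of expected log-likelihoods, minus KL."""
    scale = len(y) / len(batch_idx)
    batch_ll = expected_log_lik(y[batch_idx], mu[batch_idx], var[batch_idx], noise)
    return scale * batch_ll.sum() - kl

rng = np.random.default_rng(0)
n, noise = 12, 0.25
y, mu = rng.normal(size=n), rng.normal(size=n)
var = rng.uniform(0.05, 0.5, size=n)
K = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.3], [0.0, 0.3, 1.0]])  # stand-in for K_ZZ
m, S = np.array([0.5, -1.0, 0.3]), 0.1 * np.eye(3)

kl = kl_gaussians(m, S, K)
full_elbo = expected_log_lik(y, mu, var, noise).sum() - kl
batches = np.arange(n).reshape(3, 4)          # three disjoint batches of four points
estimates = [minibatch_elbo(y, mu, var, noise, kl, b) for b in batches]
```

Averaging the three estimates reproduces `full_elbo`, which is why the n/|batch| rescaling yields an unbiased objective: the KL term is counted once per step, and the likelihood term is scaled up to the full dataset size.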

Predictive distribution:

# For test input x*:
# 1. Compute features: h* = g_phi(x*)
# 2. Compute GP predictive:
#    p(f* | data) = integral p(f* | u) q(u) du
#    = Normal(mu*, sigma*^2)
# 3. Predictive with noise (Gaussian likelihood):
#    p(y* | data) = Normal(mu*, sigma*^2 + sigma_noise^2)

# The predictive variance combines two sources of uncertainty:
# - Epistemic uncertainty sigma*^2 (model uncertainty, decreases with data)
# - Aleatoric uncertainty sigma_noise^2 (data noise, irreducible)
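Step 3 adds the noise variance to the latent GP variance, so the two contributions can be reported separately. The sketch below assumes a Gaussian likelihood and uses made-up values for mu*, sigma*^2, and sigma_noise^2; the function name `predictive` is hypothetical.

```python
import numpy as np

def predictive(mu_star, var_f_star, noise_var):
    """Combine the latent GP variance with Gaussian observation noise."""
    total_var = var_f_star + noise_var
    half_width = 1.96 * np.sqrt(total_var)   # approximate 95% interval for a Gaussian
    return total_var, mu_star - half_width, mu_star + half_width

# Illustrative values (assumed): one test point near the training data, one far from it.
mu_star = np.array([0.2, 0.2])
var_f_star = np.array([0.01, 0.90])   # latent variance sigma*^2
noise_var = 0.05                      # observation noise sigma_noise^2

total_var, lower, upper = predictive(mu_star, var_f_star, noise_var)
latent_share = var_f_star / total_var   # fraction of predictive variance from the GP
```

The noise term is the same for both points, so the wider interval at the second point comes entirely from the latent variance, which is the part that can shrink as more data is observed nearby.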

Related Pages

Implementation:Pyro_ppl_Pyro_SV_DKL
