Principle:Zai org CogVideo Diagonal Gaussian Distribution
| Knowledge Sources | |
|---|---|
| Domains | Variational_Inference, Generative_Models, Probability_Theory |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A diagonal Gaussian distribution is a multivariate normal distribution whose covariance matrix is diagonal, so its dimensions are mutually independent. It is the standard variational posterior in variational autoencoders.
Description
In variational autoencoders (VAEs), the encoder maps an input to the parameters of a probability distribution in a latent space, rather than to a single deterministic point. The diagonal Gaussian is the most common choice for this distribution because it provides a tractable density with closed-form KL divergence against a standard normal prior, while being flexible enough to capture meaningful latent structure.
The "diagonal" constraint means each latent dimension has its own independent mean and variance, but there are no correlations between dimensions. This reduces the number of parameters from O(d^2) for a full covariance matrix to O(d) for the diagonal case, making it computationally efficient while still enabling rich latent representations.
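The parameter savings can be made concrete with a short count. The sketch below is illustrative only: the helper name `gaussian_param_counts` is hypothetical, and it counts the free entries of a symmetric covariance matrix, d(d+1)/2, which is still O(d^2).

```python
def gaussian_param_counts(d):
    """Return (full, diagonal) parameter counts for a d-dimensional Gaussian.

    Hypothetical helper for illustration; not part of any library.
    """
    mean_params = d
    # A symmetric d x d covariance matrix has d*(d+1)/2 free entries.
    full_cov_params = d * (d + 1) // 2
    # A diagonal covariance only needs one variance per dimension.
    diag_cov_params = d
    return mean_params + full_cov_params, mean_params + diag_cov_params


full, diag = gaussian_param_counts(16)
print(full, diag)  # 152 32
```

For a 16-dimensional latent, the encoder would need 152 outputs per input for a full covariance but only 32 for the diagonal case, and the gap widens quadratically with d.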
Usage
Use the diagonal Gaussian distribution as the variational posterior in any VAE-based architecture. It is the standard choice when the encoder needs to output a distribution that supports the reparameterization trick for gradient-based training and provides a closed-form KL divergence term for the evidence lower bound (ELBO) objective.
Theoretical Basis
Parameterization
The encoder outputs a tensor that is split into two halves: the mean vector mu and the log-variance vector log(sigma^2). Predicting the log-variance rather than the variance lets the network output any real value while guaranteeing a strictly positive variance, and it is numerically more stable. The standard deviation is recovered as:
sigma = exp(0.5 * log_var)
In practice, log-variance is clamped to a safe range (e.g., [-30, 20]) to prevent numerical overflow or underflow.
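The parameterization steps can be sketched in plain Python as follows. This is a minimal illustration, not the CogVideo implementation: the function name `split_and_clamp`, the flat-list layout (means first, log-variances second), and the default clamp bounds are assumptions taken from the example range above.

```python
import math


def split_and_clamp(params, min_logvar=-30.0, max_logvar=20.0):
    """Split a flat encoder output into (mean, log_var, sigma).

    Assumed layout (for illustration): the first half of `params`
    holds the means, the second half holds the log-variances.
    """
    if len(params) % 2 != 0:
        raise ValueError("expected an even number of parameters")
    d = len(params) // 2
    mean = list(params[:d])
    # Clamp the log-variance so exp() neither overflows nor underflows.
    log_var = [min(max(v, min_logvar), max_logvar) for v in params[d:]]
    # sigma = exp(0.5 * log_var) is strictly positive for any real input.
    sigma = [math.exp(0.5 * v) for v in log_var]
    return mean, log_var, sigma


# The raw log-variance 100.0 is clamped to 20.0 before exponentiation.
mean, log_var, sigma = split_and_clamp([0.5, -1.0, 0.0, 100.0])
```

In a tensor framework the same steps would typically be a split along the channel dimension followed by an element-wise clamp and exponential.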
Reparameterization Trick
To enable gradient-based optimization through the stochastic sampling step, the reparameterization trick expresses a sample as a deterministic function of the distribution parameters and an auxiliary noise variable:
z = mu + sigma * epsilon, where epsilon ~ N(0, I)
This separates the stochasticity (in epsilon) from the learnable parameters (mu and sigma), allowing gradients to flow through the sampling operation via standard backpropagation.
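A minimal sketch of the sampling step, using only the Python standard library. The function name `sample_diag_gaussian` and the `rng` parameter are hypothetical. Plain Python does not compute gradients, so this only shows the structure of the sample: in an autodiff framework, the same expression would let gradients reach mu and sigma because epsilon is drawn independently of them.

```python
import math
import random


def sample_diag_gaussian(mean, log_var, rng=random):
    """Draw z = mu + sigma * epsilon with epsilon ~ N(0, I)."""
    z = []
    for mu, lv in zip(mean, log_var):
        sigma = math.exp(0.5 * lv)
        eps = rng.gauss(0.0, 1.0)  # noise does not depend on mu or sigma
        z.append(mu + sigma * eps)
    return z


# Empirical check: samples from N(2, 0.25) should have mean near 2
# and variance near 0.25.
rng = random.Random(0)
samples = [sample_diag_gaussian([2.0], [math.log(0.25)], rng)[0]
           for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

For a fixed epsilon, z is an affine function of mu and sigma, which is what makes the sampling step differentiable with respect to the distribution parameters.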
KL Divergence
The KL divergence between the diagonal Gaussian posterior q(z|x) = N(mu, diag(sigma^2)) and a standard normal prior p(z) = N(0, I) has the closed form:
KL(q || p) = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
This KL term acts as a regularizer in the VAE objective: it penalizes posteriors that drift far from the prior, which discourages the encoder from shrinking sigma toward zero and collapsing each latent code to a near-deterministic point.
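The closed form against the standard normal prior can be checked numerically with a short sketch. The function name `kl_to_standard_normal` is hypothetical, and the inputs are assumed to be plain lists of per-dimension means and log-variances.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return -0.5 * sum(
        1.0 + lv - mu * mu - math.exp(lv)
        for mu, lv in zip(mean, log_var)
    )


# The divergence vanishes when the posterior equals the prior ...
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and shifting one mean by 1 adds 0.5 * 1^2 = 0.5.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The sum runs over latent dimensions because independence across dimensions makes the total divergence the sum of the per-dimension divergences.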
The KL divergence between two arbitrary diagonal Gaussians q_1 = N(mu_1, diag(sigma_1^2)) and q_2 = N(mu_2, diag(sigma_2^2)) is:
KL(q_1 || q_2) = 0.5 * sum((mu_1 - mu_2)^2 / sigma_2^2 + sigma_1^2 / sigma_2^2 - 1 - log(sigma_1^2) + log(sigma_2^2))
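As a sanity check on the general formula above, here is a small sketch that works with log-variances, as in the Parameterization section. The function name `kl_diag_gaussians` is hypothetical. Divisions by sigma_2^2 are written as multiplications by exp(-log_var2).

```python
import math


def kl_diag_gaussians(mean1, log_var1, mean2, log_var2):
    """KL(q_1 || q_2) for two diagonal Gaussians given as lists."""
    total = 0.0
    for m1, lv1, m2, lv2 in zip(mean1, log_var1, mean2, log_var2):
        total += (
            (m1 - m2) ** 2 * math.exp(-lv2)  # (mu_1 - mu_2)^2 / sigma_2^2
            + math.exp(lv1 - lv2)            # sigma_1^2 / sigma_2^2
            - 1.0
            - lv1                            # -log(sigma_1^2)
            + lv2                            # +log(sigma_2^2)
        )
    return 0.5 * total


# With q_2 = N(0, I), this reduces to the standard-normal closed form.
kl_vs_prior = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])

# KL divergence is not symmetric: swapping the arguments changes the value.
kl_narrow_to_wide = kl_diag_gaussians([0.0], [0.0], [0.0], [math.log(4.0)])
kl_wide_to_narrow = kl_diag_gaussians([0.0], [math.log(4.0)], [0.0], [0.0])
```

Setting mu_2 = 0 and sigma_2 = 1 recovers the standard-normal case, which is a convenient consistency check between the two formulas.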
Negative Log-Likelihood
The negative log-likelihood of a sample x under the diagonal Gaussian N(mu, diag(sigma^2)) is:
NLL = 0.5 * sum(log(2*pi) + log(sigma^2) + (x - mu)^2 / sigma^2)
This is used for reconstruction loss computation when the decoder output is modeled as a Gaussian.
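The negative log-likelihood formula can be sketched in the same style. The function name `gaussian_nll` is hypothetical, and `x`, `mean`, and `log_var` are assumed to be equal-length lists.

```python
import math


def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under N(mean, diag(exp(log_var)))."""
    return 0.5 * sum(
        math.log(2.0 * math.pi) + lv + (xi - mu) ** 2 * math.exp(-lv)
        for xi, mu, lv in zip(x, mean, log_var)
    )


# At the mean of a unit-variance Gaussian, only the constant term remains.
nll_at_mean = gaussian_nll([0.0], [0.0], [0.0])
# One standard deviation away adds 0.5 * 1^2 / 1 = 0.5.
nll_one_sigma = gaussian_nll([1.0], [0.0], [0.0])
```

When the variance is held fixed, the only term that depends on the prediction is the squared error, so minimizing this loss is equivalent to minimizing a scaled mean-squared error.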