Principle:Pyro ppl Pyro Sparse Deep Exponential Family

Knowledge Sources	Deep Exponential Families, Sparse Gamma Deep Exponential Family, Black-Box Variational Inference
Domains	Deep Generative Models, Exponential Families, Sparse Modeling
Last Updated	2026-02-09 09:00 GMT

Overview

The Sparse Gamma Deep Exponential Family is a hierarchical generative model with multiple layers of Gamma-distributed latent variables connected by sparse non-negative weight matrices, enabling interpretable, non-negative feature discovery.

Description

Deep Exponential Families (DEFs) are a class of deep generative models where each layer consists of latent variables from an exponential family distribution, and the natural parameters of each layer are a function of the layer above. DEFs generalize both deep neural networks (by using probabilistic layers) and classical factor models (by adding depth).

The Sparse Gamma DEF specifically uses:

Gamma-distributed latent variables: Each layer contains non-negative latent variables drawn from Gamma distributions. The non-negativity is important for interpretability (e.g., non-negative factor loadings, topic intensities).

Sparse weight matrices: The connections between layers use sparse non-negative weight matrices, where sparsity is induced through appropriate priors (e.g., sparse Gamma or spike-and-slab priors on the weights). Sparsity improves interpretability and prevents overfitting.

Multiple layers: Deeper layers capture increasingly abstract features. In a text model over word counts, for example, layer 1 might capture groups of co-occurring words (topics), layer 2 groupings of related topics, and layer 3 high-level themes.

The generative process proceeds top-down:

Sample the top-layer latent variables from a prior.
For each subsequent layer, the Gamma parameters of the latent variables are set by a linear function (through the weight matrix) of the layer above.
The bottom layer connects to the observed data through a likelihood.

EasyGuide, a base class in pyro.contrib.easyguide, simplifies the construction of guides (variational families) for such hierarchical models by handling the bookkeeping of matching guide sites to model sites and managing shape constraints.

Usage

Use the Sparse Gamma DEF when:

Discovering non-negative, interpretable latent features from count data.
Building deep topic models where multiple levels of abstraction are needed.
Modeling data with inherent non-negativity (count data, images, spectrograms).
Seeking a deep generative model whose latent structure is easier to interpret than that of a VAE.
Performing multi-level feature extraction with sparsity for interpretability.

Theoretical Basis

Gamma DEF generative model:

# L layers, with dimensions d_0 (data), d_1, ..., d_L

# Top layer prior:
# z_L ~ Gamma(alpha_L, beta_L)  (d_L-dimensional)

# Intermediate layers (l = L-1, ..., 1):
# mean_l = W_l @ z_{l+1}  (linear function of layer above)
# z_l ~ Gamma(alpha_l, alpha_l / mean_l)  (shape alpha_l, rate alpha_l / mean_l, so E[z_l] = mean_l)

# Observation model:
# x ~ Poisson(W_0 @ z_1)  (for count data)
# or: x ~ Normal(W_0 @ z_1, sigma^2)  (for continuous data)
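The top-down generative process can be sketched as ancestral sampling in NumPy. This is a minimal illustration, not the Pyro implementation: the function name `sample_gamma_def`, the layer sizes, and the hyperparameter values are assumptions made for the example. It also assumes the mean parameterization, in which W_l @ z_{l+1} sets the mean of z_l (rate = alpha / mean), and a single shape parameter shared by all latent layers.

```python
import numpy as np


def sample_gamma_def(weights, alpha_z, beta_top, rng):
    """Ancestral sampling for one data point (illustrative sketch).

    weights:  [W_0, ..., W_{L-1}], with W_l of shape (d_l, d_{l+1}).
    alpha_z:  Gamma shape shared by all latent layers (an assumption).
    beta_top: rate of the top-layer Gamma prior.
    """
    d_top = weights[-1].shape[1]
    # Top layer: z_L ~ Gamma(shape, rate). NumPy expects scale = 1 / rate.
    z = rng.gamma(shape=alpha_z, scale=1.0 / beta_top, size=d_top)
    latents = [z]
    # Intermediate layers, top-down: the mean of z_l is W_l @ z_{l+1}.
    for W in reversed(weights[1:]):
        mean = W @ z
        z = rng.gamma(shape=alpha_z, scale=mean / alpha_z)  # rate = alpha_z / mean
        latents.append(z)
    # Observation model for count data: x ~ Poisson(W_0 @ z_1).
    x = rng.poisson(weights[0] @ z)
    return x, latents[::-1]  # latents ordered z_1, ..., z_L


rng = np.random.default_rng(0)
dims = [20, 8, 4]  # d_0 (data), d_1, d_2 (illustrative sizes)
# Weights drawn from a small-shape Gamma prior (illustrative hyperparameters).
weights = [
    rng.gamma(shape=0.1, scale=1.0 / 0.3, size=(dims[l], dims[l + 1]))
    for l in range(len(dims) - 1)
]
x, latents = sample_gamma_def(weights, alpha_z=0.1, beta_top=0.1, rng=rng)
print(x.shape, [z.shape for z in latents])
```

Note that NumPy parameterizes the Gamma by shape and scale, whereas the pseudocode above uses shape and rate, hence the reciprocals.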

Sparsity through priors:

# Sparse weight prior:
# W_l[i,j] ~ Gamma(a_w, b_w) * Bernoulli(pi_w)
# a_w < 1 (e.g., 0.1): the density is unbounded at zero, so most mass sits near zero
# pi_w: inclusion probability (spike-and-slab variant)

# Alternative: use Gamma(a_w, b_w) with small a_w
# This concentrates mass near zero while allowing large values
# Effectively achieves soft sparsity without explicit indicators

# Benefits of sparsity:
# - Each latent variable connects to few features (interpretability)
# - Reduces overfitting (implicit regularization)
# - Enables efficient computation (sparse matrix operations) when weights are exactly zero or thresholded
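The effect of the shape parameter on sparsity can be checked empirically. The sketch below compares the fraction of near-zero draws under a small-shape and a larger-shape Gamma prior, and builds a spike-and-slab variant with an explicit Bernoulli mask. The threshold, sample size, and hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b_w = 1.0  # Gamma rate; NumPy expects scale = 1 / rate

# Soft sparsity: compare a small shape (a_w = 0.1) with a larger one (a_w = 2.0).
small_shape = rng.gamma(shape=0.1, scale=1.0 / b_w, size=n)
large_shape = rng.gamma(shape=2.0, scale=1.0 / b_w, size=n)

threshold = 1e-2
frac_small = np.mean(small_shape < threshold)
frac_large = np.mean(large_shape < threshold)
print(f"fraction below {threshold} with a_w=0.1: {frac_small:.3f}")
print(f"fraction below {threshold} with a_w=2.0: {frac_large:.5f}")

# Spike-and-slab variant: a Bernoulli(pi_w) mask produces exact zeros.
pi_w = 0.25
mask = rng.random(n) < pi_w
spike_slab = mask * rng.gamma(shape=2.0, scale=1.0 / b_w, size=n)
frac_zero = np.mean(spike_slab == 0.0)
print(f"fraction exactly zero with pi_w={pi_w}: {frac_zero:.3f}")
```

Analytically, the probability of a draw below 0.01 is about 0.66 for shape 0.1 and about 5e-5 for shape 2 (both with rate 1), which is why the small-shape prior behaves like a soft sparsity prior. Only the spike-and-slab variant yields exact zeros.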

Variational inference with EasyGuide:

# Guide structure mirrors the model:
# For each latent variable z_l:
#   q(z_l) = Gamma(alpha_q_l, beta_q_l)
#   where alpha_q_l, beta_q_l are variational parameters

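# ELBO (weights W_l treated as fixed here):
# L = E_q[log p(x | z_1, W_0)]                        # reconstruction
#   + sum_{l=1}^{L-1} E_q[log p(z_l | z_{l+1}, W_l)]  # layer conditionals
#   + E_q[log p(z_L)]                                  # top-layer prior
#   - sum_{l=1}^{L} E_q[log q(z_l)]                    # entropy of the guide
# With priors on the weights, add E_q[log p(W_l)] - E_q[log q(W_l)] for each l.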

# EasyGuide simplifies:
# - Automatic matching of guide sites to model sites
# - Handles plate dimensions correctly
# - Provides amortization options (shared parameters across plates)
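To make the ELBO terms concrete, here is a Monte Carlo estimate for the simplest possible case: a single scalar latent z ~ Gamma(a, b) with x ~ Poisson(z) and a Gamma guide. This is a sketch under strong simplifying assumptions (one layer, no weights, no Pyro or EasyGuide); the helper names are hypothetical. Because this model is conjugate, the exact posterior is Gamma(a + x, b + 1), which gives a way to sanity-check the estimator: at the exact posterior the ELBO equals the log evidence.

```python
import math

import numpy as np


def gamma_logpdf(z, shape, rate):
    """Log density of Gamma(shape, rate) evaluated at z > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * np.log(z) - rate * z)


def poisson_logpmf(x, lam):
    """Log probability of count x under Poisson(lam)."""
    return x * np.log(lam) - lam - math.lgamma(x + 1.0)


def elbo_estimate(x, a, b, a_q, b_q, rng, num_samples=2000):
    """Monte Carlo ELBO: mean of log p(x, z) - log q(z) over z ~ q."""
    z = rng.gamma(shape=a_q, scale=1.0 / b_q, size=num_samples)
    log_joint = poisson_logpmf(x, z) + gamma_logpdf(z, a, b)  # likelihood + prior
    log_q = gamma_logpdf(z, a_q, b_q)                         # guide density
    return float(np.mean(log_joint - log_q))


rng = np.random.default_rng(0)
x, a, b = 3, 2.0, 1.0

# Guide equal to the exact posterior Gamma(a + x, b + 1): the bound is tight.
elbo_exact = elbo_estimate(x, a, b, a_q=a + x, b_q=b + 1.0, rng=rng)
# A mismatched guide gives a strictly smaller ELBO in expectation.
elbo_loose = elbo_estimate(x, a, b, a_q=1.0, b_q=1.0, rng=rng)

# Closed-form log evidence of the Gamma-Poisson (negative binomial) model.
log_evidence = (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1.0)
                + a * math.log(b) - (a + x) * math.log(b + 1.0))
print(elbo_exact, elbo_loose, log_evidence)
```

In the multi-layer model the same pattern applies, with one log-density term per layer conditional added to the joint and one guide term per latent subtracted. Stochastic variational inference then maximizes this quantity with respect to the guide parameters rather than merely evaluating it.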

Interpretation of learned structure:

# After training:
# W_0: maps first-layer features to data space
#       (analogous to topic-word distributions in LDA)
# W_1: maps second-layer features to first-layer features
#       (higher-level groupings of topics)
# z_l^{(i)}: latent representation of data point i at level l

# Sparsity means:
# - Each column of W_0 is a sparse data-space pattern (interpretable feature)
# - Each column of W_1 groups a few first-level features into a theme
# - Hierarchy provides multi-resolution understanding of the data
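One common way to read a learned weight matrix is to list the largest entries in each column. The sketch below does this for a toy W_0; the matrix values, the vocabulary, and the helper name `top_features` are made up for illustration and do not come from a trained model.

```python
import numpy as np


def top_features(W, k):
    """For each column of W, return the row indices of its k largest entries."""
    order = np.argsort(-W, axis=0)  # sort each column by decreasing weight
    return order[:k].T              # shape: (num_columns, k)


# Toy W_0: 6 data dimensions (rows) by 2 first-layer features (columns).
W0 = np.array([
    [2.0, 0.0],
    [1.5, 0.1],
    [0.0, 3.0],
    [0.1, 0.0],
    [0.0, 1.2],
    [0.7, 0.0],
])
vocab = ["gene", "cell", "market", "protein", "stock", "dna"]  # hypothetical labels

top = top_features(W0, k=3)
for j, idx in enumerate(top):
    print(f"feature {j}:", [vocab[i] for i in idx])
```

The same function applied to W_1 would list, for each second-layer feature, the first-layer features it activates most strongly.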

Related Pages

Implementation:Pyro_ppl_Pyro_SparseGammaDEF
