Principle:Pyro ppl Pyro Topic Modeling
| Knowledge Sources | |
|---|---|
| Domains | Topic Modeling, Natural Language Processing, Variational Inference |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Amortized Latent Dirichlet Allocation combines the classical LDA generative model for document collections with neural network-based amortized inference, enabling scalable topic discovery without per-document variational optimization.
Description
Latent Dirichlet Allocation (LDA) is the foundational probabilistic topic model. It models a collection of documents as mixtures of latent "topics," where each topic is a distribution over words. The generative process is:
- For each topic k, draw a word distribution: beta_k ~ Dirichlet(eta).
- For each document d:
  - Draw a topic mixture: theta_d ~ Dirichlet(alpha).
  - For each word position n in document d:
    - Draw a topic assignment: z_{dn} ~ Categorical(theta_d).
    - Draw a word: w_{dn} ~ Categorical(beta_{z_{dn}}).
The key latent variables are:
- Topics (beta): Each topic is a probability distribution over the vocabulary, capturing a coherent theme.
- Topic proportions (theta): Each document has a mixture over topics, representing what the document is "about."
- Topic assignments (z): Each word is assigned to a specific topic.
Traditional inference for LDA uses mean-field variational inference, which fits a separate set of variational parameters to every document by iterative optimization. The cost therefore grows with the number of documents, and each new document needs its own optimization run. Amortized LDA replaces this with a neural network encoder that maps a document's bag-of-words representation directly to approximate posterior parameters, enabling:
- Fast inference: A single forward pass through the encoder, instead of iterative optimization per document.
- Scalability: Mini-batch training with stochastic gradient descent.
- Flexibility: The encoder can be any differentiable architecture (MLP, transformer).
This is a key example of how deep learning and probabilistic programming complement each other: the generative model provides interpretability (topics are meaningful), while the neural encoder provides scalability.
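The encoder described above can be sketched as a small MLP. This is a minimal illustration in plain Python, not the implementation used by any particular library: the layer sizes, the softplus hidden activation, the random initialization, and the names `make_encoder` and `encode` are all assumptions made for the example. It outputs a mean and a positive scale per topic, matching the logistic-normal form used in the Theoretical Basis section.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def linear(weights, bias, x):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng, scale=0.1):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_encoder(vocab_size, hidden_size, num_topics, seed=0):
    """Hypothetical encoder weights phi: one hidden layer, two output heads."""
    rng = random.Random(seed)
    return {
        "hidden": init_layer(hidden_size, vocab_size, rng),
        "mu": init_layer(num_topics, hidden_size, rng),
        "log_sigma": init_layer(num_topics, hidden_size, rng),
    }

def encode(params, bow):
    """One forward pass: V-dimensional count vector -> (mu, sigma), each of length K."""
    h = [softplus(a) for a in linear(*params["hidden"], bow)]
    mu = linear(*params["mu"], h)
    # exp keeps the scale strictly positive
    sigma = [math.exp(s) for s in linear(*params["log_sigma"], h)]
    return mu, sigma

encoder = make_encoder(vocab_size=6, hidden_size=4, num_topics=3)
mu, sigma = encode(encoder, bow=[2, 0, 1, 0, 0, 3])
```

Inference for a new document is just another call to `encode`; no per-document optimization loop is involved. In practice the weights would be trained jointly with the topics rather than left at their random initial values.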
Usage
Use amortized LDA when:
- Discovering latent topics in large document collections.
- Running fast inference on new documents (amortization avoids per-document optimization).
- Building interpretable text representations where topics have semantic meaning.
- Combining topic modeling with downstream tasks (classification, retrieval).
- Working with large vocabularies and document collections that require scalable inference.
Theoretical Basis
LDA generative model:
# Hyperparameters: alpha (topic prior), eta (word prior), K (num topics)
# For k = 1, ..., K:
#   beta_k ~ Dirichlet(eta)                  # topic-word distributions
# For d = 1, ..., D:
#   theta_d ~ Dirichlet(alpha)               # document-topic proportions
#   For n = 1, ..., N_d:
#     z_{dn} ~ Categorical(theta_d)          # topic assignment
#     w_{dn} ~ Categorical(beta_{z_{dn}})    # word
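The generative process above can be simulated directly. The following is a minimal sketch using only the Python standard library. The function names, the symmetric priors, the default hyperparameter values, and the fixed document length (every N_d equal to `doc_length`) are simplifying assumptions for illustration.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Sample topics, topic proportions, and word ids from the LDA model."""
    rng = random.Random(seed)
    # beta_k ~ Dirichlet(eta): one word distribution per topic
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions for this document
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # z_dn ~ Categorical(theta_d), then w_dn ~ Categorical(beta_{z_dn})
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=beta[z])[0]
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs

beta, thetas, docs = generate_corpus(
    num_docs=3, doc_length=20, num_topics=2, vocab_size=8
)
```

Each entry of `docs` is a list of word ids. Inference runs this process in reverse: given only `docs`, recover plausible values of `beta` and `thetas`.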
Collapsed representation (integrating out z):
# Marginalizing over topic assignments z:
#   p(w_d | theta_d, beta) = product_n sum_k theta_{dk} * beta_{k, w_{dn}}
# In bag-of-words form:
#   p(w_d | theta_d, beta) = product_v (sum_k theta_{dk} * beta_{kv})^{count(v, d)}
# Log-likelihood per document:
#   log p(w_d | theta_d, beta) = sum_v count(v, d) * log(sum_k theta_{dk} * beta_{kv})
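To make the equivalence concrete, the sketch below evaluates the collapsed log-likelihood both ways, once per token and once from word counts, and the two results agree up to floating-point error. The values of `theta`, `beta`, and `words` are made-up toy numbers, and the function names are illustrative.

```python
import math
from collections import Counter

def log_lik_tokens(words, theta, beta):
    """Per-token form: sum_n log sum_k theta_k * beta[k][w_n]."""
    return sum(
        math.log(sum(t * b[w] for t, b in zip(theta, beta)))
        for w in words
    )

def log_lik_bow(counts, theta, beta):
    """Bag-of-words form: sum_v count(v) * log sum_k theta_k * beta[k][v]."""
    total = 0.0
    for v, c in counts.items():
        word_prob = sum(t * b[v] for t, b in zip(theta, beta))
        total += c * math.log(word_prob)
    return total

# Toy values for illustration: K = 2 topics, V = 3 vocabulary words.
theta = [0.7, 0.3]
beta = [[0.5, 0.4, 0.1],
        [0.1, 0.2, 0.7]]
words = [0, 0, 2, 1, 0]   # word ids in document order
counts = Counter(words)    # word id -> count(v, d)

ll_tokens = log_lik_tokens(words, theta, beta)
ll_bow = log_lik_bow(counts, theta, beta)
```

The bag-of-words form only touches the distinct words in a document, which is why the collapsed model can work from count vectors and never needs the word order or the per-word assignments z.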
Amortized variational inference:
# Standard VI: for each document d, optimize q(theta_d | lambda_d)
#   lambda_d = argmax_{lambda} ELBO_d(lambda) -- per-document optimization
# Amortized VI: learn an encoder network
#   lambda_d = encoder(bow_d; phi) -- single forward pass
# Encoder: maps bag-of-words vector to variational parameters lambda_d
#   (lambda_d = (mu_d, sigma_d) in the logistic-normal form below)
#   bow_d: V-dimensional count vector
#   phi: encoder neural network weights
# ELBO:
#   L = sum_d [E_{q(theta_d)}[log p(w_d | theta_d, beta)] - KL(q(theta_d) || p(theta_d))]
# Reparameterization for Dirichlet:
#   Approximate the Dirichlet with a logistic-normal
#   (for example via a Laplace approximation in the softmax basis):
#     theta_d = softmax(mu_d + sigma_d * epsilon), epsilon ~ Normal(0, I)
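The reparameterized sample and a single-sample ELBO estimate for one document can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the KL term is computed between diagonal Gaussians in the pre-softmax space, which presumes the prior has also been replaced by a Gaussian approximation, and the standard-normal prior used in the toy call is a placeholder rather than the result of a Laplace approximation. All names and numeric values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Map real-valued logits to a point on the probability simplex."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_theta(mu, sigma, rng):
    """theta = softmax(mu + sigma * epsilon), epsilon ~ Normal(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return softmax([m + s * e for m, s, e in zip(mu, sigma, eps)])

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

def elbo_estimate(counts, mu, sigma, beta, prior_mu, prior_sigma, rng):
    """Single-sample Monte Carlo estimate of the ELBO for one document."""
    theta = sample_theta(mu, sigma, rng)
    # Reconstruction term: sum_v count(v) * log sum_k theta_k * beta[k][v]
    recon = sum(
        c * math.log(sum(t * b[v] for t, b in zip(theta, beta)))
        for v, c in enumerate(counts) if c > 0
    )
    return recon - kl_diag_gaussians(mu, sigma, prior_mu, prior_sigma)

# Toy values for illustration: K = 2 topics, V = 3 vocabulary words.
rng = random.Random(0)
beta = [[0.5, 0.4, 0.1],
        [0.1, 0.2, 0.7]]
counts = [3, 1, 1]                      # bag-of-words counts for one document
mu, sigma = [0.2, -0.1], [0.5, 0.8]     # stand-ins for encoder outputs
theta = sample_theta(mu, sigma, rng)
elbo = elbo_estimate(counts, mu, sigma, beta, [0.0, 0.0], [1.0, 1.0], rng)
```

Because the randomness enters only through `eps`, the sampled `theta` is a deterministic, differentiable function of `mu` and `sigma`. An automatic differentiation framework can then pass gradients of the ELBO back to the encoder weights.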
Training procedure:
# Parameters: beta (topics), phi (encoder weights)
# For each mini-batch of documents:
#   1. Encode: lambda_d = encoder(bow_d; phi) for d in batch
#   2. Sample: theta_d ~ q(theta | lambda_d) (reparameterized)
#   3. Reconstruct: p(w_d | theta_d, beta)
#   4. ELBO = reconstruction - KL
#   5. Update beta, phi via gradient ascent on ELBO
# After training:
#   - beta gives K interpretable topics (word distributions)
#   - encoder provides instant topic inference for new documents
#   - No iterative optimization needed at test time
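Once training has produced a topic-word matrix, a common way to inspect the topics is to list the highest-probability words in each row of beta. The sketch below assumes beta is available as a list of K probability vectors over a known vocabulary. The vocabulary and probabilities shown are made-up values for illustration, not results from a trained model, and `top_words` is a hypothetical helper.

```python
def top_words(beta, vocab, n=3):
    """Return the n highest-probability words for each topic."""
    topics = []
    for word_probs in beta:
        ranked = sorted(range(len(vocab)), key=lambda v: word_probs[v], reverse=True)
        topics.append([vocab[v] for v in ranked[:n]])
    return topics

# Made-up example: 2 topics over a 6-word vocabulary.
vocab = ["goal", "match", "team", "vote", "party", "election"]
beta = [
    [0.35, 0.25, 0.30, 0.04, 0.03, 0.03],
    [0.02, 0.03, 0.05, 0.30, 0.25, 0.35],
]

for k, words in enumerate(top_words(beta, vocab)):
    print(f"topic {k}: {', '.join(words)}")
```

Reading the top words of each topic is an informal check on whether the learned topics are coherent.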