Principle:Dotnet Machinelearning Latent Dirichlet Allocation
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, NLP, Bayesian_Inference, Machine_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that discovers latent topics in document collections by modeling each document as a mixture of topics and each topic as a distribution over words.
Description
LDA was introduced by Blei, Ng, and Jordan (2003) as a three-level hierarchical Bayesian model for collections of discrete data, particularly text corpora. The key insight is that documents exhibit multiple topics, and each topic is characterized by a distribution over the vocabulary. LDA provides a principled framework for dimensionality reduction of document representations -- from the high-dimensional space of raw word counts to a low-dimensional space of topic proportions.
Generative process: For a corpus of M documents, each with N_d words:
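- For each topic k = 1, ..., K: draw word distribution $\phi_k \sim \mathrm{Dirichlet}(\beta)$
- For each document d = 1, ..., M:
  - Draw topic proportions $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
  - For each word position n = 1, ..., N_d:
    - Draw topic assignment $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$
    - Draw word $w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})$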
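The generative process can be sketched as a small sampler. This is a minimal illustration using only the Python standard library, assuming symmetric Dirichlet priors and a fixed document length; the function names (`sample_dirichlet`, `generate_corpus`) are hypothetical and are not part of ML.NET.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic: phi_k ~ Dirichlet(beta).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Topic proportions for this document: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)    # topic assignment for this position
            w = sample_categorical(phi[z], rng)   # word id drawn from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_corpus(num_docs=5, doc_length=20, num_topics=3,
                              vocab_size=10, alpha=0.5, beta=0.5)
```

Inference runs this process in reverse: only the word ids in `corpus` are observed, and the topic assignments and distributions must be recovered.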
Key properties:
- Exchangeability: Documents are exchangeable (order does not matter), and words within a document are exchangeable ("bag of words" assumption).
- Conjugacy: Dirichlet priors are conjugate to multinomial likelihoods, enabling efficient collapsed inference where the document-topic proportions $\theta$ and the topic-word distributions $\phi$ are integrated out analytically.
- Hyperparameters: $\alpha$ controls document-topic sparsity (smaller values yield documents focused on fewer topics); $\beta$ controls topic-word sparsity (smaller values yield topics focused on fewer words).
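The effect of the concentration hyperparameters on sparsity can be checked empirically. The sketch below draws from a symmetric Dirichlet at a small and a large concentration and compares the average weight of the largest component; the helper names and the specific values 0.05 and 50 are illustrative assumptions.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(concentration, size=10, trials=500, seed=0):
    """Average weight of the single largest component across many draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample_symmetric_dirichlet(concentration, size, rng))
    return total / trials

sparse = mean_largest_share(0.05)  # small concentration: mass piles onto few components
dense = mean_largest_share(50.0)   # large concentration: mass spreads almost evenly
print(f"largest share at concentration 0.05: {sparse:.2f}")
print(f"largest share at concentration 50:   {dense:.2f}")
```

A largest share close to 1 means a draw is dominated by one component, which is the "focused on fewer topics" behaviour described for small values.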
Usage
LDA is used when you need to:
- Discover latent thematic structure in large text corpora
- Produce low-dimensional document representations for downstream tasks (classification, clustering, retrieval)
- Generate interpretable topic summaries (top words per topic)
- Perform document similarity computation in topic space
In ML.NET, LDA is exposed through the LatentDirichletAllocationEstimator, which wraps the native LdaNative C++ library.
Theoretical Basis
Collapsed Gibbs Sampling
Direct inference of the posterior P(z|w, alpha, beta) is intractable. The ML.NET implementation uses collapsed Gibbs sampling, where the topic-word distributions and document-topic distributions are analytically marginalized out, and sampling is performed only over the discrete topic assignments z.
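The collapsed conditional distribution for assigning topic k to word position (d, n) is:
$$P(z_{d,n} = k \mid z_{-(d,n)}, w, \alpha, \beta) \propto \left(n_{dk}^{-(d,n)} + \alpha\right) \cdot \frac{n_{kw}^{-(d,n)} + \beta}{n_{k}^{-(d,n)} + V\beta}$$
Where:
- $n_{kw}^{-(d,n)}$ = count of word w assigned to topic k, excluding current position
- $n_{k}^{-(d,n)}$ = total count of all words assigned to topic k, excluding current position
- $n_{dk}^{-(d,n)}$ = count of words in document d assigned to topic k, excluding current position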
- V = vocabulary size
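As a worked illustration, the conditional can be evaluated directly from the three count arrays. This sketch assumes a symmetric alpha and counts that already exclude the current position; `collapsed_conditional` and the toy counts are hypothetical.

```python
def collapsed_conditional(word, n_kw, n_k, n_dk, alpha, beta, vocab_size):
    """Normalized P(z = k | rest) for one token, from counts that exclude that token."""
    weights = [
        # (doc-topic count + alpha) * (word-topic count + beta) / (topic total + V * beta)
        (n_dk[k] + alpha) * (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta)
        for k in range(len(n_k))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts: 2 topics, vocabulary of 3 words.
n_kw = [[3, 0, 1], [0, 2, 2]]   # n_kw[k][w]
n_k = [4, 4]                    # row sums of n_kw
n_dk = [2, 1]                   # topic counts inside the current document
probs = collapsed_conditional(word=0, n_kw=n_kw, n_k=n_k, n_dk=n_dk,
                              alpha=0.5, beta=0.1, vocab_size=3)
```

Word 0 has been seen only under topic 0 and the document already leans toward topic 0, so nearly all of the probability mass goes to that topic. Computing this vector costs O(K) per token, which is the cost the Metropolis-Hastings proposals are meant to avoid.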
Metropolis-Hastings Acceleration
The ML.NET implementation uses the LightLDA approach (Yuan et al., 2015), which accelerates the standard Gibbs sampler using two Metropolis-Hastings proposal distributions:
Proposal 1 (Word proposal): Proposes a topic from the word-topic distribution:
$$q_w(k) \propto \frac{n_{kw} + \beta}{n_k + V\beta}$$
This is efficiently sampled using a pre-built alias table for each word.
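An alias table allows constant-time draws from a fixed discrete distribution after a linear-time build. The sketch below uses Vose's construction of the alias method; it illustrates the general technique and is not a transcription of the ML.NET alias code, and the function names are hypothetical.

```python
import random

def build_alias_table(weights):
    """Build (prob, alias) arrays for O(1) sampling from unnormalized weights."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # mean of scaled weights is 1
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]          # bucket s keeps its own mass...
        alias[s] = l                 # ...and is topped up by entry l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng):
    """Pick a bucket uniformly, then choose between its own entry and its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias_table([1.0, 2.0, 7.0])
rng = random.Random(0)
draws = [alias_sample(prob, alias, rng) for _ in range(10)]
```

Because the table is built once and then reused, the counts it encodes go stale as sampling proceeds; the acceptance step corrects for that mismatch.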
Proposal 2 (Document proposal): Proposes a topic from the document-topic distribution:
$$q_d(k) \propto n_{dk} + \alpha$$
This is sampled by either picking the topic of a random token in the document (with probability proportional to n_d_sum, the number of tokens in the document) or drawing a topic uniformly (with probability proportional to alpha_sum).
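The two-branch draw for the document proposal can be sketched as follows. It assumes a symmetric alpha, so that alpha_sum equals K times alpha; `sample_doc_proposal` is a hypothetical name.

```python
import random

def sample_doc_proposal(doc_topics, num_topics, alpha, rng):
    """Draw a topic with probability proportional to n_dk + alpha, without building n_dk."""
    n_d = len(doc_topics)              # number of tokens in the document
    alpha_sum = num_topics * alpha     # total prior mass
    if rng.random() * (n_d + alpha_sum) < n_d:
        # Count part: the topic of a uniformly chosen token is k with probability n_dk / n_d.
        return doc_topics[rng.randrange(n_d)]
    # Prior part: every topic is equally likely.
    return rng.randrange(num_topics)

rng = random.Random(0)
doc_topics = [0, 0, 0, 1]              # current assignments of the document's tokens
samples = [sample_doc_proposal(doc_topics, 2, 0.5, rng) for _ in range(20000)]
share_topic0 = samples.count(0) / len(samples)
```

With these toy values the target weights are 3.5 and 1.5, so `share_topic0` should come out near 0.7. The draw takes constant time regardless of the number of topics.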
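Each proposal is accepted or rejected with the standard MH ratio:
$$\pi = \min\left(1, \frac{p(t)\, q(s)}{p(s)\, q(t)}\right)$$
where s is the current topic, t is the proposed topic, p is the collapsed conditional distribution, and q is the proposal distribution that generated t.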
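The acceptance probability needs only unnormalized target and proposal weights at the two topics involved. In this sketch `mh_accept_prob` is a hypothetical helper, and the toy counts are treated as already excluding the current token.

```python
def mh_accept_prob(s, t, target, proposal):
    """min(1, p(t) q(s) / (p(s) q(t))) for callables giving unnormalized weights."""
    if s == t:
        return 1.0
    return min(1.0, (target(t) * proposal(s)) / (target(s) * proposal(t)))

# Toy counts: 2 topics, vocabulary of 3 words, current word id 0.
n_kw = [[3, 0, 1], [0, 2, 2]]
n_k = [4, 4]
n_dk = [2, 1]
alpha, beta, V, w = 0.5, 0.1, 3, 0

def word_term(k):
    return (n_kw[k][w] + beta) / (n_k[k] + V * beta)

def target(k):
    return (n_dk[k] + alpha) * word_term(k)   # collapsed conditional, unnormalized

def doc_proposal(k):
    return n_dk[k] + alpha

pi_word = mh_accept_prob(0, 1, target, word_term)     # move 0 -> 1 via word proposal
pi_doc = mh_accept_prob(0, 1, target, doc_proposal)   # move 0 -> 1 via document proposal
```

When the proposal is computed from the same counts as the target, the shared factor cancels: the word-proposal ratio reduces to the ratio of document terms and the document-proposal ratio to the ratio of word terms. In a running sampler the alias tables lag behind the counts, so the full ratio is evaluated.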
Pseudocode
    Algorithm: LightLDA Collapsed Gibbs Sampling with MH
    Input: corpus W, num_topics K, alpha, beta, num_iterations T, mh_steps M
    Output: topic assignments Z

    1. Initialize Z randomly
    2. Build word-topic count matrix n_kw[K][V] and summary n_k[K]
    3. For iteration = 1 to T:
       a. For each word w, build alias table for q_w(k) = (n_kw + beta)/(n_k + V*beta)
       b. Build global alias table for q_beta(k) = beta/(n_k + V*beta)
       c. For each document d:
          i. Compute doc-topic counts n_dk from current Z
          ii. For each token position (d,n) with word w and current topic s:
              For m = 1 to M:
                  // Word proposal
                  Sample t from alias_table[w]
                  Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
                  Accept t with probability pi (s <- t)
                  // Document proposal
                  Sample t from doc-topic distribution
                  Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
                  Accept t with probability pi (s <- t)
          iii. Update n_kw and n_k with topic changes
    4. Return Z
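The pseudocode can be turned into a small single-threaded sketch. To keep it short it makes several simplifying assumptions that differ from LightLDA proper: proposals are drawn with `random.choices` rather than alias tables, the target weights are computed in full for each token (O(K) work), and counts are updated immediately instead of being deferred. The function name is hypothetical.

```python
import random

def light_lda_sketch(corpus, num_topics, vocab_size, alpha, beta,
                     num_iterations, mh_steps, seed=0):
    rng = random.Random(seed)
    topics = range(num_topics)
    # Step 1: random initial assignments.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in corpus]
    # Step 2: word-topic counts n_kw and per-topic totals n_k.
    n_kw = [[0] * vocab_size for _ in topics]
    n_k = [0] * num_topics
    for doc, doc_z in zip(corpus, z):
        for w, k in zip(doc, doc_z):
            n_kw[k][w] += 1
            n_k[k] += 1

    def word_term(k, w):
        return (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)

    for _ in range(num_iterations):
        # Step 3a: per-word proposal weights, frozen (and so slightly stale) for this sweep.
        q_word = [[word_term(k, w) for k in topics] for w in range(vocab_size)]
        for doc, doc_z in zip(corpus, z):
            n_dk = [0] * num_topics
            for k in doc_z:
                n_dk[k] += 1
            for pos, w in enumerate(doc):
                s = doc_z[pos]
                # Exclude the current token from every count.
                n_dk[s] -= 1
                n_kw[s][w] -= 1
                n_k[s] -= 1
                p = [(n_dk[k] + alpha) * word_term(k, w) for k in topics]  # target
                q_doc = [n_dk[k] + alpha for k in topics]                  # doc proposal
                for _ in range(mh_steps):
                    for q in (q_word[w], q_doc):   # word proposal, then document proposal
                        t = rng.choices(topics, weights=q)[0]
                        ratio = (p[t] * q[s]) / (p[s] * q[t])
                        if rng.random() < ratio:   # accept with probability min(1, ratio)
                            s = t
                # Put the token back under its (possibly new) topic.
                n_dk[s] += 1
                n_kw[s][w] += 1
                n_k[s] += 1
                doc_z[pos] = s
    return z, n_kw, n_k

toy_corpus = [[0, 1, 0, 2], [3, 3, 4, 2], [0, 4, 1]]
z, n_kw, n_k = light_lda_sketch(toy_corpus, num_topics=2, vocab_size=5,
                                alpha=0.5, beta=0.1, num_iterations=20, mh_steps=2)
```

The returned counts stay consistent with the assignments throughout, which is a useful invariant to assert when experimenting with variations of the sampler.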
Log-Likelihood
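The complete data log-likelihood is used to monitor convergence. With symmetric priors it takes the standard collapsed form:
$$\log p(w, z \mid \alpha, \beta) = \sum_{d=1}^{M}\left[\log\Gamma(K\alpha) - \log\Gamma(N_d + K\alpha) + \sum_{k=1}^{K}\left(\log\Gamma(n_{dk} + \alpha) - \log\Gamma(\alpha)\right)\right] + \sum_{k=1}^{K}\left[\log\Gamma(V\beta) - \log\Gamma(n_k + V\beta) + \sum_{w=1}^{V}\left(\log\Gamma(n_{kw} + \beta) - \log\Gamma(\beta)\right)\right]$$
Here the counts include every position in the corpus.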
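Assuming symmetric priors, the collapsed joint log p(w, z | alpha, beta) is a sum of log-Gamma terms over the document-topic and topic-word counts, so it can be computed from the count tables alone. The sketch below uses `math.lgamma`; the function name and the argument layout are assumptions made for illustration.

```python
from math import lgamma, log

def complete_log_likelihood(n_dk_all, n_kw, alpha, beta):
    """log p(w, z | alpha, beta) for symmetric Dirichlet priors.

    n_dk_all[d][k]: tokens in document d assigned to topic k.
    n_kw[k][w]:     occurrences of word w assigned to topic k.
    """
    num_topics = len(n_kw)
    vocab_size = len(n_kw[0])
    ll = 0.0
    # Document-topic part: one Dirichlet-multinomial term per document.
    for n_dk in n_dk_all:
        ll += lgamma(num_topics * alpha) - lgamma(sum(n_dk) + num_topics * alpha)
        ll += sum(lgamma(c + alpha) - lgamma(alpha) for c in n_dk)
    # Topic-word part: one Dirichlet-multinomial term per topic.
    for row in n_kw:
        ll += lgamma(vocab_size * beta) - lgamma(sum(row) + vocab_size * beta)
        ll += sum(lgamma(c + beta) - lgamma(beta) for c in row)
    return ll

# One document holding one token of word 0 assigned to topic 0; K = 2, V = 2.
ll = complete_log_likelihood([[1, 0]], [[1, 0], [0, 0]], alpha=1.0, beta=1.0)
```

For this toy case the value can be checked by hand: with uniform priors the topic has probability 1/2 and the word given the topic has probability 1/2, so `ll` equals log(1/4).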
Implementation Details
Multi-threaded Training
The ML.NET implementation parallelizes across documents:
- Documents are partitioned evenly across num_threads worker threads.
- Each thread maintains a local delta buffer for word-topic count changes.
- After processing all documents in an iteration, deltas are applied to the global word-topic table using sharded access (each thread owns a range of words) and a mutex-protected global summary update.
- Thread synchronization uses a SimpleBarrier for phase coordination.
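The delta-buffer pattern described above can be sketched with Python's `threading` module. This is an illustrative analogue only, not the native implementation: a list of `(word, old_topic, new_topic)` changes stands in for the result of sampling each thread's documents, `threading.Barrier` stands in for SimpleBarrier, and all names are hypothetical.

```python
import threading
from collections import Counter

def apply_changes_in_parallel(n_kw, n_k, changes, num_threads):
    """Apply topic changes to shared counts in two barrier-separated phases."""
    vocab_size = len(n_kw[0])
    local_deltas = [Counter() for _ in range(num_threads)]  # one private buffer per thread
    summary_lock = threading.Lock()
    barrier = threading.Barrier(num_threads)

    def owner_of(word):
        # Contiguous word ranges: each thread owns about vocab_size / num_threads words.
        return word * num_threads // vocab_size

    def worker(tid):
        # Phase 1: record this thread's share of changes without touching shared counts.
        for word, old, new in changes[tid::num_threads]:
            local_deltas[tid][(old, word)] -= 1
            local_deltas[tid][(new, word)] += 1
        barrier.wait()  # all buffers are complete and read-only from here on
        # Phase 2: merge every buffer, but only for the words this thread owns.
        summary_delta = [0] * len(n_k)
        for buffer in local_deltas:
            for (topic, word), delta in buffer.items():
                if owner_of(word) == tid:
                    n_kw[topic][word] += delta
                    summary_delta[topic] += delta
        with summary_lock:  # the per-topic summary is shared, so guard it
            for topic, delta in enumerate(summary_delta):
                n_k[topic] += delta

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

n_kw = [[2, 1, 0, 1], [0, 1, 2, 1]]
n_k = [4, 4]
changes = [(0, 0, 1), (2, 1, 0), (3, 0, 1)]
apply_changes_in_parallel(n_kw, n_k, changes, num_threads=2)
```

Because each word column is written by exactly one thread in the merge phase, the word-topic table needs no lock; only the small per-topic summary does.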
Alpha Adaptation
The implementation adapts the alpha_sum hyperparameter:
- Training: If alpha_sum < 10, it is set to 100 (encourages broad topic mixing during learning).
- Inference: If alpha_sum > 10, it is set to 1 (encourages sparse topic assignments for sharper predictions).
- SetAlphaSum: alpha_sum is multiplied by average document length once, reflecting the standard parameterization where per-topic alpha scales with document size.
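The rules above can be written out as a small sketch. The names `adapt_alpha_sum` and `AlphaSumState` are hypothetical, and the order in which the native code applies thresholding and scaling is not stated here, so the two steps are kept separate.

```python
def adapt_alpha_sum(alpha_sum, training):
    """Apply the threshold rules as described: raise small values for training,
    lower large values for inference, otherwise leave alpha_sum unchanged."""
    if training and alpha_sum < 10:
        return 100.0
    if not training and alpha_sum > 10:
        return 1.0
    return alpha_sum

class AlphaSumState:
    """Holds alpha_sum and scales it by the average document length at most once."""

    def __init__(self, alpha_sum):
        self.alpha_sum = alpha_sum
        self._scaled = False

    def scale_by_average_length(self, avg_doc_length):
        if not self._scaled:            # later calls are no-ops
            self.alpha_sum *= avg_doc_length
            self._scaled = True
        return self.alpha_sum

state = AlphaSumState(adapt_alpha_sum(5.0, training=True))
state.scale_by_average_length(20.0)
state.scale_by_average_length(20.0)     # second call leaves the value unchanged
```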
Related Pages
- Implementation:Dotnet_Machinelearning_LdaEngine
- Implementation:Dotnet_Machinelearning_LdaDocumentSampler
- Implementation:Dotnet_Machinelearning_LdaDataBlock
- Implementation:Dotnet_Machinelearning_LdaModelBlock
- Implementation:Dotnet_Machinelearning_LdaHybridMap
- Implementation:Dotnet_Machinelearning_LdaHybridAliasMap
- Implementation:Dotnet_Machinelearning_AliasMultinomialRng
- Implementation:Dotnet_Machinelearning_LdaLightHashMap
- Principle:Dotnet_Machinelearning_Alias_Method_Sampling
- Principle:Dotnet_Machinelearning_Hybrid_Dense_Sparse_Storage