Principle:Dotnet Machinelearning Latent Dirichlet Allocation

Knowledge Sources	Latent Dirichlet Allocation; LightLDA: Big Topic Models on Modest Computer Clusters; Finding Scientific Topics
Domains	Topic_Modeling, NLP, Bayesian_Inference, Machine_Learning
Last Updated	2026-02-09 12:00 GMT

Overview

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that discovers latent topics in document collections by modeling each document as a mixture of topics and each topic as a distribution over words.

Description

LDA was introduced by Blei, Ng, and Jordan (2003) as a three-level hierarchical Bayesian model for collections of discrete data, particularly text corpora. The key insight is that documents exhibit multiple topics, and each topic is characterized by a distribution over the vocabulary. LDA provides a principled framework for dimensionality reduction of document representations, mapping the high-dimensional space of raw word counts to a low-dimensional space of topic proportions.

Generative process: For a corpus of M documents, where document d contains N_d words:

For each topic k = 1, ..., K: draw word distribution $ϕ_{k} \sim Dir (β)$
For each document d = 1, ..., M:
1. Draw topic proportions $θ_{d} \sim Dir (α)$
2. For each word position n = 1, ..., N_d:
  1. Draw topic assignment $z_{d, n} \sim Mult (θ_{d})$
  2. Draw word $w_{d, n} \sim Mult (ϕ_{z_{d, n}})$
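The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than the ML.NET implementation: it assumes symmetric priors, a fixed document length for every document, and uses hypothetical helper names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`). Dirichlet draws are obtained by normalizing independent Gamma draws.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size, alpha, beta, seed=0):
    rng = random.Random(seed)
    # phi[k] ~ Dir(beta): word distribution of topic k
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    docs, assignments = [], []
    for _ in range(num_docs):
        # theta ~ Dir(alpha): topic proportions of this document
        theta = sample_dirichlet([alpha] * num_topics, rng)
        # z ~ Mult(theta) per position, then w ~ Mult(phi[z])
        z = [sample_categorical(theta, rng) for _ in range(doc_length)]
        w = [sample_categorical(phi[k], rng) for k in z]
        docs.append(w)
        assignments.append(z)
    return docs, assignments, phi

docs, assignments, phi = generate_corpus(
    num_docs=3, doc_length=8, num_topics=2, vocab_size=5, alpha=0.5, beta=0.5
)
```

Each entry of `docs` is a list of word ids, and `assignments` holds the latent topic that generated each word, which is exactly what inference later has to recover.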

Key properties:

Exchangeability: Documents are exchangeable (order does not matter), and words within a document are exchangeable ("bag of words" assumption).
Conjugacy: Dirichlet priors are conjugate to multinomial likelihoods, enabling efficient collapsed inference where $θ_{d}$ and $ϕ_{k}$ are integrated out analytically.
Hyperparameters: $α$ controls document-topic sparsity (smaller values yield documents focused on fewer topics); $β$ controls topic-word sparsity (smaller values yield topics focused on fewer words).
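The sparsity effect of the concentration parameter can be checked empirically. The sketch below is an illustration only, with a hypothetical helper `mean_largest_share`: it draws topic proportions from a symmetric Dirichlet over 10 topics and reports the average share taken by the largest topic. A small $α$ should push that share toward 1 (documents dominated by one topic), while a large $α$ should spread mass more evenly.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, num_topics=10, num_draws=500, seed=0):
    """Average of max_k theta_k over repeated draws theta ~ Dir(alpha, ..., alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        total += max(sample_dirichlet([alpha] * num_topics, rng))
    return total / num_draws

sparse_share = mean_largest_share(alpha=0.05)  # small alpha: few dominant topics
dense_share = mean_largest_share(alpha=5.0)    # large alpha: broad topic mixing
print(sparse_share, dense_share)
```

The same experiment applies to $β$ by treating the vector as a topic's distribution over words instead of a document's distribution over topics.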

Usage

LDA is used when you need to:

Discover latent thematic structure in large text corpora
Produce low-dimensional document representations for downstream tasks (classification, clustering, retrieval)
Generate interpretable topic summaries (top words per topic)
Perform document similarity computation in topic space
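As an illustration of similarity in topic space, the sketch below compares documents by their topic-proportion vectors. The vectors are made-up values, not model output, and the two measures shown (cosine similarity and Hellinger distance) are common choices rather than anything prescribed by the source.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger_distance(p, q):
    """Hellinger distance between two probability vectors (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical topic proportions for three documents over K = 3 topics.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
doc_c = [0.05, 0.15, 0.80]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
dist_ab = hellinger_distance(doc_a, doc_b)
dist_ac = hellinger_distance(doc_a, doc_c)
```

Here `doc_a` and `doc_b` share a dominant topic, so they score as closer to each other than either does to `doc_c`.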

In ML.NET, LDA is exposed through the LatentDirichletAllocationEstimator, which wraps the native LdaNative C++ library.

Theoretical Basis

Collapsed Gibbs Sampling

Direct inference of the posterior $P (𝐳 ∣ 𝐰, α, β)$ is intractable because its normalizing constant sums over every possible joint topic assignment. The ML.NET implementation uses collapsed Gibbs sampling, where the topic-word distributions $ϕ$ and document-topic distributions $θ$ are analytically marginalized out, and sampling is performed only over the discrete topic assignments $𝐳$.

The collapsed conditional distribution for assigning topic k to word position (d, n) is:

$P (z_{d, n} = k ∣ 𝐳_{- (d, n)}, 𝐰, α, β) \propto \frac{n_{k, w_{d, n}}^{- (d, n)} + β}{n_{k, \cdot}^{- (d, n)} + V β} \cdot (n_{d, k}^{- (d, n)} + α)$

Where:

$n_{k, w}^{- (d, n)}$ = count of word w assigned to topic k, excluding current position
$n_{k, \cdot}^{- (d, n)}$ = total count of all words assigned to topic k, excluding current position
$n_{d, k}^{- (d, n)}$ = count of words in document d assigned to topic k, excluding current position
V = vocabulary size
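To make the formula concrete, the following sketch evaluates the collapsed conditional for one token and normalizes it over topics. The function name and the tiny count tables are hypothetical, and the counts are assumed to already exclude the token being resampled.

```python
def collapsed_conditional(word, n_kw, n_k, n_dk, alpha, beta, vocab_size):
    """Normalized P(z = k | rest) for one token; counts must exclude that token."""
    weights = [
        (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta) * (n_dk[k] + alpha)
        for k in range(len(n_k))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical counts with K = 2 topics and V = 3 words.
n_kw = [[3, 0, 1],   # topic 0: counts of words 0, 1, 2
        [0, 2, 2]]   # topic 1
n_k = [4, 4]         # total tokens per topic
n_dk = [2, 1]        # tokens of the current document per topic

probs = collapsed_conditional(
    word=0, n_kw=n_kw, n_k=n_k, n_dk=n_dk, alpha=0.5, beta=0.1, vocab_size=3
)
```

Word 0 has been seen three times under topic 0 and never under topic 1, and the document already leans toward topic 0, so almost all of the probability mass lands on topic 0.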

Metropolis-Hastings Acceleration

The ML.NET implementation uses the LightLDA approach (Yuan et al., 2015), which accelerates the standard Gibbs sampler using two Metropolis-Hastings proposal distributions:

Proposal 1 (Word proposal): Proposes a topic from the word-topic distribution: $q_{w} (k) \propto \frac{n_{k, w} + β}{n_{k, \cdot} + V β}$

This is efficiently sampled using a pre-built alias table for each word.
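An alias table trades a one-time O(K) build for O(1) draws. The sketch below uses Vose's variant of the alias method on hypothetical counts for a single word; it illustrates the technique and is not taken from the LdaNative source.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) setup for O(1) sampling from unnormalized weights."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]  # mean of scaled weights is 1
    prob = [0.0] * k
    alias = list(range(k))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]      # keep s with this probability
        alias[s] = l             # otherwise redirect to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:      # leftovers are 1 up to rounding error
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, rng):
    """Pick a column uniformly, then keep it or jump to its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Hypothetical counts for one word w across K = 4 topics.
n_kw_column = [5, 0, 2, 0]
n_k = [20, 15, 10, 5]
beta, vocab_size = 0.1, 50
weights = [(n_kw_column[k] + beta) / (n_k[k] + vocab_size * beta) for k in range(4)]

prob, alias = build_alias_table(weights)
rng = random.Random(0)
draws = [sample_alias(prob, alias, rng) for _ in range(5)]
```

Because the table is built from counts at a fixed point in time, the draws come from a slightly stale distribution as counts change; the Metropolis-Hastings acceptance step corrects for that.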

Proposal 2 (Document proposal): Proposes a topic from the document-topic distribution: $q_{d} (k) \propto n_{d, k} + α$

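Sampled by either picking the topic of a uniformly random token in the document (with probability proportional to n_d_sum, the document length $N_{d}$) or drawing a topic uniformly from the K topics (with probability proportional to alpha_sum, which equals $K α$ for a symmetric prior). Together the two branches select topic k with probability $(n_{d, k} + α) / (N_{d} + K α)$, so no dense K-length count vector has to be built.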
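The two-branch document proposal can be sketched as follows. The function name is hypothetical and a symmetric $α$ is assumed, so that alpha_sum equals K times alpha.

```python
import random

def sample_doc_proposal(doc_topics, num_topics, alpha, rng):
    """Draw a topic with probability proportional to n_dk + alpha.

    doc_topics holds the current topic of every token in the document,
    so indexing it at random is equivalent to sampling from the counts n_dk.
    """
    doc_length = len(doc_topics)
    alpha_sum = num_topics * alpha
    u = rng.random() * (doc_length + alpha_sum)
    if u < doc_length:
        return doc_topics[int(u)]      # topic of a uniformly chosen token
    return rng.randrange(num_topics)   # uniform draw accounts for the alpha mass

# Hypothetical document with four tokens and K = 3 topics.
doc_topics = [0, 0, 0, 1]
rng = random.Random(0)
proposal = sample_doc_proposal(doc_topics, num_topics=3, alpha=0.5, rng=rng)
```

With these values the proposal weights are 3.5, 1.5, and 0.5 for topics 0, 1, and 2, and each draw costs O(1) regardless of K.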

Each proposal is accepted or rejected with the standard MH ratio:

$π = \min (1, \frac{P (t) \cdot q (s)}{P (s) \cdot q (t)})$

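where s is the current topic, t is the proposed topic, P is the collapsed conditional above, and q is the proposal distribution that generated t ($q_{w}$ or $q_{d}$).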
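The acceptance probability can be written out for both proposals as below. All function names are hypothetical, and the sketch makes one simplifying assumption: the same counts are used for the target P and the proposal q. Under that assumption the word-topic factor cancels for the word proposal, leaving only the ratio of document-topic terms. In a sampler that draws from stale alias tables the two sets of counts differ and the cancellation is only approximate.

```python
def target_weight(k, word, n_kw, n_k, n_dk, alpha, beta, vocab_size):
    """Unnormalized collapsed conditional P(z = k | rest)."""
    return (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta) * (n_dk[k] + alpha)

def word_proposal_weight(k, word, n_kw, n_k, beta, vocab_size):
    """Unnormalized word proposal q_w(k)."""
    return (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta)

def doc_proposal_weight(k, n_dk, alpha):
    """Unnormalized document proposal q_d(k)."""
    return n_dk[k] + alpha

def mh_acceptance(p_s, p_t, q_s, q_t):
    """min(1, P(t) q(s) / (P(s) q(t))) for a move from topic s to topic t."""
    return min(1.0, (p_t * q_s) / (p_s * q_t))

# Hypothetical counts: K = 2 topics, V = 3 words, current token is word 0.
n_kw = [[3, 0, 1], [0, 2, 2]]
n_k = [4, 4]
n_dk = [2, 1]
alpha, beta, V, word = 0.5, 0.1, 3, 0
s, t = 0, 1  # propose moving from topic 0 to topic 1

p_s = target_weight(s, word, n_kw, n_k, n_dk, alpha, beta, V)
p_t = target_weight(t, word, n_kw, n_k, n_dk, alpha, beta, V)

accept_word = mh_acceptance(
    p_s, p_t,
    word_proposal_weight(s, word, n_kw, n_k, beta, V),
    word_proposal_weight(t, word, n_kw, n_k, beta, V),
)
accept_doc = mh_acceptance(
    p_s, p_t,
    doc_proposal_weight(s, n_dk, alpha),
    doc_proposal_weight(t, n_dk, alpha),
)
```

Only ratios of P and q appear, so neither distribution needs to be normalized, which is what keeps each step O(1).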

Pseudocode

Algorithm: LightLDA Collapsed Gibbs Sampling with MH

Input: corpus W, num_topics K, alpha, beta, num_iterations T, mh_steps
Output: topic assignments Z

1. Initialize Z randomly
2. Build word-topic count matrix n_kw[K][V] and summary n_k[K]
3. For iteration = 1 to T:
   a. For each word w, build alias table for q_w(k) = (n_kw + beta)/(n_k + V*beta)
   b. Build global alias table for q_beta(k) = beta/(n_k + V*beta)
   c. For each document d:
      i.   Compute doc-topic counts n_dk from current Z
      ii.  For each token position (d,n) with word w and current topic s:
           For m = 1 to mh_steps:
             // Word proposal
             Sample t from alias_table[w]
             Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
             Accept t with probability pi (s <- t)
             // Document proposal
             Sample t from doc-topic distribution
             Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
             Accept t with probability pi (s <- t)
      iii. Update n_kw and n_k with topic changes
4. Return Z
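A compact, single-threaded Python sketch of the loop above follows. It is an illustration under several stated assumptions, not a port of LdaNative: proposals are drawn with `random.choices` instead of alias tables, the word proposal weights are frozen once per iteration to mimic stale tables, counts exclude the token being resampled, and count updates are applied immediately rather than deferred. The function name `lightlda_sketch` is hypothetical.

```python
import random

def lightlda_sketch(docs, num_topics, vocab_size, alpha, beta,
                    num_iterations, mh_steps, seed=0):
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random initialization
    n_kw = [[0] * V for _ in range(K)]
    n_k = [0] * K
    for doc, zd in zip(docs, z):
        for w, k in zip(doc, zd):
            n_kw[k][w] += 1
            n_k[k] += 1

    def target(k, w, n_dk):
        # Unnormalized collapsed conditional for topic k.
        return (n_kw[k][w] + beta) / (n_k[k] + V * beta) * (n_dk[k] + alpha)

    for iteration in range(num_iterations):
        # Word proposal weights frozen for the whole iteration (stand-in for alias tables).
        q_word = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for k in range(K)]
                  for w in range(V)]
        for doc, zd in zip(docs, z):
            n_dk = [0] * K
            for k in zd:
                n_dk[k] += 1
            for n, w in enumerate(doc):
                s = zd[n]
                # Remove the current token so all counts exclude it.
                n_kw[s][w] -= 1
                n_k[s] -= 1
                n_dk[s] -= 1
                for step in range(mh_steps):
                    # Word proposal
                    t = rng.choices(range(K), weights=q_word[w])[0]
                    ratio = (target(t, w, n_dk) * q_word[w][s]) / (target(s, w, n_dk) * q_word[w][t])
                    if rng.random() < min(1.0, ratio):
                        s = t
                    # Document proposal, q_d(k) proportional to n_dk + alpha
                    q_doc = [c + alpha for c in n_dk]
                    t = rng.choices(range(K), weights=q_doc)[0]
                    ratio = (target(t, w, n_dk) * q_doc[s]) / (target(s, w, n_dk) * q_doc[t])
                    if rng.random() < min(1.0, ratio):
                        s = t
                # Commit the (possibly unchanged) topic.
                zd[n] = s
                n_kw[s][w] += 1
                n_k[s] += 1
                n_dk[s] += 1
    return z, n_kw, n_k

# Hypothetical toy corpus: word ids in [0, 5).
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 0]]
z, n_kw, n_k = lightlda_sketch(docs, num_topics=2, vocab_size=5, alpha=0.5,
                               beta=0.1, num_iterations=20, mh_steps=2)
```

Replacing the two `rng.choices` calls with alias-table and random-token draws is what would bring the per-token cost down from O(K) to O(1).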

Log-Likelihood

The complete data log-likelihood is used to monitor convergence:

$\log P (𝐰, 𝐳 ∣ α, β) = \underbrace{\sum_{d} [\log Γ (K α) - K \log Γ (α) + \sum_{k} \log Γ (n_{d, k} + α) - \log Γ (N_{d} + K α)]}_{\text{document-topic term}} + \underbrace{\sum_{k} [\log Γ (V β) - V \log Γ (β) + \sum_{w} \log Γ (n_{k, w} + β) - \log Γ (n_{k, \cdot} + V β)]}_{\text{word-topic term}}$
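The log-likelihood translates directly into code using the log-gamma function. The sketch below assumes symmetric $α$ and $β$ and takes the count tables as plain nested lists; the function name is hypothetical.

```python
import math

def complete_log_likelihood(n_dk, n_kw, alpha, beta):
    """log P(w, z | alpha, beta) from document-topic and topic-word counts."""
    K = len(n_kw)
    V = len(n_kw[0])

    doc_term = 0.0
    for counts in n_dk:                      # one row of K counts per document
        N_d = sum(counts)
        doc_term += math.lgamma(K * alpha) - K * math.lgamma(alpha)
        doc_term += sum(math.lgamma(c + alpha) for c in counts)
        doc_term -= math.lgamma(N_d + K * alpha)

    word_term = 0.0
    for counts in n_kw:                      # one row of V counts per topic
        n_k = sum(counts)
        word_term += math.lgamma(V * beta) - V * math.lgamma(beta)
        word_term += sum(math.lgamma(c + beta) for c in counts)
        word_term -= math.lgamma(n_k + V * beta)

    return doc_term + word_term

# Smallest possible case: one document, one token, K = 2, V = 2, uniform priors.
# The token has word id 0 and is assigned to topic 0.
ll = complete_log_likelihood(n_dk=[[1, 0]], n_kw=[[1, 0], [0, 0]], alpha=1.0, beta=1.0)
```

In this toy case the topic and the word are each chosen with probability 1/2 under the uniform priors, so the result equals log(1/4), which gives a quick sanity check on the formula.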

Implementation Details

Multi-threaded Training

The ML.NET implementation parallelizes across documents:

Documents are partitioned evenly across num_threads worker threads.
Each thread maintains a local delta buffer for word-topic count changes.
After processing all documents in an iteration, deltas are applied to the global word-topic table using sharded access (each thread owns a range of words) and a mutex-protected global summary update.
Thread synchronization uses a SimpleBarrier for phase coordination.
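The delta-buffer pattern described above can be sketched with Python's standard `threading` module. This is a schematic illustration only: `threading.Barrier` stands in for SimpleBarrier, the sampling phase is replaced by precomputed hypothetical topic changes of the form (word, old_topic, new_topic), and documents are assigned to threads round-robin.

```python
import threading

def run_iteration(changes_per_doc, n_kw, n_k, num_threads):
    vocab_size = len(n_kw[0])
    barrier = threading.Barrier(num_threads)
    summary_lock = threading.Lock()
    local_deltas = [[] for _ in range(num_threads)]  # one buffer per thread

    def worker(tid):
        # Phase 1: handle this thread's documents, buffering changes locally.
        for changes in changes_per_doc[tid::num_threads]:
            local_deltas[tid].extend(changes)
        barrier.wait()  # every buffer is complete before the global table is touched

        # Phase 2: apply all buffered deltas for the word range this thread owns.
        lo = tid * vocab_size // num_threads
        hi = (tid + 1) * vocab_size // num_threads
        summary_delta = [0] * len(n_k)
        for buffer in local_deltas:
            for word, old, new in buffer:
                if lo <= word < hi:
                    n_kw[old][word] -= 1
                    n_kw[new][word] += 1
                    summary_delta[old] -= 1
                    summary_delta[new] += 1
        # The per-topic summary is shared by all shards, so guard it with a mutex.
        with summary_lock:
            for k, d in enumerate(summary_delta):
                n_k[k] += d

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

# Hypothetical state: K = 2 topics, V = 4 words.
n_kw = [[2, 1, 0, 1], [0, 1, 2, 1]]
n_k = [4, 4]
changes_per_doc = [[(0, 0, 1)], [(2, 1, 0), (3, 0, 1)], [(1, 1, 0)]]
run_iteration(changes_per_doc, n_kw, n_k, num_threads=2)
```

Sharding by word range means no two threads write the same cell of the word-topic table, so only the small summary vector needs a lock.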

Alpha Adaptation

The implementation adapts the alpha_sum hyperparameter:

Training: If alpha_sum < 10, it is set to 100 (encourages broad topic mixing during learning).
Inference: If alpha_sum > 10, it is set to 1 (encourages sparse topic assignments for sharper predictions).
SetAlphaSum: alpha_sum is multiplied once by the average document length, so the total prior mass scales with typical document size.
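The three rules can be written as small pure functions. The names below are hypothetical and only restate the thresholds listed above; the order in which the native library applies them is not specified here and is not assumed.

```python
def adapt_alpha_sum_for_training(alpha_sum):
    """Training rule: values below 10 are replaced by 100."""
    return 100.0 if alpha_sum < 10 else alpha_sum

def adapt_alpha_sum_for_inference(alpha_sum):
    """Inference rule: values above 10 are replaced by 1."""
    return 1.0 if alpha_sum > 10 else alpha_sum

def scale_alpha_sum(alpha_sum, avg_doc_length):
    """SetAlphaSum rule: scale once by the average document length."""
    return alpha_sum * avg_doc_length

train_value = adapt_alpha_sum_for_training(5.0)
infer_value = adapt_alpha_sum_for_inference(100.0)
scaled_value = scale_alpha_sum(0.5, avg_doc_length=40.0)
```

Note that a value of exactly 10 is left unchanged by both the training and the inference rule as stated.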
