Principle:Dotnet Machinelearning Latent Dirichlet Allocation

From Leeroopedia


Knowledge Sources
Domains: Topic_Modeling, NLP, Bayesian_Inference, Machine_Learning
Last Updated: 2026-02-09 12:00 GMT

Overview

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that discovers latent topics in document collections by modeling each document as a mixture of topics and each topic as a distribution over words.

Description

LDA was introduced by Blei, Ng, and Jordan (2003) as a three-level hierarchical Bayesian model for collections of discrete data, particularly text corpora. The key insight is that documents exhibit multiple topics, and each topic is characterized by a distribution over the vocabulary. LDA provides a principled framework for dimensionality reduction of document representations -- from the high-dimensional space of raw word counts to a low-dimensional space of topic proportions.

Generative process: For a corpus of $M$ documents, where document $d$ contains $N_d$ words:

  1. For each topic $k = 1, \dots, K$: draw a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$
  2. For each document $d = 1, \dots, M$:
    1. Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$
    2. For each word position $n = 1, \dots, N_d$:
      1. Draw a topic assignment $z_{d,n} \sim \mathrm{Mult}(\theta_d)$
      2. Draw a word $w_{d,n} \sim \mathrm{Mult}(\phi_{z_{d,n}})$
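
The generative process can be simulated directly, which is a handy way to produce synthetic corpora for testing a sampler. The following is a minimal sketch using only the Python standard library; the function names `sample_dirichlet` and `generate_corpus` are illustrative and do not correspond to anything in ML.NET. It assumes symmetric Dirichlet priors and a fixed document length.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    """Follow the LDA generative process; return (documents, topic assignments)."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi_k per topic.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    docs, assignments = [], []
    for _ in range(num_docs):
        # Step 2.1: topic proportions theta_d for this document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        # Step 2.2.1: a topic assignment for every word position.
        z = rng.choices(range(num_topics), weights=theta, k=doc_len)
        # Step 2.2.2: a word drawn from the assigned topic's distribution.
        w = [rng.choices(range(vocab_size), weights=phi[k])[0] for k in z]
        docs.append(w)
        assignments.append(z)
    return docs, assignments

docs, z = generate_corpus(num_docs=3, doc_len=8, num_topics=2,
                          vocab_size=5, alpha=0.5, beta=0.1)
```

Each document is returned as a list of word ids, with a parallel list holding the topic that generated each word.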

Key properties:

  • Exchangeability: Documents are exchangeable (order does not matter), and words within a document are exchangeable ("bag of words" assumption).
  • Conjugacy: Dirichlet priors are conjugate to multinomial likelihoods, enabling efficient collapsed inference where $\theta_d$ and $\phi_k$ are integrated out analytically.
  • Hyperparameters: α controls document-topic sparsity (smaller values yield documents focused on fewer topics); β controls topic-word sparsity (smaller values yield topics focused on fewer words).
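
The sparsity effect of the concentration parameter can be checked numerically. The sketch below is illustrative only: it draws many samples from a symmetric Dirichlet and reports the average weight of the largest component. The helper names and the chosen values (0.1 and 10.0) are assumptions made for the example.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(concentration, size=10, trials=2000, seed=0):
    """Average weight of the largest component across many Dirichlet draws."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(concentration, size, rng)) for _ in range(trials))
    return sum(largest) / trials

sparse = mean_largest_share(0.1)   # small concentration: mass piles onto few topics
dense = mean_largest_share(10.0)   # large concentration: mass spreads across topics
```

With a small concentration the largest component dominates each draw, while a large concentration keeps every component close to 1/K.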

Usage

LDA is used when you need to:

  • Discover latent thematic structure in large text corpora
  • Produce low-dimensional document representations for downstream tasks (classification, clustering, retrieval)
  • Generate interpretable topic summaries (top words per topic)
  • Perform document similarity computation in topic space
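
For the document-similarity use case, one option is to compare topic-proportion vectors directly. This is a minimal sketch, assuming each document has already been mapped to a probability vector over the K topics; the function names are illustrative and are not part of ML.NET.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def hellinger_distance(p, q):
    """Distance between two probability vectors, in the range [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.05, 0.05, 0.9]
```

Hellinger distance is often preferred for probability vectors because it is bounded and treats the inputs as distributions, whereas cosine similarity is the more familiar choice in retrieval pipelines.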

In ML.NET, LDA is exposed through the LatentDirichletAllocationEstimator, which wraps the native LdaNative C++ library.

Theoretical Basis

Collapsed Gibbs Sampling

Exact inference of the posterior $P(\mathbf{z} \mid \mathbf{w}, \alpha, \beta)$ is intractable. The ML.NET implementation uses collapsed Gibbs sampling: the topic-word distributions $\phi$ and document-topic distributions $\theta$ are marginalized out analytically, and sampling is performed only over the discrete topic assignments $\mathbf{z}$.

The collapsed conditional distribution for assigning topic k to word position (d, n) is:

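$$P(z_{d,n} = k \mid \mathbf{z}_{-(d,n)}, \mathbf{w}, \alpha, \beta) \propto \frac{n_{k,w_{d,n}}^{-(d,n)} + \beta}{n_{k,\cdot}^{-(d,n)} + V\beta} \left(n_{d,k}^{-(d,n)} + \alpha\right)$$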

Where:

  • $n_{k,w}^{-(d,n)}$ = count of word $w$ assigned to topic $k$, excluding the current position
  • $n_{k,\cdot}^{-(d,n)}$ = total count of all words assigned to topic $k$, excluding the current position
  • $n_{d,k}^{-(d,n)}$ = count of words in document $d$ assigned to topic $k$, excluding the current position
  • V = vocabulary size
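
The collapsed conditional can be evaluated for a single token straight from the count tables. The following is a minimal sketch, assuming the counts passed in already exclude the token being resampled; the function name and the list-of-lists layout are illustrative choices, not the layout used by the native library.

```python
def collapsed_conditional(word, n_kw, n_k, n_dk, alpha, beta):
    """Normalized P(z = k | everything else) for one token.

    n_kw[k][w]: count of word w in topic k
    n_k[k]:     total words in topic k
    n_dk[k]:    words of this document in topic k
    All counts must exclude the token being resampled.
    """
    vocab_size = len(n_kw[0])
    weights = [
        (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta) * (n_dk[k] + alpha)
        for k in range(len(n_k))
    ]
    total = sum(weights)
    return [x / total for x in weights]

# Two topics, three vocabulary words; the token's word id is 0.
probs = collapsed_conditional(
    word=0,
    n_kw=[[2, 0, 1], [0, 3, 0]],
    n_k=[3, 3],
    n_dk=[1, 2],
    alpha=0.5,
    beta=0.1,
)
```

In this example topic 0 receives most of the probability mass because word 0 has only ever been assigned to topic 0, even though the document currently leans toward topic 1.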

Metropolis-Hastings Acceleration

The ML.NET implementation uses the LightLDA approach (Yuan et al., 2015), which accelerates the standard Gibbs sampler using two Metropolis-Hastings proposal distributions:

Proposal 1 (Word proposal): Proposes a topic from the word-topic distribution: $q_w(k) \propto \frac{n_{k,w} + \beta}{n_{k,\cdot} + V\beta}$

This is efficiently sampled using a pre-built alias table for each word.
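
An alias table trades a linear-time setup for constant-time sampling, which is what makes the word proposal cheap. The sketch below uses Vose's variant of the alias method as an illustration; it is not taken from the LdaNative source, and the function names are hypothetical. `implied_distribution` is included only to verify that the table encodes the intended probabilities.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) setup for O(1) sampling."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    prob = [1.0] * k
    alias = list(range(k))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # keep bucket s with this probability
        alias[s] = l                 # otherwise redirect to the large bucket
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def sample_alias(prob, alias, rng):
    """Pick a bucket uniformly, then keep it or follow its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

def implied_distribution(prob, alias):
    """Recover the distribution the table encodes (for checking only)."""
    k = len(prob)
    dist = [0.0] * k
    for i in range(k):
        dist[i] += prob[i] / k
        dist[alias[i]] += (1.0 - prob[i]) / k
    return dist

prob, alias = build_alias_table([0.5, 0.3, 0.2])
topic = sample_alias(prob, alias, random.Random(0))
```

For the word proposal, the weights passed to `build_alias_table` would be the per-topic values of (n_kw + beta)/(n_k + V*beta) for one word.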

Proposal 2 (Document proposal): Proposes a topic from the document-topic distribution: $q_d(k) \propto n_{d,k} + \alpha$

This proposal is sampled either by picking the topic of a random token in the document (with probability proportional to n_d_sum) or by drawing a topic uniformly (with probability proportional to alpha_sum).
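
The two-branch sampling scheme for the document proposal can be sketched as follows. This example assumes that n_d_sum is the number of tokens in the document and that alpha_sum equals K times alpha; both readings are assumptions made for illustration. The second function computes the resulting probabilities exactly, which shows that the mixture reproduces a distribution proportional to n_dk + alpha.

```python
import random

def sample_doc_proposal(doc_topics, num_topics, alpha, rng):
    """Draw a topic with probability proportional to n_dk + alpha."""
    doc_len = len(doc_topics)
    alpha_sum = num_topics * alpha
    if rng.random() * (doc_len + alpha_sum) < doc_len:
        # Token branch: the topic of a uniformly chosen token.
        return doc_topics[rng.randrange(doc_len)]
    # Smoothing branch: a uniformly chosen topic.
    return rng.randrange(num_topics)

def doc_proposal_probabilities(doc_topics, num_topics, alpha):
    """Exact probabilities of the mixture above (non-empty document assumed)."""
    doc_len = len(doc_topics)
    alpha_sum = num_topics * alpha
    mix = doc_len / (doc_len + alpha_sum)
    return [
        mix * doc_topics.count(k) / doc_len + (1.0 - mix) / num_topics
        for k in range(num_topics)
    ]

doc_topics = [0, 0, 1]  # current topic of each token in one document
probs = doc_proposal_probabilities(doc_topics, num_topics=3, alpha=0.5)
proposed = sample_doc_proposal(doc_topics, num_topics=3, alpha=0.5, rng=random.Random(0))
```

Sampling this way needs no per-document table: the list of current topic assignments already acts as the sampling structure.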

Each proposal is accepted or rejected with the standard MH ratio:

$$\pi = \min\left(1, \frac{P(t)\, q(s)}{P(s)\, q(t)}\right)$$

where $s$ is the current topic, $t$ is the proposed topic, $P$ is the collapsed conditional given above, and $q$ is the proposal distribution in use.
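
The acceptance probability simplifies considerably for each proposal, because part of the target cancels against the proposal. The following sketch evaluates it for both proposals, under the simplifying assumption that the proposal and the target are computed from the same counts (in a real sampler the proposal tables may be slightly stale). All function names are illustrative.

```python
def word_part(k, word, n_kw, n_k, beta):
    """Word-topic factor, also the unnormalized word proposal q_w(k)."""
    vocab_size = len(n_kw[0])
    return (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta)

def target(k, word, n_kw, n_k, n_dk, alpha, beta):
    """Unnormalized collapsed conditional P(k)."""
    return word_part(k, word, n_kw, n_k, beta) * (n_dk[k] + alpha)

def acceptance(p_s, p_t, q_s, q_t):
    """Metropolis-Hastings acceptance probability for moving from s to t."""
    return min(1.0, (p_t * q_s) / (p_s * q_t))

n_kw, n_k, n_dk = [[2, 0, 1], [0, 3, 0]], [3, 3], [1, 2]
alpha, beta, word = 0.5, 0.1, 0
s, t = 1, 0  # current topic and proposed topic

p_s = target(s, word, n_kw, n_k, n_dk, alpha, beta)
p_t = target(t, word, n_kw, n_k, n_dk, alpha, beta)

# Word proposal: the word factor cancels, leaving (n_dt + alpha)/(n_ds + alpha).
pi_word = acceptance(p_s, p_t,
                     word_part(s, word, n_kw, n_k, beta),
                     word_part(t, word, n_kw, n_k, beta))

# Document proposal: the document factor cancels, leaving the word-factor ratio.
pi_doc = acceptance(p_s, p_t, n_dk[s] + alpha, n_dk[t] + alpha)
```

Because each proposal matches one factor of the target, the acceptance test only has to evaluate the other factor, which keeps the cost per step independent of the number of topics.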

Pseudocode

Algorithm: LightLDA Collapsed Gibbs Sampling with MH

Input: corpus W, num_topics K, alpha, beta, num_iterations T, mh_steps S
Output: topic assignments Z

1. Initialize Z randomly
2. Build word-topic count matrix n_kw[K][V] and summary n_k[K]
3. For iteration = 1 to T:
   a. For each word w, build alias table for q_w(k) = (n_kw + beta)/(n_k + V*beta)
   b. Build global alias table for q_beta(k) = beta/(n_k + V*beta)
   c. For each document d:
      i.   Compute doc-topic counts n_dk from current Z
      ii.  For each token position (d,n) with word w and current topic s:
           For m = 1 to S:
             // Word proposal
             Sample t from alias_table[w]
             Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
             Accept t with probability pi (s <- t)
             // Document proposal
             Sample t from doc-topic distribution
             Compute pi = MH_ratio(s, t, w, d, n_kw, n_k, n_dk, alpha, beta)
             Accept t with probability pi (s <- t)
      iii. Write the accepted topics back to Z and update n_kw and n_k with the topic changes
4. Return Z
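
The pseudocode can be turned into a small runnable sampler for experimentation. The sketch below is a simplified, serial variant and not the ML.NET implementation: it samples the proposals with `random.choices` instead of alias tables, computes the document proposal from counts that exclude the current token, and updates the counts after every token rather than after every document. The name `lightlda_toy` and the toy corpus are assumptions made for the example.

```python
import random

def lightlda_toy(docs, num_topics, vocab_size, alpha, beta, iterations, mh_steps, seed=0):
    """Serial toy sampler: word and document MH proposals for every token."""
    rng = random.Random(seed)
    topics = range(num_topics)
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_kw = [[0] * vocab_size for _ in topics]
    n_k = [0] * num_topics
    for doc, z_d in zip(docs, z):
        for w, k in zip(doc, z_d):
            n_kw[k][w] += 1
            n_k[k] += 1

    def word_part(k, w):
        return (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)

    for _iteration in range(iterations):
        # Word proposals are frozen for the whole iteration, like the alias tables.
        q_w = [[word_part(k, w) for k in topics] for w in range(vocab_size)]
        for doc, z_d in zip(docs, z):
            n_dk = [z_d.count(k) for k in topics]
            for n, w in enumerate(doc):
                s = z_d[n]
                # Exclude the current token from all counts.
                n_kw[s][w] -= 1
                n_k[s] -= 1
                n_dk[s] -= 1
                for _step in range(mh_steps):
                    # Word proposal: fresh counts in the target, frozen q_w in the proposal.
                    t = rng.choices(topics, weights=q_w[w])[0]
                    p_s = word_part(s, w) * (n_dk[s] + alpha)
                    p_t = word_part(t, w) * (n_dk[t] + alpha)
                    if rng.random() < (p_t * q_w[w][s]) / (p_s * q_w[w][t]):
                        s = t
                    # Document proposal: the document factor cancels in the ratio.
                    t = rng.choices(topics, weights=[c + alpha for c in n_dk])[0]
                    if rng.random() < word_part(t, w) / word_part(s, w):
                        s = t
                # Record the result and put the token back into the counts.
                z_d[n] = s
                n_kw[s][w] += 1
                n_k[s] += 1
                n_dk[s] += 1
    return z, n_kw, n_k

docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0, 1], [3, 2, 2, 3]]
z, n_kw, n_k = lightlda_toy(docs, num_topics=2, vocab_size=4,
                            alpha=0.5, beta=0.1, iterations=50, mh_steps=2)
```

The returned count tables stay consistent with the assignments at every step, which is a useful invariant to assert when modifying the sampler.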

Log-Likelihood

The complete data log-likelihood is used to monitor convergence:

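$$\log P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \underbrace{\sum_{d} \left[ \log\Gamma(K\alpha) - K \log\Gamma(\alpha) + \sum_{k} \log\Gamma(n_{d,k} + \alpha) - \log\Gamma(N_d + K\alpha) \right]}_{\text{document-topic term}} + \underbrace{\sum_{k} \left[ \log\Gamma(V\beta) - V \log\Gamma(\beta) + \sum_{w} \log\Gamma(n_{k,w} + \beta) - \log\Gamma(n_{k,\cdot} + V\beta) \right]}_{\text{word-topic term}}$$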
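
The log-likelihood is straightforward to evaluate from the two count tables using the log-gamma function. This is a minimal sketch assuming symmetric priors and dense list-of-lists counts; the function name is illustrative. The small example at the end has a known answer: two tokens, two topics, two words, and uniform priors give a joint probability of 1/24.

```python
from math import lgamma, log

def complete_log_likelihood(n_dk, n_kw, alpha, beta):
    """log P(w, z | alpha, beta) from document-topic and topic-word counts."""
    num_topics = len(n_kw)
    vocab_size = len(n_kw[0])
    doc_term = 0.0
    for counts in n_dk:  # one row of topic counts per document
        doc_term += lgamma(num_topics * alpha) - num_topics * lgamma(alpha)
        doc_term += sum(lgamma(c + alpha) for c in counts)
        doc_term -= lgamma(sum(counts) + num_topics * alpha)
    word_term = 0.0
    for counts in n_kw:  # one row of word counts per topic
        word_term += lgamma(vocab_size * beta) - vocab_size * lgamma(beta)
        word_term += sum(lgamma(c + beta) for c in counts)
        word_term -= lgamma(sum(counts) + vocab_size * beta)
    return doc_term + word_term

# One document: word 0 assigned to topic 0, word 1 assigned to topic 1.
ll = complete_log_likelihood(n_dk=[[1, 1]], n_kw=[[1, 0], [0, 1]], alpha=1.0, beta=1.0)
expected = -log(24)
```

Working in log space with `lgamma` avoids the overflow that the raw Gamma function would cause for realistic corpus sizes.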

Implementation Details

Multi-threaded Training

The ML.NET implementation parallelizes across documents:

  • Documents are partitioned evenly across num_threads worker threads.
  • Each thread maintains a local delta buffer for word-topic count changes.
  • After processing all documents in an iteration, deltas are applied to the global word-topic table using sharded access (each thread owns a range of words) and a mutex-protected global summary update.
  • Thread synchronization uses a SimpleBarrier for phase coordination.
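
The delta-buffer pattern described above can be illustrated with standard threading primitives. The sketch below is a loose Python analogy, not a port of the native code: `threading.Barrier` stands in for SimpleBarrier, documents are assigned to threads by striding, and the input is a precomputed list of topic changes rather than the output of a sampler. All names are hypothetical.

```python
import threading
from collections import Counter

def apply_changes_in_parallel(doc_changes, n_kw, n_k, num_threads):
    """doc_changes[d] lists (word, old_topic, new_topic) moves for document d."""
    vocab_size = len(n_kw[0])
    deltas = [Counter() for _ in range(num_threads)]  # one local buffer per thread
    barrier = threading.Barrier(num_threads)
    summary_lock = threading.Lock()

    def worker(tid):
        # Phase 1: process this thread's documents, recording deltas locally.
        for d in range(tid, len(doc_changes), num_threads):
            for word, old, new in doc_changes[d]:
                deltas[tid][(word, old)] -= 1
                deltas[tid][(word, new)] += 1
        barrier.wait()  # every buffer is complete before any merge starts
        # Phase 2: this thread owns a contiguous word range and merges all buffers for it.
        lo = tid * vocab_size // num_threads
        hi = (tid + 1) * vocab_size // num_threads
        local_summary = Counter()
        for buf in deltas:
            for (word, topic), delta in buf.items():
                if lo <= word < hi:
                    n_kw[topic][word] += delta
                    local_summary[topic] += delta
        with summary_lock:  # the per-topic summary is shared by all shards
            for topic, delta in local_summary.items():
                n_k[topic] += delta

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

n_kw = [[2, 1], [0, 1]]  # n_kw[topic][word]
n_k = [3, 1]
apply_changes_in_parallel([[(0, 0, 1)], [(1, 1, 0)]], n_kw, n_k, num_threads=2)
```

Sharding the merge by word means no two threads write the same cell of the word-topic table, so only the small per-topic summary needs a lock.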

Alpha Adaptation

The implementation adapts the alpha_sum hyperparameter:

  • Training: If alpha_sum < 10, it is set to 100 (encourages broad topic mixing during learning).
  • Inference: If alpha_sum > 10, it is set to 1 (encourages sparse topic assignments for sharper predictions).
  • SetAlphaSum: alpha_sum is multiplied by average document length once, reflecting the standard parameterization where per-topic alpha scales with document size.
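
The adaptation rules listed above can be written down as two small functions. This sketch encodes only what the list states; the function names, the thresholds expressed as literals, and the order in which the two steps would be applied are assumptions, since the list does not specify them.

```python
def adapt_alpha_sum(alpha_sum, training):
    """Apply the training-time or inference-time override described above."""
    if training and alpha_sum < 10:
        return 100.0  # training: push small values up
    if not training and alpha_sum > 10:
        return 1.0    # inference: pull large values down
    return alpha_sum

def scale_alpha_sum(alpha_sum, avg_doc_length):
    """One-time scaling by the average document length (the SetAlphaSum step)."""
    return alpha_sum * avg_doc_length

train_alpha = adapt_alpha_sum(5.0, training=True)
infer_alpha = adapt_alpha_sum(50.0, training=False)
```

Values that do not cross the threshold for the current mode pass through unchanged.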
