Implementation:Dotnet Machinelearning LdaDocumentSampler
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, NLP, Sampling |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
LightDocSampler is the per-thread, document-level sampler used for LDA topic assignment. It performs collapsed Gibbs sampling, accelerated with Metropolis-Hastings steps and alias-table proposals.
Description
The LightDocSampler class implements the core sampling algorithm used by LdaEngine for both training and inference. Each thread owns its own sampler instance, which holds references (not copies) to the shared global word-topic table, summary row, and alias tables. The sampler uses a two-proposal Metropolis-Hastings scheme inspired by the LightLDA algorithm:
Training (SampleOneDoc / OldProposalFreshSample):
- For each token in a document, the sampler proposes a new topic assignment using two alternating proposal distributions:
  - Word proposal: Samples from the word-specific alias table alias_k_v_[w], which encodes P(k|w) proportional to (n_kw + beta) / (n_k + beta_sum). The proposal is accepted or rejected via a Metropolis-Hastings ratio that accounts for the document-topic component.
  - Document proposal: Samples from the empirical document-topic distribution (with probability n_td_sum / (n_td_sum + alpha_sum)) or from the uniform prior (with probability alpha_sum / (n_td_sum + alpha_sum)). The proposal is accepted or rejected via an MH ratio involving the word-topic component.
- The word/document proposal pair is repeated for mh_step_for_gs_ MH rounds per token to improve mixing.
- Topic changes are recorded as word_topic_delta structs, sharded by word ID for lock-free parallel application.
- The delta_summary_row_ accumulates net topic count changes for the thread.
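The document proposal described above can be sketched as follows. This is an illustrative sketch, not the code in light_doc_sampler.cpp: the function name DrawDocProposal, the doc_topics vector (the topic currently assigned to each token of the document), and passing the two uniform draws u1 and u2 as arguments are all assumptions, chosen so that the example is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LightDocSampler): draws a topic from the
// document proposal q_d(k) proportional to n_td[k] + alpha, alpha = alpha_sum / K.
// doc_topics[i] is the topic currently assigned to token i of the document.
// u1 and u2 are uniform random numbers in [0, 1) supplied by the caller.
inline int32_t DrawDocProposal(const std::vector<int32_t>& doc_topics,
                               int32_t K, float alpha_sum,
                               double u1, double u2) {
    assert(K > 0 && u1 >= 0.0 && u1 < 1.0 && u2 >= 0.0 && u2 < 1.0);
    const double n_td_sum = static_cast<double>(doc_topics.size());
    const double p_empirical = n_td_sum / (n_td_sum + alpha_sum);
    if (u1 < p_empirical) {
        // Picking the topic of a uniformly chosen token selects topic k
        // with probability n_td[k] / n_td_sum.
        const auto i = static_cast<std::size_t>(u2 * n_td_sum);
        return doc_topics[i];
    }
    // Otherwise fall back to the symmetric prior: a uniform topic in [0, K).
    return static_cast<int32_t>(u2 * K);
}
```

Mixing the two branches gives topic k the probability (n_td[k] + alpha_sum / K) / (n_td_sum + alpha_sum), so no per-document alias table is needed; the token array itself serves as the sampling structure.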
Inference (InferOneDoc / OldProposalFreshSampleInfer):
- Uses Sample2WordFirstInfer(), which simplifies the word proposal's acceptance ratio by dropping the word-topic likelihood terms (the global model is frozen during inference), leaving only the document-topic component n_td_alpha. The document proposal's MH step still uses the full word-topic ratio.
- Does not record word_topic_delta changes (the global model is not updated during inference).
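One way to read the inference simplification: when a proposal matches one factor of the target exactly, that factor cancels in the MH ratio and only the other factor remains. The sketch below illustrates that algebra under this assumption. The function names and the per-topic alpha argument are hypothetical, and the actual Sample2WordFirstInfer() may keep additional terms.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch: acceptance probability for the WORD proposal at inference.
// Assumes the word proposal equals the word-topic factor exactly, so only the
// document-topic factor (n_td + alpha) is left in the ratio.
inline double WordProposalAcceptInfer(double n_td_t, double n_td_s, double alpha) {
    assert(n_td_s + alpha > 0.0);
    return std::min(1.0, (n_td_t + alpha) / (n_td_s + alpha));
}

// Hypothetical sketch: acceptance probability for the DOCUMENT proposal.
// Assumes the document proposal cancels the document-topic factor, so the ratio
// of word-topic factors (n_kw + beta) / (n_k + beta_sum) is what remains.
inline double DocProposalAcceptInfer(double n_tw, double n_sw,
                                     double n_t, double n_s,
                                     double beta, double beta_sum) {
    const double w_t = (n_tw + beta) / (n_t + beta_sum);
    const double w_s = (n_sw + beta) / (n_s + beta_sum);
    assert(w_s > 0.0);
    return std::min(1.0, w_t / w_s);
}
```

Here t is the proposed topic and s the current topic; a move toward a topic with a larger remaining factor is always accepted.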
Key data structures:
- doc_topic_counter_: A light_hash_map (capacity 1024) that stores the current document's topic counts, rebuilt at the start of each document via DocInit().
- q_w_proportion_: A K-length vector reused for computing per-word alias table proportions.
- word_topic_delta_: A vector of vectors, one per shard (num_threads shards, assigned by word ID), storing pending word-topic count deltas.
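A minimal sketch of how sharded deltas could be recorded is shown below. The word_topic_delta struct mirrors the one in the Signature section; the helper name RecordTopicChange and the shard rule word % num_shards are assumptions for illustration, and the real mapping from word ID to shard may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors lda::word_topic_delta from the Signature section.
struct word_topic_delta {
    int32_t word;
    int32_t topic;
    int32_t delta;
};

// Hypothetical helper: records that one token of `word` moved from old_topic
// to new_topic. ASSUMPTION: a word's shard is word % number_of_shards.
inline void RecordTopicChange(std::vector<std::vector<word_topic_delta>>& shards,
                              std::vector<int64_t>& delta_summary_row,
                              int32_t word, int32_t old_topic, int32_t new_topic) {
    assert(!shards.empty() && word >= 0);
    if (old_topic == new_topic) return;  // nothing changed, nothing to record
    auto& shard = shards[static_cast<std::size_t>(word) % shards.size()];
    shard.push_back({word, old_topic, -1});  // token leaves old_topic
    shard.push_back({word, new_topic, +1});  // token joins new_topic
    --delta_summary_row[static_cast<std::size_t>(old_topic)];
    ++delta_summary_row[static_cast<std::size_t>(new_topic)];
}
```

Because every delta for a given word lands in the same shard, a single thread can later apply that shard to the word-topic table without contending with other threads for the same row.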
Log-likelihood computation:
- ComputeOneDocLLH(): Computes the document-topic component using the LogGamma function over the doc-topic counts.
- ComputeWordLLH(): Computes the word-topic component for a range of words, handling both dense and sparse hybrid_map representations.
- NormalizeWordLLH(): Adds the K * log_topic_normalizer_ term and subtracts LogGamma(n_k + beta_sum) for each topic.
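For orientation, the document-topic component can be written with the standard Dirichlet-multinomial formula under a symmetric prior. The sketch below uses that textbook form; the function name, the (topic, count) pair representation, and the exact constant terms are assumptions, and ComputeOneDocLLH() may differ in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a document-topic log-likelihood:
//   lgamma(alpha_sum) - lgamma(n_d + alpha_sum)
//   + sum over topics k of [lgamma(n_dk + alpha) - lgamma(alpha)]
// with alpha = alpha_sum / K. Topics with n_dk == 0 contribute zero, so only
// the non-zero (topic, count) pairs of the document are needed.
inline double DocTopicLogLikelihood(
        const std::vector<std::pair<int32_t, int32_t>>& nonzero_counts,
        int32_t K, double alpha_sum) {
    assert(K > 0 && alpha_sum > 0.0);
    const double alpha = alpha_sum / K;
    double n_d = 0.0;  // total number of tokens in the document
    double llh = 0.0;
    for (const auto& topic_count : nonzero_counts) {
        llh += std::lgamma(topic_count.second + alpha) - std::lgamma(alpha);
        n_d += topic_count.second;
    }
    llh += std::lgamma(alpha_sum) - std::lgamma(n_d + alpha_sum);
    return llh;
}
```

Iterating only over non-zero counts is what makes a sparse per-document counter such as doc_topic_counter_ a natural fit for this computation.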
Usage
This class is instantiated internally by LdaEngine -- one instance per worker thread. It is not called directly from managed code. The engine configures each sampler with shared references to the global tables and invokes SampleOneDoc or InferOneDoc for each document in the thread's partition.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/LdaNative/light_doc_sampler.cpp (667 lines)
- File: src/Native/LdaNative/light_doc_sampler.hpp (187 lines)
Signature
namespace lda {

struct word_topic_delta {
    int32_t word;
    int32_t topic;
    int32_t delta;
};

class LightDocSampler {
public:
    LightDocSampler(int32_t K, int32_t V, int32_t num_threads, int32_t mh_step,
                    float beta, float alpha_sum,
                    std::vector<lda::hybrid_map>& word_topic_table,
                    std::vector<int64_t>& summary_row,
                    std::vector<lda::hybrid_alias_map>& alias_kv,
                    int32_t& beta_height, float& beta_mass,
                    std::vector<wood::alias_k_v>& beta_k_v);

    int32_t GlobalInit(LDADocument* doc);
    int32_t DocInit(LDADocument* doc);
    void EpocInit();
    void AdaptAlphaSum(bool is_train);

    int32_t SampleOneDoc(LDADocument* doc);
    int32_t InferOneDoc(LDADocument* doc);
    void GetDocTopic(LDADocument* doc, int* pTopics, int* pProbs, int32_t& numTopicsMax);

    double ComputeOneDocLLH(LDADocument* doc);
    double ComputeWordLLH(int32_t lower, int32_t upper);
    double NormalizeWordLLH();

    void build_alias_table(int32_t lower, int32_t upper, int thread_id);
    void build_word_topic_table(int32_t thread_id, int32_t num_threads,
                                lda::LDAModelBlock& model_block);

private:
    int32_t Sample2WordFirst(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
    int32_t Sample2WordFirstInfer(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
    int32_t OldProposalFreshSample(LDADocument* doc);
    int32_t OldProposalFreshSampleInfer(LDADocument* doc);
};

}  // namespace lda
Import
// LightDocSampler is an internal C++ class; it is not directly exposed via P/Invoke.
// It is used internally by LdaEngine which is the exported interface.
// The managed wrapper interacts only with LdaEngine's exported C functions.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| K | int32_t | Yes | Number of topics |
| V | int32_t | Yes | Vocabulary size |
| num_threads | int32_t | Yes | Thread count for delta sharding |
| mh_step | int32_t | Yes | Number of Metropolis-Hastings steps per token position |
| beta | float | Yes | Symmetric Dirichlet word-topic prior |
| alpha_sum | float | Yes | Sum of Dirichlet document-topic prior |
| word_topic_table | vector<hybrid_map>& | Yes | Shared global word-topic count matrix (V rows) |
| summary_row | vector<int64_t>& | Yes | Shared global topic count summary (K entries) |
| alias_kv | vector<hybrid_alias_map>& | Yes | Shared per-word alias tables (V entries) |
| doc | LDADocument* | Yes | Document to sample (contains word-topic pairs) |
Outputs
| Name | Type | Description |
|---|---|---|
| return (SampleOneDoc) | int32_t | Number of tokens swept in this document |
| pTopics | int* | Topic IDs for the document's topic distribution (GetDocTopic) |
| pProbs | int* | Topic counts for each returned topic; despite the name, these are integer counts rather than probabilities (GetDocTopic) |
| numTopicsMax | int32_t& | Actual number of non-zero topics returned |
| word_topic_delta_ | vector<vector<word_topic_delta>> | Accumulated word-topic count changes, one inner vector per shard (num_threads shards, assigned by word ID) |
| delta_summary_row_ | vector<int64_t> | Net topic count changes for global summary update |
Sampling Algorithm Detail
The Sample2WordFirst method implements the two-proposal MH scheme for a single token:
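// Notation: s = current topic, t = proposed topic, w = the token's word.
//   n_td[k]    = count of topic k in the current document
//   n_tw, n_sw = counts of word w in topics t and s
//   n_t, n_s   = total token counts of topics t and s
//   proposal_k = probability of topic k under the proposal in use
//
// For each MH step:
// 1. WORD PROPOSAL: sample t from the alias table of word w
//    pi = min(1, (n_td[t]+alpha)*(n_tw+beta)*(n_s+beta_sum)*proposal_s /
//                ((n_td[s]+alpha)*(n_sw+beta)*(n_t+beta_sum)*proposal_t))
//    Accept t with probability pi (using branchless bit-mask: m = -(rejection < pi))
// 2. DOC PROPOSAL: sample t from the doc-topic distribution or the uniform prior
//    With prob n_td_sum/(n_td_sum+alpha_sum): t = topic of a random token in the doc
//    With prob alpha_sum/(n_td_sum+alpha_sum): t = uniform random topic
//    pi = min(1, (n_td[t]+alpha)*(n_tw+beta)*(n_s+beta_sum)*proposal_s /
//                ((n_td[s]+alpha)*(n_sw+beta)*(n_t+beta_sum)*proposal_t))
//    (same form as step 1; proposal_k is now the doc-proposal probability)
//    Accept t with probability pi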
The branchless accept/reject pattern used throughout:
int m = -(rejection < pi); // m = 0xFFFFFFFF if accept, 0x00000000 if reject
s = (t & m) | (s & ~m); // s = t if accept, s = s if reject
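The acceptance ratio and the branchless update can be combined into one self-contained sketch. The TopicStats struct and both function names are hypothetical and exist only for illustration; the field names follow the notation of the pseudocode in this section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the statistics the MH ratio needs for one topic.
struct TopicStats {
    double n_td;      // count of the topic in the current document
    double n_kw;      // count of the current word in the topic
    double n_k;       // total token count of the topic
    double proposal;  // probability of the topic under the proposal in use
};

// Unclamped MH ratio for moving from topic s to topic t. The min(1, .) clamp
// is unnecessary when the result is compared against a uniform draw in [0, 1).
inline double AcceptRatio(const TopicStats& t, const TopicStats& s,
                          double alpha, double beta, double beta_sum) {
    const double num = (t.n_td + alpha) * (t.n_kw + beta) * (s.n_k + beta_sum) * s.proposal;
    const double den = (s.n_td + alpha) * (s.n_kw + beta) * (t.n_k + beta_sum) * t.proposal;
    assert(den > 0.0);
    return num / den;
}

// Branchless accept/reject: returns t if rejection < pi, otherwise s.
inline int32_t BranchlessSelect(int32_t s, int32_t t, double rejection, double pi) {
    // The comparison yields 0 or 1; negating gives 0x00000000 or 0xFFFFFFFF.
    const int32_t m = -static_cast<int32_t>(rejection < pi);
    return (t & m) | (s & ~m);
}
```

Replacing the conditional with a mask avoids a data-dependent branch in the innermost sampling loop, where the accept/reject outcome is close to random and therefore hard for a branch predictor to guess.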
Usage Examples
// Internal C++ usage within LdaEngine::Training_Thread():
LightDocSampler &sampler = *(samplers_[thread_id]);
sampler.AdaptAlphaSum(true);
sampler.build_word_topic_table(thread_id, num_threads_, *model_block_);
// For each iteration:
sampler.build_alias_table(word_lower, word_upper, thread_id);
sampler.EpocInit();
for (int doc_index = doc_start; doc_index < doc_end; ++doc_index) {
    auto doc = data_block_->GetOneDoc(doc_index);
    token_count += sampler.SampleOneDoc(doc.get());
}
// Log-likelihood evaluation (doc is a document obtained via GetOneDoc, as in the loop above):
double doc_ll = sampler.ComputeOneDocLLH(doc.get());
double word_ll = sampler.ComputeWordLLH(word_lower, word_upper);
double norm_ll = sampler.NormalizeWordLLH();
Related Pages
- Principle:Dotnet_Machinelearning_Latent_Dirichlet_Allocation
- Principle:Dotnet_Machinelearning_Alias_Method_Sampling
- Implementation:Dotnet_Machinelearning_LdaEngine
- Implementation:Dotnet_Machinelearning_LdaHybridMap
- Implementation:Dotnet_Machinelearning_LdaHybridAliasMap
- Implementation:Dotnet_Machinelearning_LdaLightHashMap
- Implementation:Dotnet_Machinelearning_AliasMultinomialRng
- Environment:Dotnet_Machinelearning_Native_Build_Toolchain