Implementation:Dotnet Machinelearning LdaDocumentSampler

Knowledge Sources	Dotnet_Machinelearning; LightLDA: Big Topic Models on Modest Computer Clusters
Domains	Topic_Modeling, NLP, Sampling
Last Updated	2026-02-09 12:00 GMT

Overview

LightDocSampler is the per-thread, document-level sampler for LDA topic assignment. It performs collapsed Gibbs sampling accelerated by Metropolis-Hastings (MH) steps with alias-table proposals.

Description

The LightDocSampler class implements the core sampling algorithm used by LdaEngine for both training and inference. Each thread owns its own sampler instance, which holds references (not copies) to the shared global word-topic table, summary row, and alias tables. The sampler uses a two-proposal Metropolis-Hastings scheme inspired by the LightLDA algorithm:

Training (SampleOneDoc / OldProposalFreshSample):

For each token in a document, the sampler proposes a new topic assignment using two alternating proposal distributions:
- Word proposal: Samples from the word-specific alias table alias_k_v_[w], which encodes P(k|w) proportional to (n_kw + beta) / (n_k + beta_sum). This proposal is accepted or rejected via a Metropolis-Hastings ratio that accounts for the document-topic component.
- Document proposal: Samples from the empirical document-topic distribution (with probability n_td_sum / (n_td_sum + alpha_sum)) or the uniform prior (with probability alpha_sum / (n_td_sum + alpha_sum)). Accepted or rejected via an MH ratio involving the word-topic component.
The word/document proposal pair is repeated for mh_step_for_gs_ MH rounds per token to improve mixing.
Topic changes are recorded as word_topic_delta structs, sharded by word ID for lock-free parallel application.
The delta_summary_row_ accumulates net topic count changes for the thread.
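
The word proposal depends on constant-time draws from an alias table. The following is a minimal sketch of Vose's alias method over dense weights, included only to illustrate the idea. The names AliasTable, build_alias, sample_alias, and implied_probability are hypothetical and do not reflect the hybrid_alias_map or wood::alias_k_v layouts used in LdaNative.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical dense alias table (illustration only, not the LdaNative layout).
struct AliasTable {
    std::vector<double> prob;    // probability of keeping bucket i's own index
    std::vector<int32_t> alias;  // fallback index stored in bucket i
};

// Build the table from unnormalized, non-negative weights (Vose's method).
// Assumes weights is non-empty and has a positive sum.
inline AliasTable build_alias(const std::vector<double>& weights) {
    const std::size_t n = weights.size();
    double total = 0.0;
    for (double w : weights) total += w;

    AliasTable table{std::vector<double>(n, 1.0), std::vector<int32_t>(n, 0)};
    std::vector<double> scaled(n);
    std::vector<int32_t> small, large;
    for (std::size_t i = 0; i < n; ++i) {
        table.alias[i] = static_cast<int32_t>(i);
        scaled[i] = weights[i] * static_cast<double>(n) / total;
        (scaled[i] < 1.0 ? small : large).push_back(static_cast<int32_t>(i));
    }
    // Pair each under-full bucket with an over-full one.
    while (!small.empty() && !large.empty()) {
        const int32_t s = small.back(); small.pop_back();
        const int32_t l = large.back(); large.pop_back();
        table.prob[s] = scaled[s];
        table.alias[s] = l;
        scaled[l] -= 1.0 - scaled[s];
        (scaled[l] < 1.0 ? small : large).push_back(l);
    }
    // Buckets left over keep prob = 1.0 and alias = self.
    return table;
}

// O(1) draw: pick a bucket uniformly, then keep it or take its alias.
template <typename Rng>
int32_t sample_alias(const AliasTable& table, Rng& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, table.prob.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    const std::size_t bucket = pick(rng);
    return coin(rng) < table.prob[bucket] ? static_cast<int32_t>(bucket)
                                          : table.alias[bucket];
}

// Probability of topic k implied by the table (useful for checking the build).
inline double implied_probability(const AliasTable& table, int32_t k) {
    double p = 0.0;
    for (std::size_t i = 0; i < table.prob.size(); ++i) {
        if (static_cast<int32_t>(i) == k) p += table.prob[i];
        if (table.alias[i] == k) p += 1.0 - table.prob[i];
    }
    return p / static_cast<double>(table.prob.size());
}
```

Building the table costs O(K) per word, which is why it is rebuilt once per iteration and then reused for every token of that word.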

Inference (InferOneDoc / OldProposalFreshSampleInfer):

Uses Sample2WordFirstInfer(), which simplifies the word-proposal acceptance ratio: because the global model is frozen during inference, the word-topic likelihood terms are dropped and only the document-topic component (n_td_alpha and its counterpart for the current topic) is used. The document proposal still uses the full word-topic ratio.
Does not record word_topic_delta changes (the global model is not updated during inference).
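
The simplified word-proposal step described above can be sketched as follows. This is an illustration under stated assumptions, not the source implementation: the helper name infer_word_accept_prob is hypothetical, alpha is taken to be a symmetric per-topic prior, and the counts are assumed to already exclude the token being resampled.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: acceptance probability for moving from topic s to
// topic t under the word proposal during inference, using only the
// document-topic terms (the word-topic terms are dropped).
inline float infer_word_accept_prob(const std::vector<int32_t>& doc_topic_count,
                                    int32_t s, int32_t t, float alpha) {
    const float n_td_alpha = static_cast<float>(doc_topic_count[t]) + alpha;
    const float n_sd_alpha = static_cast<float>(doc_topic_count[s]) + alpha;
    return std::min(1.0f, n_td_alpha / n_sd_alpha);
}
```

A move toward a topic that is better represented in the document is always accepted, while a move toward a rarer topic is accepted in proportion to the count ratio.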

Key data structures:

doc_topic_counter_: A light_hash_map (capacity 1024) that stores the current document's topic counts, rebuilt at the start of each document via DocInit().
q_w_proportion_: A K-length vector reused for computing per-word alias table proportions.
word_topic_delta_: A vector of vectors (one per thread shard) storing pending word-topic count deltas.
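
A minimal sketch of how a topic change might be recorded into sharded deltas. The shard rule (word % number of shards) and the helper name record_topic_change are assumptions made for illustration; the source only states that deltas are sharded by word ID with one shard per thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the word_topic_delta struct shown in the Signature section.
struct word_topic_delta {
    int32_t word;
    int32_t topic;
    int32_t delta;
};

// Hypothetical helper: record a token of `word` moving from old_topic to
// new_topic. Assumes shard = word % shards.size() (not confirmed by the source).
inline void record_topic_change(std::vector<std::vector<word_topic_delta>>& shards,
                                std::vector<int64_t>& delta_summary_row,
                                int32_t word, int32_t old_topic, int32_t new_topic) {
    if (old_topic == new_topic) return;  // nothing changed, nothing to record
    auto& shard = shards[static_cast<std::size_t>(word) % shards.size()];
    shard.push_back({word, old_topic, -1});
    shard.push_back({word, new_topic, +1});
    --delta_summary_row[old_topic];
    ++delta_summary_row[new_topic];
}
```

Because every delta for a given word lands in the same shard, each shard can later be applied to the global table by a single thread without locking.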

Log-likelihood computation:

ComputeOneDocLLH(): Computes the document-topic component using the LogGamma function over the doc-topic counts.
ComputeWordLLH(): Computes the word-topic component for a range of words, handling both dense and sparse hybrid_map representations.
NormalizeWordLLH(): Adds the K * log_topic_normalizer_ term and subtracts LogGamma(n_k + beta_sum) for each topic.
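
For orientation, the document-topic term has the standard Dirichlet-multinomial form, sketched below with std::lgamma. This is the textbook expression under a symmetric prior, not a transcription of ComputeOneDocLLH(); the function name doc_topic_llh and the (topic, count) pair representation are hypothetical, and the source may group constants differently.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: log p(z_d) for one document under a symmetric
// Dirichlet(alpha) prior over K topics. Only non-zero counts are passed,
// since a zero count contributes lgamma(alpha) - lgamma(alpha) = 0.
inline double doc_topic_llh(
    const std::vector<std::pair<int32_t, int32_t>>& nonzero_counts,  // (topic, count)
    double alpha, int32_t K) {
    const double alpha_sum = alpha * static_cast<double>(K);
    double llh = std::lgamma(alpha_sum);
    int64_t n_d = 0;  // total tokens in the document
    for (const auto& kv : nonzero_counts) {
        llh += std::lgamma(static_cast<double>(kv.second) + alpha) - std::lgamma(alpha);
        n_d += kv.second;
    }
    llh -= std::lgamma(static_cast<double>(n_d) + alpha_sum);
    return llh;
}
```

As a sanity check, a one-token document with K = 2 and alpha = 1 yields log(1/2), the probability of a single draw under a uniform prior.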

Usage

This class is instantiated internally by LdaEngine -- one instance per worker thread. It is not called directly from managed code. The engine configures each sampler with shared references to the global tables and invokes SampleOneDoc or InferOneDoc for each document in the thread's partition.

Code Reference

Source Location

Repository: Dotnet_Machinelearning
File: src/Native/LdaNative/light_doc_sampler.cpp (667 lines)
File: src/Native/LdaNative/light_doc_sampler.hpp (187 lines)

Signature

namespace lda {
    struct word_topic_delta {
        int32_t word;
        int32_t topic;
        int32_t delta;
    };

    class LightDocSampler {
    public:
        LightDocSampler(int32_t K, int32_t V, int32_t num_threads, int32_t mh_step,
                        float beta, float alpha_sum,
                        std::vector<lda::hybrid_map>& word_topic_table,
                        std::vector<int64_t>& summary_row,
                        std::vector<lda::hybrid_alias_map>& alias_kv,
                        int32_t& beta_height, float& beta_mass,
                        std::vector<wood::alias_k_v>& beta_k_v);

        int32_t GlobalInit(LDADocument* doc);
        int32_t DocInit(LDADocument* doc);
        void EpocInit();
        void AdaptAlphaSum(bool is_train);

        int32_t SampleOneDoc(LDADocument* doc);
        int32_t InferOneDoc(LDADocument* doc);
        void GetDocTopic(LDADocument* doc, int* pTopics, int* pProbs, int32_t& numTopicsMax);

        double ComputeOneDocLLH(LDADocument* doc);
        double ComputeWordLLH(int32_t lower, int32_t upper);
        double NormalizeWordLLH();

        void build_alias_table(int32_t lower, int32_t upper, int thread_id);
        void build_word_topic_table(int32_t thread_id, int32_t num_threads,
                                    lda::LDAModelBlock& model_block);

    private:
        int32_t Sample2WordFirst(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
        int32_t Sample2WordFirstInfer(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
        int32_t OldProposalFreshSample(LDADocument* doc);
        int32_t OldProposalFreshSampleInfer(LDADocument* doc);
    };
}

Import

// LightDocSampler is an internal C++ class; it is not directly exposed via P/Invoke.
// It is used internally by LdaEngine which is the exported interface.
// The managed wrapper interacts only with LdaEngine's exported C functions.

I/O Contract

Inputs

Name	Type	Required	Description
K	int32_t	Yes	Number of topics
V	int32_t	Yes	Vocabulary size
num_threads	int32_t	Yes	Thread count for delta sharding
mh_step	int32_t	Yes	Number of Metropolis-Hastings steps per token position
beta	float	Yes	Symmetric Dirichlet word-topic prior
alpha_sum	float	Yes	Sum of Dirichlet document-topic prior
word_topic_table	vector<hybrid_map>&	Yes	Shared global word-topic count matrix (V rows)
summary_row	vector<int64_t>&	Yes	Shared global topic count summary (K entries)
alias_kv	vector<hybrid_alias_map>&	Yes	Shared per-word alias tables (V entries)
doc	LDADocument*	Yes	Document to sample (contains word-topic pairs)

Outputs

Name	Type	Description
return (SampleOneDoc)	int32_t	Number of tokens swept in this document
pTopics	int*	Topic IDs for the document's topic distribution (GetDocTopic)
pProbs	int*	Topic counts for each returned topic (GetDocTopic)
numTopicsMax	int32_t&	Actual number of non-zero topics returned
word_topic_delta_	vector<vector<word_topic_delta>>	Accumulated word-topic count changes, sharded by word ID into one bucket per thread
delta_summary_row_	vector<int64_t>	Net topic count changes for global summary update

Sampling Algorithm Detail

The Sample2WordFirst method implements the two-proposal MH scheme for a single token:

// For each MH step (s = current topic, t = proposed topic, d = document, w = word):
// 1. WORD PROPOSAL: sample t from the alias table of word w
//    pi = min(1, (n_td+alpha)*(n_tw+beta)*(n_s+beta_sum)*proposal_s /
//               ((n_sd+alpha)*(n_sw+beta)*(n_t+beta_sum)*proposal_t))
//    where proposal_k = (n_kw+beta)/(n_k+beta_sum) is the word-proposal mass of topic k
//    Accept t with probability pi (using branchless bit-mask: m = -(rejection < pi))

// 2. DOC PROPOSAL: sample t from doc-topic distribution or uniform prior
//    With prob n_td_sum/(n_td_sum+alpha_sum): t = topic of random token in doc
//    With prob alpha_sum/(n_td_sum+alpha_sum): t = uniform random topic
//    pi = same ratio as in step 1, but with proposal_k proportional to n_kd+alpha
//         (the doc-proposal mass of topic k)
//    Accept t with probability pi
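
The document proposal and the acceptance ratio above can be sketched as two small helpers. This is an illustration only: draw_doc_proposal, TopicTerms, and mh_accept_prob are hypothetical names, and the caller is assumed to supply counts that already exclude the token being resampled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draw a topic from the document proposal, a mixture of
// the empirical token topics and the uniform prior over K topics.
template <typename Rng>
int32_t draw_doc_proposal(const std::vector<int32_t>& token_topics, int32_t K,
                          float alpha_sum, Rng& rng) {
    const float n_td_sum = static_cast<float>(token_topics.size());
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    if (unif(rng) * (n_td_sum + alpha_sum) < n_td_sum) {
        // Empirical branch: topic of a uniformly chosen token in the document.
        std::uniform_int_distribution<std::size_t> pick(0, token_topics.size() - 1);
        return token_topics[pick(rng)];
    }
    // Prior branch: uniform over all topics.
    std::uniform_int_distribution<int32_t> any(0, K - 1);
    return any(rng);
}

// Per-topic terms entering the acceptance ratio.
struct TopicTerms {
    float n_d_alpha;   // n_kd + alpha
    float n_w_beta;    // n_kw + beta
    float n_beta_sum;  // n_k + beta_sum
    float proposal;    // mass of topic k under the active proposal
};

// pi = min(1, numerator / denominator) for a move from topic s to topic t.
inline float mh_accept_prob(const TopicTerms& s, const TopicTerms& t) {
    const float numerator = t.n_d_alpha * t.n_w_beta * s.n_beta_sum * s.proposal;
    const float denominator = s.n_d_alpha * s.n_w_beta * t.n_beta_sum * t.proposal;
    return std::min(1.0f, numerator / denominator);
}
```

The same mh_accept_prob serves both proposals; only the proposal field changes between the word step and the document step.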

The branchless accept/reject pattern used throughout:

int m = -(rejection < pi);  // m = 0xFFFFFFFF if accept, 0x00000000 if reject
s = (t & m) | (s & ~m);     // s = t if accept, s = s if reject
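
The pattern can be wrapped in a small standalone function to make its behavior easy to check. The name branchless_select is hypothetical. The sketch relies on two's-complement integers, which is guaranteed from C++20 and holds on mainstream compilers before that.

```cpp
#include <cstdint>

// Hypothetical wrapper: returns t when rejection < pi (accept), otherwise s.
inline int32_t branchless_select(int32_t s, int32_t t, float rejection, float pi) {
    // The comparison yields 0 or 1; negating gives an all-zeros or all-ones mask.
    const int32_t m = -static_cast<int32_t>(rejection < pi);
    return (t & m) | (s & ~m);
}
```

Avoiding a data-dependent branch here keeps the inner sampling loop free of mispredictions, since acceptance is close to a coin flip for many tokens.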

Usage Examples

// Internal C++ usage within LdaEngine::Training_Thread():
LightDocSampler &sampler = *(samplers_[thread_id]);
sampler.AdaptAlphaSum(true);
sampler.build_word_topic_table(thread_id, num_threads_, *model_block_);

// For each iteration:
sampler.build_alias_table(word_lower, word_upper, thread_id);
sampler.EpocInit();
for (int doc_index = doc_start; doc_index < doc_end; ++doc_index) {
    auto doc = data_block_->GetOneDoc(doc_index);
    token_count += sampler.SampleOneDoc(doc.get());
}

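// Log-likelihood evaluation:
// Document-level term, evaluated per document while doc is in scope
double doc_ll = sampler.ComputeOneDocLLH(doc.get());
// Word-level term for a range of words, plus the global normalization term
double word_ll = sampler.ComputeWordLLH(word_lower, word_upper);
double norm_ll = sampler.NormalizeWordLLH();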
