Implementation:Dotnet Machinelearning LdaDocumentSampler

From Leeroopedia


Knowledge Sources
Domains Topic_Modeling, NLP, Sampling
Last Updated 2026-02-09 12:00 GMT

Overview

LightDocSampler is the per-thread document-level Gibbs sampler that performs collapsed Gibbs sampling with Metropolis-Hastings acceleration and alias table proposals for efficient LDA topic assignment.

Description

The LightDocSampler class implements the core sampling algorithm used by LdaEngine for both training and inference. Each thread owns its own sampler instance, which holds references (not copies) to the shared global word-topic table, summary row, and alias tables. The sampler uses a two-proposal Metropolis-Hastings scheme inspired by the LightLDA algorithm:

Training (SampleOneDoc / OldProposalFreshSample):

  • For each token in a document, the sampler proposes a new topic assignment using two alternating proposal distributions:
    • Word proposal: Samples from the word-specific alias table alias_k_v_[w], which encodes P(k|w) proportional to (n_kw + beta) / (n_k + beta_sum). This proposal is accepted or rejected via a Metropolis-Hastings ratio that accounts for the document-topic component.
    • Document proposal: Samples from the empirical document-topic distribution (with probability n_td_sum / (n_td_sum + alpha_sum)) or the uniform prior (with probability alpha_sum / (n_td_sum + alpha_sum)). Accepted or rejected via an MH ratio involving the word-topic component.
  • Each token runs mh_step_for_gs_ MH steps, each consisting of one word proposal followed by one document proposal, to improve mixing.
  • Topic changes are recorded as word_topic_delta structs, sharded by word ID (one shard per thread) so they can be applied in parallel without locks.
  • The delta_summary_row_ accumulates net topic count changes for the thread.
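The word proposal above relies on alias tables, which allow a constant-time draw from a fixed discrete distribution. The following is a minimal sketch of Vose's alias method, not the library's own alias_k_v layout: the names AliasTable, BuildAlias, SampleAlias, and ExactMass are hypothetical, and the weights are assumed to be the unnormalized per-topic values such as (n_kw + beta) / (n_k + beta_sum).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a per-word alias table (not the library's layout).
struct AliasTable {
    std::vector<double> keep;    // probability of keeping bucket i
    std::vector<int32_t> alias;  // topic returned when bucket i is not kept
};

// Vose's alias method: O(K) construction from unnormalized weights.
inline AliasTable BuildAlias(const std::vector<double>& weights) {
    assert(!weights.empty());
    const int32_t K = static_cast<int32_t>(weights.size());
    double total = 0.0;
    for (double w : weights) total += w;
    assert(total > 0.0);

    AliasTable table;
    table.keep.assign(K, 1.0);
    table.alias.resize(K);
    std::vector<double> scaled(K);
    std::vector<int32_t> under, over;  // buckets below / at-or-above average mass
    for (int32_t k = 0; k < K; ++k) {
        table.alias[k] = k;
        scaled[k] = weights[k] * K / total;
        (scaled[k] < 1.0 ? under : over).push_back(k);
    }
    while (!under.empty() && !over.empty()) {
        const int32_t s = under.back(); under.pop_back();
        const int32_t l = over.back();  over.pop_back();
        table.keep[s] = scaled[s];
        table.alias[s] = l;            // topic l donates mass to fill bucket s
        scaled[l] -= 1.0 - scaled[s];
        (scaled[l] < 1.0 ? under : over).push_back(l);
    }
    return table;  // leftover buckets keep probability 1.0
}

// O(1) draw: pick a bucket uniformly, then keep it or take its alias.
inline int32_t SampleAlias(const AliasTable& table, std::mt19937& rng) {
    std::uniform_int_distribution<int32_t> pick(
        0, static_cast<int32_t>(table.keep.size()) - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    const int32_t i = pick(rng);
    return coin(rng) < table.keep[i] ? i : table.alias[i];
}

// Probability the table assigns to topic k, useful for checking a build.
inline double ExactMass(const AliasTable& table, int32_t k) {
    double mass = 0.0;
    const std::size_t K = table.keep.size();
    for (std::size_t i = 0; i < K; ++i) {
        if (static_cast<int32_t>(i) == k) mass += table.keep[i];
        if (table.alias[i] == k) mass += 1.0 - table.keep[i];
    }
    return mass / static_cast<double>(K);
}
```

Because the table is built once per word and reused for many tokens, the proposal can go stale as counts change; the MH accept/reject step is what corrects for that staleness.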

Inference (InferOneDoc / OldProposalFreshSampleInfer):

  • Uses Sample2WordFirstInfer(), which simplifies the word-proposal acceptance ratio by dropping the word-topic likelihood terms (the global model is frozen during inference). The word-proposal MH step uses only the document-topic component n_td_alpha, while the document-proposal step still uses the full word-topic ratio.
  • Does not record word_topic_delta changes (the global model is not updated during inference).
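One way to see why the word-topic terms can drop out: if the target is proportional to (n_kd + alpha) times a frozen word-topic term, and the word proposal is proportional to that same frozen term, the two cancel in the MH ratio and only the document-topic factors remain. The sketch below illustrates that reduced ratio; the function name is hypothetical, and it assumes the alias table matches the frozen word-topic term exactly, which the page does not state.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: acceptance probability for the word proposal at
// inference time, assuming the word-topic terms cancel against the proposal.
//   n_td_alpha = (count of proposed topic t in the document) + alpha
//   n_sd_alpha = (count of current topic s in the document) + alpha
inline double InferWordProposalAccept(double n_td_alpha, double n_sd_alpha) {
    assert(n_td_alpha > 0.0 && n_sd_alpha > 0.0);
    return std::min(1.0, n_td_alpha / n_sd_alpha);
}
```

For example, moving from a topic with n_sd_alpha = 5.0 to one with n_td_alpha = 2.5 would be accepted half of the time under this assumption, while the reverse move would always be accepted.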

Key data structures:

  • doc_topic_counter_: A light_hash_map (capacity 1024) that stores the current document's topic counts, rebuilt at the start of each document via DocInit().
  • q_w_proportion_: A K-length vector reused for computing per-word alias table proportions.
  • word_topic_delta_: A vector of vectors (one per thread shard) storing pending word-topic count deltas.
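The sharded delta buffer can be sketched as follows. This is an illustration only: the DeltaBuffer class and the routing rule shard = word % num_shards are assumptions, since the page says only that deltas are sharded by word ID with one shard per thread. The word_topic_delta struct mirrors the one in the signature on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct word_topic_delta {
    int32_t word;
    int32_t topic;
    int32_t delta;
};

// Hypothetical per-sampler buffer of pending word-topic count changes.
class DeltaBuffer {
public:
    explicit DeltaBuffer(int32_t num_shards)
        : shards_(static_cast<std::size_t>(num_shards)) {
        assert(num_shards > 0);
    }

    // Record that one token of `word` moved from old_topic to new_topic.
    void RecordMove(int32_t word, int32_t old_topic, int32_t new_topic) {
        if (old_topic == new_topic) return;  // nothing changed
        assert(word >= 0);
        // Assumed routing rule: all deltas for a word land in a single shard.
        auto& shard = shards_[static_cast<std::size_t>(word) % shards_.size()];
        shard.push_back({word, old_topic, -1});
        shard.push_back({word, new_topic, +1});
    }

    const std::vector<word_topic_delta>& Shard(int32_t i) const {
        return shards_[static_cast<std::size_t>(i)];
    }

private:
    std::vector<std::vector<word_topic_delta>> shards_;
};
```

With a rule like this, each shard touches a disjoint set of word rows, so one thread can apply shard i from every sampler without taking a lock on the word-topic table.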

Log-likelihood computation:

  • ComputeOneDocLLH(): Computes the document-topic component using the LogGamma function over the doc-topic counts.
  • ComputeWordLLH(): Computes the word-topic component for a range of words, handling both dense and sparse hybrid_map representations.
  • NormalizeWordLLH(): Adds the K * log_topic_normalizer_ term and subtracts LogGamma(n_k + beta_sum) for each topic.
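As an illustration of the document-topic component, the sketch below evaluates the standard Dirichlet-multinomial log-likelihood of one document's topic counts with std::lgamma. It assumes a symmetric prior alpha = alpha_sum / K and uses a hypothetical function name; whether ComputeOneDocLLH() includes the same constant terms is not stated on this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of the document-topic log-likelihood:
//   lgamma(alpha_sum) - lgamma(n_d + alpha_sum)
//   + sum over topics k of [lgamma(n_kd + alpha) - lgamma(alpha)]
// Topics with a zero count contribute nothing, so only the sparse
// non-zero counts need to be visited.
inline double DocTopicLogLikelihood(
        const std::unordered_map<int32_t, int32_t>& doc_topic_counts,
        int32_t K, double alpha_sum) {
    assert(K > 0 && alpha_sum > 0.0);
    const double alpha = alpha_sum / K;  // assumed symmetric prior
    double llh = 0.0;
    int64_t doc_len = 0;
    for (const auto& kv : doc_topic_counts) {
        llh += std::lgamma(kv.second + alpha) - std::lgamma(alpha);
        doc_len += kv.second;
    }
    llh += std::lgamma(alpha_sum) - std::lgamma(doc_len + alpha_sum);
    return llh;
}
```

A quick sanity check: with K = 2, alpha_sum = 1, and a single token, the result is log(0.5), the prior probability of either topic.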

Usage

This class is instantiated internally by LdaEngine, one instance per worker thread. It is not called directly from managed code. The engine configures each sampler with shared references to the global tables and invokes SampleOneDoc or InferOneDoc for each document in the thread's partition.

Code Reference

Source Location

Signature

namespace lda {
    struct word_topic_delta {
        int32_t word;
        int32_t topic;
        int32_t delta;
    };

    class LightDocSampler {
    public:
        LightDocSampler(int32_t K, int32_t V, int32_t num_threads, int32_t mh_step,
                        float beta, float alpha_sum,
                        std::vector<lda::hybrid_map>& word_topic_table,
                        std::vector<int64_t>& summary_row,
                        std::vector<lda::hybrid_alias_map>& alias_kv,
                        int32_t& beta_height, float& beta_mass,
                        std::vector<wood::alias_k_v>& beta_k_v);

        int32_t GlobalInit(LDADocument* doc);
        int32_t DocInit(LDADocument* doc);
        void EpocInit();
        void AdaptAlphaSum(bool is_train);

        int32_t SampleOneDoc(LDADocument* doc);
        int32_t InferOneDoc(LDADocument* doc);
        void GetDocTopic(LDADocument* doc, int* pTopics, int* pProbs, int32_t& numTopicsMax);

        double ComputeOneDocLLH(LDADocument* doc);
        double ComputeWordLLH(int32_t lower, int32_t upper);
        double NormalizeWordLLH();

        void build_alias_table(int32_t lower, int32_t upper, int thread_id);
        void build_word_topic_table(int32_t thread_id, int32_t num_threads,
                                    lda::LDAModelBlock& model_block);

    private:
        int32_t Sample2WordFirst(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
        int32_t Sample2WordFirstInfer(LDADocument* doc, int32_t w, int32_t s, int32_t old_topic);
        int32_t OldProposalFreshSample(LDADocument* doc);
        int32_t OldProposalFreshSampleInfer(LDADocument* doc);
    };
}

Import

// LightDocSampler is an internal C++ class; it is not directly exposed via P/Invoke.
// It is used internally by LdaEngine which is the exported interface.
// The managed wrapper interacts only with LdaEngine's exported C functions.

I/O Contract

Inputs

Name | Type | Required | Description
K | int32_t | Yes | Number of topics
V | int32_t | Yes | Vocabulary size
num_threads | int32_t | Yes | Thread count for delta sharding
mh_step | int32_t | Yes | Number of Metropolis-Hastings steps per token position
beta | float | Yes | Symmetric Dirichlet word-topic prior
alpha_sum | float | Yes | Sum of Dirichlet document-topic prior
word_topic_table | vector<hybrid_map>& | Yes | Shared global word-topic count matrix (V rows)
summary_row | vector<int64_t>& | Yes | Shared global topic count summary (K entries)
alias_kv | vector<hybrid_alias_map>& | Yes | Shared per-word alias tables (V entries)
doc | LDADocument* | Yes | Document to sample (contains word-topic pairs)

Outputs

Name | Type | Description
return (SampleOneDoc) | int32_t | Number of tokens swept in this document
pTopics | int* | Topic IDs for the document's topic distribution (GetDocTopic)
pProbs | int* | Topic counts for each returned topic (GetDocTopic)
numTopicsMax | int32_t& | Actual number of non-zero topics returned
word_topic_delta_ | vector<vector<word_topic_delta>> | Accumulated word-topic count changes, sharded by thread
delta_summary_row_ | vector<int64_t> | Net topic count changes for global summary update

Sampling Algorithm Detail

The Sample2WordFirst method implements the two-proposal MH scheme for a single token:

// Notation: s = current topic, t = proposed topic, d = document, w = word.
//   n_kd = count of topic k in d, n_kw = count of word w in topic k,
//   n_k = total count of topic k, proposal_k = proposal probability of topic k.
//
// For each MH step:
// 1. WORD PROPOSAL: sample t from the alias table of word w
//    pi = min(1, (n_td+alpha)*(n_tw+beta)*(n_s+beta_sum)*proposal_s /
//               ((n_sd+alpha)*(n_sw+beta)*(n_t+beta_sum)*proposal_t))
//    Accept t with probability pi (using branchless bit-mask: m = -(rejection < pi))

// 2. DOC PROPOSAL: sample t from the doc-topic distribution or the uniform prior
//    With prob n_td_sum/(n_td_sum+alpha_sum): t = topic of a random token in the doc
//    With prob alpha_sum/(n_td_sum+alpha_sum): t = uniform random topic
//    pi = min(1, (n_td+alpha)*(n_tw+beta)*(n_s+beta_sum)*proposal_s /
//               ((n_sd+alpha)*(n_sw+beta)*(n_t+beta_sum)*proposal_t))
//    Accept t with probability pi
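The document proposal and the acceptance ratio above can be sketched as two small functions. This is a hedged illustration, not the library's code: DrawDocProposal, TopicTerms, and AcceptProb are hypothetical names, and the document is represented simply as a vector of the current topic of each token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draw from the mixture: with probability n_td_sum / (n_td_sum + alpha_sum)
// copy the topic of a random token, otherwise pick a topic uniformly.
inline int32_t DrawDocProposal(const std::vector<int32_t>& token_topics,
                               int32_t K, double alpha_sum,
                               std::mt19937& rng) {
    assert(K > 0 && alpha_sum > 0.0);
    const double n_td_sum = static_cast<double>(token_topics.size());
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    if (unif(rng) * (n_td_sum + alpha_sum) < n_td_sum) {
        std::uniform_int_distribution<std::size_t> pick(
            0, token_topics.size() - 1);
        return token_topics[pick(rng)];
    }
    std::uniform_int_distribution<int32_t> any(0, K - 1);
    return any(rng);
}

// Per-topic factors that enter the acceptance ratio (hypothetical grouping).
struct TopicTerms {
    double n_kd_alpha;    // n_kd + alpha
    double n_kw_beta;     // n_kw + beta
    double n_k_beta_sum;  // n_k + beta_sum
    double proposal;      // proposal probability of this topic
};

// pi = min(1, target(t) * proposal(s) / (target(s) * proposal(t))).
inline double AcceptProb(const TopicTerms& t, const TopicTerms& s) {
    const double num = t.n_kd_alpha * t.n_kw_beta * s.n_k_beta_sum * s.proposal;
    const double den = s.n_kd_alpha * s.n_kw_beta * t.n_k_beta_sum * t.proposal;
    assert(den > 0.0);
    return std::min(1.0, num / den);
}
```

Copying the topic of a random token avoids materializing the document-topic distribution: a topic that occupies n_kd tokens is chosen with probability proportional to n_kd, and the uniform branch supplies the alpha part of the mixture.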

The branchless accept/reject pattern used throughout:

int m = -(rejection < pi);  // m = 0xFFFFFFFF if accept, 0x00000000 if reject
s = (t & m) | (s & ~m);     // s becomes t if accept, stays unchanged if reject
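The same pattern can be wrapped in a small function to make its behavior easy to check. The function name BranchlessSelect is hypothetical; the body follows the two lines above, with rejection assumed to be a uniform draw in [0, 1).

```cpp
#include <cassert>
#include <cstdint>

// Returns t when rejection < pi (accept), otherwise s (reject),
// without a conditional branch on the comparison result.
inline int32_t BranchlessSelect(int32_t s, int32_t t,
                                float rejection, float pi) {
    assert(pi >= 0.0f);  // an acceptance probability is never negative
    // The comparison yields 0 or 1; negating gives 0 or -1 (all bits set).
    const int32_t m = -static_cast<int32_t>(rejection < pi);
    return (t & m) | (s & ~m);
}
```

Avoiding the branch matters here because the accept/reject outcome is close to random, which makes a conditional jump hard for the branch predictor to guess.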

Usage Examples

// Internal C++ usage within LdaEngine::Training_Thread():
LightDocSampler &sampler = *(samplers_[thread_id]);
sampler.AdaptAlphaSum(true);
sampler.build_word_topic_table(thread_id, num_threads_, *model_block_);

// For each iteration:
sampler.build_alias_table(word_lower, word_upper, thread_id);
sampler.EpocInit();
for (int doc_index = doc_start; doc_index < doc_end; ++doc_index) {
    auto doc = data_block_->GetOneDoc(doc_index);
    token_count += sampler.SampleOneDoc(doc.get());
}

// Log-likelihood evaluation:
double doc_ll = sampler.ComputeOneDocLLH(doc.get());
double word_ll = sampler.ComputeWordLLH(word_lower, word_upper);
double norm_ll = sampler.NormalizeWordLLH();
