Implementation:Dotnet Machinelearning LdaEngine


Knowledge Sources
Domains: Topic_Modeling, NLP, Machine_Learning
Last Updated: 2026-02-09 12:00 GMT

Overview

LdaEngine is the top-level native C++ class that orchestrates the entire Latent Dirichlet Allocation pipeline, including data ingestion, multi-threaded collapsed Gibbs sampling for training and inference, model serialization, and log-likelihood evaluation.

Description

The LdaEngine class lives in the lda namespace and serves as the single entry point for the ML.NET LDA implementation. It manages the complete lifecycle of topic model training and inference:

  • Data management: Accepts sparse or dense document-term data through FeedInData() and FeedInDataDense(), storing it in an internal LDADataBlock.
  • Model management: Allocates and manages the word-topic count matrix via LDAModelBlock, supporting three overloads of AllocateModelMemory() for different initialization scenarios (from data block, from nonzero count, or from explicit memory sizes).
  • Multi-threaded training: The Train() method spawns num_threads_ worker threads, each executing Training_Thread(). Each iteration involves: (1) building alias tables for the beta smoothing term and per-word distributions, (2) sampling all documents in the thread's partition via LightDocSampler::SampleOneDoc(), and (3) synchronizing word-topic deltas across threads using a barrier and mutex-protected global summary update.
  • Multi-threaded testing: The Test() method similarly spawns threads running Testing_Thread(), performing burn-in iterations of inference using LightDocSampler::InferOneDoc() while computing per-iteration log-likelihood.
  • Single-document inference: TestOneDoc() and TestOneDocDense() provide thread-safe single-document inference by drawing samplers from a CBlockedIntQueue, enabling concurrent calls from multiple C# threads.
  • Thread affinity: Training and testing threads set CPU affinity masks (platform-specific: Windows SetThreadAffinityMask, macOS THREAD_AFFINITY_POLICY, Linux sched_setaffinity) for improved cache locality.
  • Alpha adaptation: SetAlphaSum() scales the document-topic prior sum (alpha_sum_) by the average document length, while LightDocSampler::AdaptAlphaSum() adjusts alpha between the training regime (minimum 100) and the inference regime (maximum 1).
  • Log-likelihood evaluation: EvalLogLikelihood() computes the complete data log-likelihood by summing document-topic, word-topic, and normalization terms across threads.
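
To make the log-likelihood bullet concrete, here is a minimal sketch of the standard complete-data log-likelihood for collapsed LDA with symmetric priors. It is not taken from EvalLogLikelihood(): the function name CompleteLogLikelihood, the dense count layout, and the single-threaded loop are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of LdaEngine: complete-data log-likelihood
// log p(w, z) of collapsed LDA with symmetric priors alpha and beta.
//   doc_topic[d][k]  = tokens in document d assigned to topic k
//   word_topic[w][k] = occurrences of word w assigned to topic k
double CompleteLogLikelihood(const std::vector<std::vector<int>>& doc_topic,
                             const std::vector<std::vector<int>>& word_topic,
                             double alpha, double beta) {
    const std::size_t V = word_topic.size();
    const std::size_t K = V == 0 ? 0 : word_topic[0].size();
    double ll = 0.0;

    // Document-topic terms plus the per-document normalization term.
    for (const auto& row : doc_topic) {
        long long doc_len = 0;
        for (int n : row) {
            ll += std::lgamma(n + alpha) - std::lgamma(alpha);
            doc_len += n;
        }
        ll += std::lgamma(K * alpha) - std::lgamma(doc_len + K * alpha);
    }

    // Word-topic terms, accumulating the K-length summary row on the way.
    std::vector<long long> summary(K, 0);
    for (const auto& row : word_topic) {
        for (std::size_t k = 0; k < K; ++k) {
            ll += std::lgamma(row[k] + beta) - std::lgamma(beta);
            summary[k] += row[k];
        }
    }

    // Per-topic normalization terms.
    for (std::size_t k = 0; k < K; ++k) {
        ll += std::lgamma(V * beta) - std::lgamma(summary[k] + V * beta);
    }
    return ll;
}
```

In a multi-threaded setting, each of the three loops can be split across threads and the partial sums added, which matches the "summing across threads" description above.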

The native library is exposed to ML.NET managed code through a set of C-linkage exports defined in lda_engine_export.cpp, using the EXPORT_API macro for cross-platform shared library symbol visibility.
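
The definition of EXPORT_API is not reproduced on this page. The sketch below shows one common way such a macro is written and how a C++ object can be wrapped behind flat C-linkage functions. The macro name DEMO_EXPORT_API, the ToyEngine class, and the Toy* functions are hypothetical stand-ins, not the library's code.

```cpp
// Assumed shape of a cross-platform export macro; the real EXPORT_API
// definition in the native library may differ (for example in calling
// convention).
#if defined(_WIN32)
#define DEMO_EXPORT_API(ret) extern "C" __declspec(dllexport) ret
#else
#define DEMO_EXPORT_API(ret) extern "C" __attribute__((visibility("default"))) ret
#endif

// Hypothetical stand-in for lda::LdaEngine, used only to show the pattern.
class ToyEngine {
public:
    explicit ToyEngine(int num_topic) : num_topic_(num_topic) {}
    int NumTopic() const { return num_topic_; }
private:
    int num_topic_;
};

// Flat C-linkage wrappers: the engine pointer travels as an opaque handle,
// which is what a P/Invoke caller would hold on the managed side.
DEMO_EXPORT_API(ToyEngine*) ToyCreateEngine(int num_topic) {
    return new ToyEngine(num_topic);
}

DEMO_EXPORT_API(int) ToyGetNumTopic(ToyEngine* engine) {
    return engine->NumTopic();
}

DEMO_EXPORT_API(void) ToyDestroyEngine(ToyEngine* engine) {
    delete engine;
}
```

Because the wrappers use unmangled C names, managed code can bind to them by name alone, and the create/destroy pair keeps allocation and deallocation inside the same native module.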

Usage

This engine is used when ML.NET's LatentDirichletAllocationEstimator needs to train or apply an LDA topic model. The managed C# code calls the exported C functions via P/Invoke, passing raw pointers to term IDs, frequencies, and output buffers. The typical workflow is:

  1. Call CreateEngine with hyperparameters.
  2. Call AllocateDataMemory for the corpus size.
  3. Call FeedInData or FeedInDataDense for each document.
  4. Call InitializeBeforeTrain to set up model and samplers.
  5. Call Train to run collapsed Gibbs sampling.
  6. Call GetModelStat and GetWordTopic to read out the trained model for serialization.
  7. For inference: call AllocateModelMemory and SetWordTopic to load a model, then InitializeBeforeTest, followed by TestOneDoc or Test.

Code Reference

Source Location

Signature

namespace lda {
    class LdaEngine {
    public:
        LdaEngine(int numTopic, int numVocab, float alphaSum, float beta,
                  int numIter, int likelihoodInterval, int numThread,
                  int mhstep, int maxDocToken);

        void AllocateDataMemory(int num_document, int64_t corpus_size);
        void AllocateModelMemory(const LDADataBlock& data_block);
        void AllocateModelMemory(int num_vocabs, int num_topics, int64_t nonzero_num);
        void AllocateModelMemory(int num_vocabs, int num_topics,
                                 int64_t mem_block_size, int64_t alias_mem_block_size);

        int FeedInData(int* term_id, int* term_freq, int32_t term_num, int32_t vocab_size);
        int FeedInDataDense(int* term_freq, int32_t term_num, int32_t vocab_size);
        void SetAlphaSum(float avgDocLength);

        bool InitializeBeforeTrain();
        void InitializeBeforeTest();

        void Train(const char* pTrainOutput = nullptr);
        void Test(int32_t burnin_iter, float* pLoglikelihood);

        void TestOneDoc(int* term_id, int* term_freq, int32_t term_num,
                        int* pTopics, int* pProbs, int32_t& numTopicsMax,
                        int32_t numBurnIter, bool reset);
        void TestOneDocDense(int* term_freq, int32_t term_num,
                             int* pTopics, int* pProbs, int32_t& numTopicsMax,
                             int32_t numBurnIter, bool reset);

        void GetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t& length);
        void SetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t length);
        void GetDocTopic(int docID, int* pTopic, int* pProb, int32_t& numTopicReturn);
        void GetTopicSummary(int32_t topicId, int32_t* pWords, float* pProb, int32_t& length);
        void GetModelStat(int64_t& memBlockSize, int64_t& aliasMemBlockSize);

        bool ClearData();
        bool ClearModel();
    };
}

C Export API (lda_engine_export.cpp)

EXPORT_API(LdaEngine*) CreateEngine(int numTopic, int numVocab, float alphaSum,
    float beta, int numIter, int likelihoodInterval, int numThread,
    int mhstep, int maxDocToken);
EXPORT_API(void) DestroyEngine(LdaEngine* engine);
EXPORT_API(void) AllocateModelMemory(LdaEngine* engine, int numTopic, int numVocab,
    int64_t tableSize, int64_t aliasTableSize);
EXPORT_API(void) AllocateDataMemory(LdaEngine* engine, int num_document, int64_t corpus_size);
EXPORT_API(void) Train(LdaEngine* engine, const char* trainOutput);
EXPORT_API(void) Test(LdaEngine* engine, int32_t burnin_iter, float* pLoglikelihood);
EXPORT_API(void) CleanData(LdaEngine* engine);
EXPORT_API(void) CleanModel(LdaEngine* engine);
EXPORT_API(void) GetModelStat(LdaEngine* engine, int64_t& memBlockSize, int64_t& aliasMemBlockSize);
EXPORT_API(void) GetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
    int32_t* pProb, int32_t& length);
EXPORT_API(void) SetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
    int32_t* pProb, int32_t length);
EXPORT_API(void) GetTopicSummary(LdaEngine* engine, int32_t topicId, int32_t* pWords,
    float* pProb, int32_t& length);
EXPORT_API(void) SetAlphaSum(LdaEngine* engine, float avgDocLength);
EXPORT_API(int) FeedInData(LdaEngine* engine, int* term_id, int* term_freq,
    int32_t term_num, int32_t vocab_size);
EXPORT_API(int) FeedInDataDense(LdaEngine* engine, int* term_freq,
    int32_t term_num, int32_t vocab_size);
EXPORT_API(void) GetDocTopic(LdaEngine* engine, int docID, int* pTopic,
    int* pProb, int32_t& numTopicReturn);
EXPORT_API(void) TestOneDoc(LdaEngine* engine, int* term_id, int* term_freq,
    int32_t term_num, int* pTopics, int* pProbs, int32_t& numTopicsMax,
    int32_t numBurnIter, bool reset);
EXPORT_API(void) TestOneDocDense(LdaEngine* engine, int* term_freq, int32_t term_num,
    int* pTopics, int* pProbs, int32_t& numTopicsMax, int32_t numBurnIter, bool reset);
EXPORT_API(void) InitializeBeforeTrain(LdaEngine* engine);
EXPORT_API(void) InitializeBeforeTest(LdaEngine* engine);

Import

// P/Invoke from C# managed code (ML.NET LdaNative interop)
[DllImport("LdaNative")]
private static extern SafeLdaEngineHandle CreateEngine(
    int numTopic, int numVocab, float alphaSum, float beta,
    int numIter, int likelihoodInterval, int numThread,
    int mhstep, int maxDocToken);

[DllImport("LdaNative")]
private static extern void DestroyEngine(IntPtr engine);

[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
    SafeLdaEngineHandle engine, int numDocument, long corpusSize);

[DllImport("LdaNative")]
private static unsafe extern int FeedInData( // int* parameters require an unsafe context
    SafeLdaEngineHandle engine, int* termId, int* termFreq,
    int termNum, int vocabSize);

[DllImport("LdaNative")]
private static extern void InitializeBeforeTrain(SafeLdaEngineHandle engine);

[DllImport("LdaNative")]
private static extern void Train(SafeLdaEngineHandle engine, string trainOutput);

I/O Contract

Inputs

Name | Type | Required | Description
numTopic | int | Yes | Number of latent topics K. Controls model capacity.
numVocab | int | Yes | Vocabulary size V. Total number of unique terms.
alphaSum | float | Yes | Sum of Dirichlet document-topic prior. Multiplied by avgDocLength via SetAlphaSum.
beta | float | Yes | Dirichlet word-topic prior (symmetric). Typical values: 0.01-0.1. beta_sum = beta * V.
numIter | int | Yes | Number of Gibbs sampling iterations for training.
likelihoodInterval | int | Yes | Compute log-likelihood every N iterations. Set to -1 to disable.
numThread | int | Yes | Number of worker threads. If 0 or negative, defaults to hardware_concurrency() - 2.
mhstep | int | Yes | Number of Metropolis-Hastings steps per token in the Gibbs sampler.
maxDocToken | int | Yes | Maximum number of tokens per document. Used to pre-allocate per-thread document buffers (size = maxDocToken * 2 + 1).
term_id | int* | Yes (FeedInData) | Array of term IDs for a single document.
term_freq | int* | Yes | Array of term frequencies corresponding to term_id entries.
term_num | int32_t | Yes | Number of unique terms in the document being fed.
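
The last three inputs describe one document as parallel arrays. As an illustration of that layout, the sketch below collapses a token sequence into term_id, term_freq, and term_num. The SparseDoc struct and EncodeSparse function are hypothetical helpers, and emitting term IDs in ascending order is a choice of this sketch, not a documented requirement of FeedInData.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical container for the three per-document arguments.
struct SparseDoc {
    std::vector<int> term_id;    // unique term IDs (ascending in this sketch)
    std::vector<int> term_freq;  // term_freq[i] is the count of term_id[i]
    int32_t term_num = 0;        // number of unique terms
};

// Collapse a token sequence into parallel (term_id, term_freq) arrays.
SparseDoc EncodeSparse(const std::vector<int>& tokens) {
    std::map<int, int> counts;   // ordered map, so IDs come out sorted
    for (int t : tokens) {
        ++counts[t];
    }

    SparseDoc doc;
    for (const auto& kv : counts) {
        doc.term_id.push_back(kv.first);
        doc.term_freq.push_back(kv.second);
    }
    doc.term_num = static_cast<int32_t>(doc.term_id.size());
    return doc;
}
```

With the member signature listed in the Signature section, such a document would then be passed as FeedInData(doc.term_id.data(), doc.term_freq.data(), doc.term_num, vocab_size).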

Outputs

Name | Type | Description
pTopics | int* | Output array of topic IDs for GetWordTopic/GetDocTopic/TestOneDoc.
pProbs | int*/float* | Output array of topic counts or probabilities corresponding to pTopics.
numTopicsMax | int32_t& | In/out: max topics to return on input, actual count on output.
pLoglikelihood | float* | Output array of per-iteration log-likelihood values from Test().
memBlockSize | int64_t& | Output: size of the model's word-topic memory block.
aliasMemBlockSize | int64_t& | Output: size of the alias table memory block.
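
The in/out convention used by numTopicsMax is easy to get wrong on the caller side, so here is a small sketch of a callee that follows it. WriteNonZeroTopics is a hypothetical function written for this illustration; it is not how TestOneDoc or GetDocTopic are implemented.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical callee following the numTopicsMax convention: on entry
// num_topics_max is the capacity of p_topics/p_probs, on return it holds
// the number of entries actually written.
void WriteNonZeroTopics(const std::vector<int>& topic_counts,
                        int* p_topics, int* p_probs,
                        int32_t& num_topics_max) {
    int32_t written = 0;
    const int32_t K = static_cast<int32_t>(topic_counts.size());
    for (int32_t k = 0; k < K; ++k) {
        if (topic_counts[k] == 0) continue;    // skip empty topics
        if (written == num_topics_max) break;  // never overrun the buffers
        p_topics[written] = k;
        p_probs[written] = topic_counts[k];
        ++written;
    }
    num_topics_max = written;
}
```

The caller sizes both buffers to the value it passes in, and afterwards reads only the first num_topics_max entries.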

Usage Examples

// Typical ML.NET LDA training workflow via P/Invoke
// Step 1: Create engine with hyperparameters
var engine = CreateEngine(
    numTopic: 100,    // K=100 topics
    numVocab: 50000,  // V=50000 vocabulary
    alphaSum: 100.0f, // will be adjusted by SetAlphaSum
    beta: 0.01f,      // symmetric Dirichlet prior
    numIter: 200,     // 200 Gibbs sampling iterations
    likelihoodInterval: 10, // log-likelihood every 10 iters
    numThread: 4,     // 4 worker threads
    mhstep: 4,        // 4 MH steps per token
    maxDocToken: 512  // max tokens per document
);

// Step 2: Allocate data memory
AllocateDataMemory(engine, numDocuments, corpusSize);

// Step 3: Feed in documents
foreach (var doc in documents)
{
    FeedInData(engine, doc.TermIds, doc.TermFreqs, doc.NumTerms, vocabSize);
}

// Step 4: Set alpha based on average document length
SetAlphaSum(engine, avgDocLength);

// Step 5: Initialize and train
InitializeBeforeTrain(engine);
Train(engine, null);

// Step 6: Extract trained model
long memSize = 0, aliasMemSize = 0; // ref arguments must be assigned before the call
GetModelStat(engine, ref memSize, ref aliasMemSize);

// Step 7: Single-document inference
int numTopics = 100;
TestOneDoc(engine, termIds, termFreqs, termNum,
           topicBuf, probBuf, ref numTopics, burnInIter: 100, reset: true);

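// DestroyEngine takes an IntPtr, so release the engine through the handle;
// SafeLdaEngineHandle is assumed to call DestroyEngine when it is released.
engine.Dispose();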

Internal Architecture

Key Private Members

Member | Type | Description
K_ | int32_t | Number of topics
V_ | int32_t | Vocabulary size
beta_ / beta_sum_ | float | Word-topic Dirichlet prior and its sum (beta * V)
alpha_sum_ | float | Document-topic Dirichlet prior sum
global_word_topic_table_ | vector<hybrid_map> | V-length array of word-topic count rows
global_alias_k_v_ | vector<hybrid_alias_map> | V-length array of per-word alias tables
global_summary_row_ | vector<int64_t> | K-length topic count summary (sum over all words)
data_block_ | unique_ptr<LDADataBlock> | Holds all document-term data
model_block_ | unique_ptr<LDAModelBlock> | Manages word-topic memory layout
samplers_ | unique_ptr<unique_ptr<LightDocSampler>[]> | Per-thread Gibbs samplers
process_barrier_ | unique_ptr<SimpleBarrier> | Thread synchronization barrier
samplerQueue_ | unique_ptr<CBlockedIntQueue> | Thread-safe queue for TestOneDoc concurrency
document_buffer_ | int32_t** | Per-thread document encoding buffers
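
The per-word alias tables in global_alias_k_v_ exist so that a topic can be drawn in constant time. The internal layout of hybrid_alias_map is not described here, so the following is only a textbook alias table (Vose's construction) over a dense weight vector. The AliasTable type is hypothetical and is meant to show the idea, not the engine's data structure.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Textbook alias table (Vose's method); not the engine's hybrid_alias_map.
struct AliasTable {
    std::vector<double> prob;  // acceptance probability per bucket
    std::vector<int> alias;    // fallback outcome per bucket

    explicit AliasTable(const std::vector<double>& weights)
        : prob(weights.size()), alias(weights.size()) {
        const std::size_t n = weights.size();
        double total = 0.0;
        for (double w : weights) total += w;

        // Scale so that an average bucket has mass exactly 1.
        std::vector<double> scaled(n);
        std::vector<int> small, large;
        for (std::size_t i = 0; i < n; ++i) {
            scaled[i] = weights[i] * n / total;
            (scaled[i] < 1.0 ? small : large).push_back(static_cast<int>(i));
        }
        // Top up each under-full bucket from an over-full one.
        while (!small.empty() && !large.empty()) {
            int s = small.back(); small.pop_back();
            int l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] -= 1.0 - scaled[s];
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Whatever remains is a full bucket (up to rounding error).
        for (int i : large) { prob[i] = 1.0; alias[i] = i; }
        for (int i : small) { prob[i] = 1.0; alias[i] = i; }
    }

    // O(1) draw: pick a bucket uniformly, then accept it or take its alias.
    template <class Rng>
    int Sample(Rng& rng) const {
        std::uniform_int_distribution<int> bucket(
            0, static_cast<int>(prob.size()) - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        const int i = bucket(rng);
        return coin(rng) < prob[i] ? i : alias[i];
    }
};
```

Construction is linear in the number of outcomes, which is why tables of this kind are typically rebuilt once per iteration and then reused for many constant-time draws.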

Threading Model

Training and testing use a fork-join pattern: the main thread spawns N worker threads, each processing a disjoint partition of documents, and waits for them to finish. Workers synchronize at SimpleBarrier wait points. The global word-topic table is updated in a partitioned manner: each thread owns a range of words (word_range_for_each_thread_), and word-topic deltas are sharded by word ID modulo num_threads. Updates to the global summary row are serialized with a mutex.
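
The fork-join structure can be sketched with standard C++ threading primitives. In the example below, DemoBarrier stands in for SimpleBarrier (whose implementation is not shown on this page), the strided document partition is an assumption, and incrementing a counter is a placeholder for sampling a document. Only the shape of the synchronization is meant to carry over.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier built from a mutex and a condition variable.
// Hypothetical stand-in for the engine's SimpleBarrier.
class DemoBarrier {
public:
    explicit DemoBarrier(int count)
        : threshold_(count), waiting_(0), generation_(0) {}

    void Wait() {
        std::unique_lock<std::mutex> lock(mu_);
        const int gen = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;  // release everyone waiting on this round
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int threshold_, waiting_, generation_;
};

// Fork-join run: worker `id` handles documents d with d % num_threads == id,
// accumulates a thread-local topic delta, then merges it under a mutex.
std::vector<int64_t> RunIterations(const std::vector<int>& doc_topic,
                                   int num_topics, int num_threads,
                                   int num_iters) {
    std::vector<int64_t> summary(num_topics, 0);
    std::mutex summary_mu;
    DemoBarrier barrier(num_threads);

    auto worker = [&](int id) {
        for (int it = 0; it < num_iters; ++it) {
            std::vector<int64_t> delta(num_topics, 0);
            for (std::size_t d = static_cast<std::size_t>(id);
                 d < doc_topic.size();
                 d += static_cast<std::size_t>(num_threads)) {
                ++delta[doc_topic[d]];  // placeholder for sampling one document
            }
            {
                std::lock_guard<std::mutex> lock(summary_mu);
                for (int k = 0; k < num_topics; ++k) summary[k] += delta[k];
            }
            barrier.Wait();  // all deltas are merged before the next iteration
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_threads; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();
    return summary;
}
```

Keeping the delta thread-local and taking the mutex once per iteration, rather than once per token, keeps lock contention low while still giving every thread a consistent summary at the start of the next iteration.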
