Implementation:Dotnet Machinelearning LdaEngine

Knowledge Sources	Dotnet_Machinelearning
Domains	Topic_Modeling, NLP, Machine_Learning
Last Updated	2026-02-09 12:00 GMT

Overview

LdaEngine is the top-level native C++ class that orchestrates the entire Latent Dirichlet Allocation pipeline, including data ingestion, multi-threaded collapsed Gibbs sampling for training and inference, model serialization, and log-likelihood evaluation.

Description

The LdaEngine class lives in the lda namespace and serves as the single entry point for the ML.NET LDA implementation. It manages the complete lifecycle of topic model training and inference:

Data management: Accepts sparse or dense document-term data through FeedInData() and FeedInDataDense(), storing it in an internal LDADataBlock.
Model management: Allocates and manages the word-topic count matrix via LDAModelBlock, supporting three overloads of AllocateModelMemory() for different initialization scenarios (from data block, from nonzero count, or from explicit memory sizes).
Multi-threaded training: The Train() method spawns num_threads_ worker threads, each executing Training_Thread(). Each iteration involves: (1) building alias tables for the beta smoothing term and per-word distributions, (2) sampling all documents in the thread's partition via LightDocSampler::SampleOneDoc(), and (3) synchronizing word-topic deltas across threads using a barrier and mutex-protected global summary update.
Multi-threaded testing: The Test() method similarly spawns threads running Testing_Thread(), performing burn-in iterations of inference using LightDocSampler::InferOneDoc() while computing per-iteration log-likelihood.
Single-document inference: TestOneDoc() and TestOneDocDense() provide thread-safe single-document inference by drawing samplers from a CBlockedIntQueue, enabling concurrent calls from multiple C# threads.
Thread affinity: Training and testing threads set CPU affinity masks (platform-specific: Windows SetThreadAffinityMask, macOS THREAD_AFFINITY_POLICY, Linux sched_setaffinity) for improved cache locality.
Alpha adaptation: SetAlphaSum() multiplies the per-topic alpha by average document length, while LightDocSampler::AdaptAlphaSum() adjusts alpha between training (minimum 100) and inference (maximum 1) regimes.
Log-likelihood evaluation: EvalLogLikelihood() computes the complete data log-likelihood by summing document-topic, word-topic, and normalization terms across threads.

The native library is exposed to ML.NET managed code through a set of C-linkage exports defined in lda_engine_export.cpp, using the EXPORT_API macro for cross-platform shared library symbol visibility.

Usage

This engine is used when ML.NET's LatentDirichletAllocationEstimator needs to train or apply an LDA topic model. The managed C# code calls the exported C functions via P/Invoke, passing raw pointers to term IDs, frequencies, and output buffers. The typical workflow is:

Call CreateEngine with hyperparameters.
Call AllocateDataMemory for the corpus size.
Call FeedInData or FeedInDataDense for each document.
Call InitializeBeforeTrain to set up model and samplers.
Call Train to run collapsed Gibbs sampling.
Call GetModelStat, GetWordTopic to serialize the model.
For inference: AllocateModelMemory, SetWordTopic to load a model, then InitializeBeforeTest, and TestOneDoc/Test.

Code Reference

Source Location

Repository: Dotnet_Machinelearning
File: src/Native/LdaNative/lda_engine.cpp (1013 lines)
File: src/Native/LdaNative/lda_engine.hpp (138 lines)
File: src/Native/LdaNative/lda_engine_export.cpp (109 lines)

Signature

namespace lda {
    class LdaEngine {
    public:
        LdaEngine(int numTopic, int numVocab, float alphaSum, float beta,
                  int numIter, int likelihoodInterval, int numThread,
                  int mhstep, int maxDocToken);

        void AllocateDataMemory(int num_document, int64_t corpus_size);
        void AllocateModelMemory(const LDADataBlock& data_block);
        void AllocateModelMemory(int num_vocabs, int num_topics, int64_t nonzero_num);
        void AllocateModelMemory(int num_vocabs, int num_topics,
                                 int64_t mem_block_size, int64_t alias_mem_block_size);

        int FeedInData(int* term_id, int* term_freq, int32_t term_num, int32_t vocab_size);
        int FeedInDataDense(int* term_freq, int32_t term_num, int32_t vocab_size);
        void SetAlphaSum(float avgDocLength);

        bool InitializeBeforeTrain();
        void InitializeBeforeTest();

        void Train(const char* pTrainOutput = nullptr);
        void Test(int32_t burnin_iter, float* pLoglikelihood);

        void TestOneDoc(int* term_id, int* term_freq, int32_t term_num,
                        int* pTopics, int* pProbs, int32_t& numTopicsMax,
                        int32_t numBurnIter, bool reset);
        void TestOneDocDense(int* term_freq, int32_t term_num,
                             int* pTopics, int* pProbs, int32_t& numTopicsMax,
                             int32_t numBurnIter, bool reset);

        void GetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t& length);
        void SetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t length);
        void GetDocTopic(int docID, int* pTopic, int* pProb, int32_t& numTopicReturn);
        void GetTopicSummary(int32_t topicId, int32_t* pWords, float* pProb, int32_t& length);
        void GetModelStat(int64_t& memBlockSize, int64_t& aliasMemBlockSize);

        bool ClearData();
        bool ClearModel();
    };
}

C Export API (lda_engine_export.cpp)

EXPORT_API(LdaEngine*) CreateEngine(int numTopic, int numVocab, float alphaSum,
    float beta, int numIter, int likelihoodInterval, int numThread,
    int mhstep, int maxDocToken);
EXPORT_API(void) DestroyEngine(LdaEngine* engine);
EXPORT_API(void) AllocateModelMemory(LdaEngine* engine, int numTopic, int numVocab,
    int64_t tableSize, int64_t aliasTableSize);
EXPORT_API(void) AllocateDataMemory(LdaEngine* engine, int num_document, int64_t corpus_size);
EXPORT_API(void) Train(LdaEngine* engine, const char* trainOutput);
EXPORT_API(void) Test(LdaEngine* engine, int32_t burnin_iter, float* pLoglikelihood);
EXPORT_API(void) CleanData(LdaEngine* engine);
EXPORT_API(void) CleanModel(LdaEngine* engine);
EXPORT_API(void) GetModelStat(LdaEngine* engine, int64_t& memBlockSize, int64_t& aliasMemBlockSize);
EXPORT_API(void) GetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
    int32_t* pProb, int32_t& length);
EXPORT_API(void) SetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
    int32_t* pProb, int32_t length);
EXPORT_API(void) GetTopicSummary(LdaEngine* engine, int32_t topicId, int32_t* pWords,
    float* pProb, int32_t& length);
EXPORT_API(void) SetAlphaSum(LdaEngine* engine, float avgDocLength);
EXPORT_API(int) FeedInData(LdaEngine* engine, int* term_id, int* term_freq,
    int32_t term_num, int32_t vocab_size);
EXPORT_API(int) FeedInDataDense(LdaEngine* engine, int* term_freq,
    int32_t term_num, int32_t vocab_size);
EXPORT_API(void) GetDocTopic(LdaEngine* engine, int docID, int* pTopic,
    int* pProb, int32_t& numTopicReturn);
EXPORT_API(void) TestOneDoc(LdaEngine* engine, int* term_id, int* term_freq,
    int32_t term_num, int* pTopics, int* pProbs, int32_t& numTopicsMax,
    int32_t numBurnIter, bool reset);
EXPORT_API(void) TestOneDocDense(LdaEngine* engine, int* term_freq, int32_t term_num,
    int* pTopics, int* pProbs, int32_t& numTopicsMax, int32_t numBurnIter, bool reset);
EXPORT_API(void) InitializeBeforeTrain(LdaEngine* engine);
EXPORT_API(void) InitializeBeforeTest(LdaEngine* engine);

Import

// P/Invoke from C# managed code (ML.NET LdaNative interop)
[DllImport("LdaNative")]
private static extern SafeLdaEngineHandle CreateEngine(
    int numTopic, int numVocab, float alphaSum, float beta,
    int numIter, int likelihoodInterval, int numThread,
    int mhstep, int maxDocToken);

[DllImport("LdaNative")]
private static extern void DestroyEngine(IntPtr engine);

[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
    SafeLdaEngineHandle engine, int numDocument, long corpusSize);

[DllImport("LdaNative")]
private static extern int FeedInData(
    SafeLdaEngineHandle engine, int* termId, int* termFreq,
    int termNum, int vocabSize);

[DllImport("LdaNative")]
private static extern void InitializeBeforeTrain(SafeLdaEngineHandle engine);

[DllImport("LdaNative")]
private static extern void Train(SafeLdaEngineHandle engine, string trainOutput);

I/O Contract

Inputs

Name	Type	Required	Description
numTopic	int	Yes	Number of latent topics K. Controls model capacity.
numVocab	int	Yes	Vocabulary size V. Total number of unique terms.
alphaSum	float	Yes	Sum of Dirichlet document-topic prior. Multiplied by avgDocLength via SetAlphaSum.
beta	float	Yes	Dirichlet word-topic prior (symmetric). Typical values: 0.01-0.1. beta_sum = beta * V.
numIter	int	Yes	Number of Gibbs sampling iterations for training.
likelihoodInterval	int	Yes	Compute log-likelihood every N iterations. Set to -1 to disable.
numThread	int	Yes	Number of worker threads. If 0 or negative, defaults to hardware_concurrency() - 2.
mhstep	int	Yes	Number of Metropolis-Hastings steps per token in the Gibbs sampler.
maxDocToken	int	Yes	Maximum number of tokens per document. Used to pre-allocate per-thread document buffers (size = maxDocToken * 2 + 1).
term_id	int*	Yes (FeedInData)	Array of term IDs for a single document.
term_freq	int*	Yes	Array of term frequencies corresponding to term_id entries.
term_num	int32_t	Yes	Number of unique terms in the document being fed.

Outputs

Name	Type	Description
pTopics	int*	Output array of topic IDs for GetWordTopic/GetDocTopic/TestOneDoc.
pProbs	int/float	Output array of topic counts or probabilities corresponding to pTopics.
numTopicsMax	int32_t&	In/out: max topics to return on input, actual count on output.
pLoglikelihood	float*	Output array of per-iteration log-likelihood values from Test().
memBlockSize	int64_t&	Output: size of the model's word-topic memory block.
aliasMemBlockSize	int64_t&	Output: size of the alias table memory block.

Usage Examples

// Typical ML.NET LDA training workflow via P/Invoke
// Step 1: Create engine with hyperparameters
var engine = CreateEngine(
    numTopic: 100,    // K=100 topics
    numVocab: 50000,  // V=50000 vocabulary
    alphaSum: 100.0f, // will be adjusted by SetAlphaSum
    beta: 0.01f,      // symmetric Dirichlet prior
    numIter: 200,     // 200 Gibbs sampling iterations
    likelihoodInterval: 10, // log-likelihood every 10 iters
    numThread: 4,     // 4 worker threads
    mhstep: 4,        // 4 MH steps per token
    maxDocToken: 512  // max tokens per document
);

// Step 2: Allocate data memory
AllocateDataMemory(engine, numDocuments, corpusSize);

// Step 3: Feed in documents
foreach (var doc in documents)
{
    FeedInData(engine, doc.TermIds, doc.TermFreqs, doc.NumTerms, vocabSize);
}

// Step 4: Set alpha based on average document length
SetAlphaSum(engine, avgDocLength);

// Step 5: Initialize and train
InitializeBeforeTrain(engine);
Train(engine, null);

// Step 6: Extract trained model
long memSize, aliasMemSize;
GetModelStat(engine, ref memSize, ref aliasMemSize);

// Step 7: Single-document inference
int numTopics = 100;
TestOneDoc(engine, termIds, termFreqs, termNum,
           topicBuf, probBuf, ref numTopics, burnInIter: 100, reset: true);

DestroyEngine(engine);

Internal Architecture

Key Private Members

Member	Type	Description
K_	int32_t	Number of topics
V_	int32_t	Vocabulary size
beta_ / beta_sum_	float	Word-topic Dirichlet prior and its sum (beta * V)
alpha_sum_	float	Document-topic Dirichlet prior sum
global_word_topic_table_	vector<hybrid_map>	V-length array of word-topic count rows
global_alias_k_v_	vector<hybrid_alias_map>	V-length array of per-word alias tables
global_summary_row_	vector<int64_t>	K-length topic count summary (sum over all words)
data_block_	unique_ptr<LDADataBlock>	Holds all document-term data
model_block_	unique_ptr<LDAModelBlock>	Manages word-topic memory layout
samplers_	unique_ptr<unique_ptr<LightDocSampler>[]>	Per-thread Gibbs samplers
process_barrier_	unique_ptr<SimpleBarrier>	Thread synchronization barrier
samplerQueue_	unique_ptr<CBlockedIntQueue>	Thread-safe queue for TestOneDoc concurrency
document_buffer_	int32_t**	Per-thread document encoding buffers

Threading Model

Training and testing use a fork-join pattern: the main thread spawns N worker threads, each processing a disjoint partition of documents. Synchronization points use SimpleBarrier (barrier wait). The global word-topic table is updated in a partitioned manner: each thread owns a range of words (word_range_for_each_thread_), and word-topic deltas are sharded by word ID modulo num_threads. The global summary row uses a mutex for atomic updates.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment