Implementation:Dotnet Machinelearning LdaEngine
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, NLP, Machine_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
LdaEngine is the top-level native C++ class that orchestrates the entire Latent Dirichlet Allocation pipeline, including data ingestion, multi-threaded collapsed Gibbs sampling for training and inference, model serialization, and log-likelihood evaluation.
Description
The LdaEngine class lives in the lda namespace and serves as the single entry point for the ML.NET LDA implementation. It manages the complete lifecycle of topic model training and inference:
- Data management: Accepts sparse or dense document-term data through FeedInData() and FeedInDataDense(), storing it in an internal LDADataBlock.
- Model management: Allocates and manages the word-topic count matrix via LDAModelBlock, supporting three overloads of AllocateModelMemory() for different initialization scenarios (from data block, from nonzero count, or from explicit memory sizes).
- Multi-threaded training: The Train() method spawns num_threads_ worker threads, each executing Training_Thread(). Each iteration involves: (1) building alias tables for the beta smoothing term and per-word distributions, (2) sampling all documents in the thread's partition via LightDocSampler::SampleOneDoc(), and (3) synchronizing word-topic deltas across threads using a barrier and mutex-protected global summary update.
- Multi-threaded testing: The Test() method similarly spawns threads running Testing_Thread(), performing burn-in iterations of inference using LightDocSampler::InferOneDoc() while computing per-iteration log-likelihood.
- Single-document inference: TestOneDoc() and TestOneDocDense() provide thread-safe single-document inference by drawing samplers from a CBlockedIntQueue, enabling concurrent calls from multiple C# threads.
- Thread affinity: Training and testing threads set CPU affinity masks (platform-specific: Windows SetThreadAffinityMask, macOS THREAD_AFFINITY_POLICY, Linux sched_setaffinity) for improved cache locality.
- Alpha adaptation: SetAlphaSum() multiplies the per-topic alpha by average document length, while LightDocSampler::AdaptAlphaSum() adjusts alpha between training (minimum 100) and inference (maximum 1) regimes.
- Log-likelihood evaluation: EvalLogLikelihood() computes the complete data log-likelihood by summing document-topic, word-topic, and normalization terms across threads.
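The single-document inference path above borrows a sampler from a blocking queue so that concurrent callers never share one. The following is a minimal sketch of that borrow-and-return pattern. `BlockingIndexQueue` and `SamplerLease` are hypothetical names introduced for illustration; the actual interface of CBlockedIntQueue is not shown on this page and may differ.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical stand-in for CBlockedIntQueue: a blocking queue of sampler indices.
class BlockingIndexQueue {
public:
    void Push(int index) {
        assert(index >= 0);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(index);
        }
        ready_.notify_one();
    }

    // Blocks until an index is available, then removes and returns it.
    int Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        const int index = items_.front();
        items_.pop_front();
        return index;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<int> items_;
};

// Hypothetical RAII lease: takes a sampler index on construction and
// returns it to the queue on destruction, even if inference throws.
class SamplerLease {
public:
    explicit SamplerLease(BlockingIndexQueue& queue)
        : queue_(queue), index_(queue.Pop()) {}
    ~SamplerLease() { queue_.Push(index_); }
    SamplerLease(const SamplerLease&) = delete;
    SamplerLease& operator=(const SamplerLease&) = delete;

    int index() const { return index_; }

private:
    BlockingIndexQueue& queue_;
    int index_;
};
```

With one index per sampler pushed at start-up, a caller that finds the queue empty simply waits until another caller's lease ends, which bounds concurrency at the number of samplers.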
The native library is exposed to ML.NET managed code through a set of C-linkage exports defined in lda_engine_export.cpp, using the EXPORT_API macro for cross-platform shared library symbol visibility.
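The C-linkage export pattern can be sketched as follows. This is an illustration under stated assumptions, not the contents of lda_engine_export.cpp: `SKETCH_EXPORT`, `ToyEngine`, and the `Toy*` functions are hypothetical, and the real EXPORT_API macro may add a calling convention or differ in other details.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro in the spirit of EXPORT_API: C linkage (no name mangling)
// plus platform-specific symbol visibility.
#if defined(_MSC_VER)
#define SKETCH_EXPORT(ret) extern "C" __declspec(dllexport) ret
#else
#define SKETCH_EXPORT(ret) extern "C" __attribute__((visibility("default"))) ret
#endif

// Hypothetical C++ class standing in for the engine.
class ToyEngine {
public:
    explicit ToyEngine(int num_topic) : num_topic_(num_topic) {}
    int32_t NumTopic() const { return num_topic_; }

private:
    int32_t num_topic_;
};

// Each export is a thin wrapper: the managed side holds an opaque pointer
// and passes it back as the first argument of every call.
SKETCH_EXPORT(ToyEngine*) ToyCreateEngine(int numTopic) {
    return new ToyEngine(numTopic);
}

SKETCH_EXPORT(int32_t) ToyGetNumTopic(ToyEngine* engine) {
    assert(engine != nullptr);
    return engine->NumTopic();
}

SKETCH_EXPORT(void) ToyDestroyEngine(ToyEngine* engine) {
    delete engine;
}
```

Keeping the wrappers free of logic means the P/Invoke signatures only need to mirror plain C types, while ownership stays explicit through the create and destroy pair.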
Usage
This engine is used when ML.NET's LatentDirichletAllocationEstimator needs to train or apply an LDA topic model. The managed C# code calls the exported C functions via P/Invoke, passing raw pointers to term IDs, frequencies, and output buffers. The typical workflow is:
- Call CreateEngine with hyperparameters.
- Call AllocateDataMemory for the corpus size.
- Call FeedInData or FeedInDataDense for each document.
- Call InitializeBeforeTrain to set up model and samplers.
- Call Train to run collapsed Gibbs sampling.
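- Call GetModelStat to obtain the model's memory sizes, then GetWordTopic for each word to read out the model for serialization.
- For inference: call AllocateModelMemory and SetWordTopic to load a saved model, then InitializeBeforeTest, followed by TestOneDoc (single document) or Test (whole data block).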
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/LdaNative/lda_engine.cpp (1013 lines)
- File: src/Native/LdaNative/lda_engine.hpp (138 lines)
- File: src/Native/LdaNative/lda_engine_export.cpp (109 lines)
Signature
namespace lda {
class LdaEngine {
public:
LdaEngine(int numTopic, int numVocab, float alphaSum, float beta,
int numIter, int likelihoodInterval, int numThread,
int mhstep, int maxDocToken);
void AllocateDataMemory(int num_document, int64_t corpus_size);
void AllocateModelMemory(const LDADataBlock& data_block);
void AllocateModelMemory(int num_vocabs, int num_topics, int64_t nonzero_num);
void AllocateModelMemory(int num_vocabs, int num_topics,
int64_t mem_block_size, int64_t alias_mem_block_size);
int FeedInData(int* term_id, int* term_freq, int32_t term_num, int32_t vocab_size);
int FeedInDataDense(int* term_freq, int32_t term_num, int32_t vocab_size);
void SetAlphaSum(float avgDocLength);
bool InitializeBeforeTrain();
void InitializeBeforeTest();
void Train(const char* pTrainOutput = nullptr);
void Test(int32_t burnin_iter, float* pLoglikelihood);
void TestOneDoc(int* term_id, int* term_freq, int32_t term_num,
int* pTopics, int* pProbs, int32_t& numTopicsMax,
int32_t numBurnIter, bool reset);
void TestOneDocDense(int* term_freq, int32_t term_num,
int* pTopics, int* pProbs, int32_t& numTopicsMax,
int32_t numBurnIter, bool reset);
void GetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t& length);
void SetWordTopic(int32_t wordId, int32_t* pTopic, int32_t* pProb, int32_t length);
void GetDocTopic(int docID, int* pTopic, int* pProb, int32_t& numTopicReturn);
void GetTopicSummary(int32_t topicId, int32_t* pWords, float* pProb, int32_t& length);
void GetModelStat(int64_t& memBlockSize, int64_t& aliasMemBlockSize);
bool ClearData();
bool ClearModel();
};
}
C Export API (lda_engine_export.cpp)
EXPORT_API(LdaEngine*) CreateEngine(int numTopic, int numVocab, float alphaSum,
float beta, int numIter, int likelihoodInterval, int numThread,
int mhstep, int maxDocToken);
EXPORT_API(void) DestroyEngine(LdaEngine* engine);
EXPORT_API(void) AllocateModelMemory(LdaEngine* engine, int numTopic, int numVocab,
int64_t tableSize, int64_t aliasTableSize);
EXPORT_API(void) AllocateDataMemory(LdaEngine* engine, int num_document, int64_t corpus_size);
EXPORT_API(void) Train(LdaEngine* engine, const char* trainOutput);
EXPORT_API(void) Test(LdaEngine* engine, int32_t burnin_iter, float* pLoglikelihood);
EXPORT_API(void) CleanData(LdaEngine* engine);
EXPORT_API(void) CleanModel(LdaEngine* engine);
EXPORT_API(void) GetModelStat(LdaEngine* engine, int64_t& memBlockSize, int64_t& aliasMemBlockSize);
EXPORT_API(void) GetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
int32_t* pProb, int32_t& length);
EXPORT_API(void) SetWordTopic(LdaEngine* engine, int32_t wordId, int32_t* pTopic,
int32_t* pProb, int32_t length);
EXPORT_API(void) GetTopicSummary(LdaEngine* engine, int32_t topicId, int32_t* pWords,
float* pProb, int32_t& length);
EXPORT_API(void) SetAlphaSum(LdaEngine* engine, float avgDocLength);
EXPORT_API(int) FeedInData(LdaEngine* engine, int* term_id, int* term_freq,
int32_t term_num, int32_t vocab_size);
EXPORT_API(int) FeedInDataDense(LdaEngine* engine, int* term_freq,
int32_t term_num, int32_t vocab_size);
EXPORT_API(void) GetDocTopic(LdaEngine* engine, int docID, int* pTopic,
int* pProb, int32_t& numTopicReturn);
EXPORT_API(void) TestOneDoc(LdaEngine* engine, int* term_id, int* term_freq,
int32_t term_num, int* pTopics, int* pProbs, int32_t& numTopicsMax,
int32_t numBurnIter, bool reset);
EXPORT_API(void) TestOneDocDense(LdaEngine* engine, int* term_freq, int32_t term_num,
int* pTopics, int* pProbs, int32_t& numTopicsMax, int32_t numBurnIter, bool reset);
EXPORT_API(void) InitializeBeforeTrain(LdaEngine* engine);
EXPORT_API(void) InitializeBeforeTest(LdaEngine* engine);
Import
// P/Invoke from C# managed code (ML.NET LdaNative interop)
[DllImport("LdaNative")]
private static extern SafeLdaEngineHandle CreateEngine(
int numTopic, int numVocab, float alphaSum, float beta,
int numIter, int likelihoodInterval, int numThread,
int mhstep, int maxDocToken);
[DllImport("LdaNative")]
private static extern void DestroyEngine(IntPtr engine);
[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
SafeLdaEngineHandle engine, int numDocument, long corpusSize);
[DllImport("LdaNative")]
private static extern int FeedInData(
SafeLdaEngineHandle engine, int* termId, int* termFreq,
int termNum, int vocabSize);
[DllImport("LdaNative")]
private static extern void InitializeBeforeTrain(SafeLdaEngineHandle engine);
[DllImport("LdaNative")]
private static extern void Train(SafeLdaEngineHandle engine, string trainOutput);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| numTopic | int | Yes | Number of latent topics K. Controls model capacity. |
| numVocab | int | Yes | Vocabulary size V. Total number of unique terms. |
| alphaSum | float | Yes | Sum of Dirichlet document-topic prior. Multiplied by avgDocLength via SetAlphaSum. |
| beta | float | Yes | Dirichlet word-topic prior (symmetric). Typical values: 0.01-0.1. beta_sum = beta * V. |
| numIter | int | Yes | Number of Gibbs sampling iterations for training. |
| likelihoodInterval | int | Yes | Compute log-likelihood every N iterations. Set to -1 to disable. |
| numThread | int | Yes | Number of worker threads. If 0 or negative, defaults to hardware_concurrency() - 2. |
| mhstep | int | Yes | Number of Metropolis-Hastings steps per token in the Gibbs sampler. |
| maxDocToken | int | Yes | Maximum number of tokens per document. Used to pre-allocate per-thread document buffers (size = maxDocToken * 2 + 1). |
| term_id | int* | Yes (FeedInData) | Array of term IDs for a single document. |
| term_freq | int* | Yes | Array of term frequencies corresponding to term_id entries. |
| term_num | int32_t | Yes | Number of unique terms in the document being fed. |
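Several rows in the table above describe values derived from the constructor arguments. The sketch below restates those rules as small functions. The function names are hypothetical, and clamping the thread count to at least 1 is an assumption of this sketch rather than something the table states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>

// numThread rule: a positive request is used as-is; otherwise fall back to
// hardware_concurrency() - 2. The lower bound of 1 is this sketch's assumption.
inline int ResolveThreadCount(int requested, unsigned hardware_threads) {
    if (requested > 0) return requested;
    return std::max(1, static_cast<int>(hardware_threads) - 2);
}

// Convenience overload that queries the machine.
inline int ResolveThreadCount(int requested) {
    return ResolveThreadCount(requested, std::thread::hardware_concurrency());
}

// maxDocToken rule: per-thread document buffer length is maxDocToken * 2 + 1.
inline int64_t DocBufferLength(int max_doc_token) {
    assert(max_doc_token >= 0);
    return static_cast<int64_t>(max_doc_token) * 2 + 1;
}

// beta rule: beta_sum = beta * V for a symmetric word-topic prior.
inline float BetaSum(float beta, int num_vocab) {
    return beta * static_cast<float>(num_vocab);
}
```

For example, with maxDocToken = 512 the buffer length is 1025 entries per thread, and on an 8-thread machine a numThread of 0 resolves to 6 workers.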
Outputs
| Name | Type | Description |
|---|---|---|
| pTopics | int* | Output array of topic IDs for GetWordTopic/GetDocTopic/TestOneDoc. |
| pProbs | int*/float* | Output array of topic counts or probabilities corresponding to pTopics. |
| numTopicsMax | int32_t& | In/out: max topics to return on input, actual count on output. |
| pLoglikelihood | float* | Output array of per-iteration log-likelihood values from Test(). |
| memBlockSize | int64_t& | Output: size of the model's word-topic memory block. |
| aliasMemBlockSize | int64_t& | Output: size of the alias table memory block. |
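The in/out convention for numTopicsMax (capacity on input, actual count on output) can be illustrated with a small helper. `CopyTopicCounts` is a hypothetical function written for this page; whether the engine truncates in exactly this way when the buffer is too small is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: copies sparse (topic, count) pairs into caller-owned
// buffers. On entry numTopicsMax is the buffer capacity; on return it holds
// the number of entries actually written.
inline void CopyTopicCounts(
    const std::vector<std::pair<int32_t, int32_t>>& topic_counts,
    int32_t* pTopics, int32_t* pProbs, int32_t& numTopicsMax) {
    assert(numTopicsMax >= 0);
    const int32_t available = static_cast<int32_t>(topic_counts.size());
    const int32_t n = std::min(numTopicsMax, available);
    for (int32_t i = 0; i < n; ++i) {
        pTopics[i] = topic_counts[i].first;
        pProbs[i] = topic_counts[i].second;
    }
    numTopicsMax = n;  // report how many entries are valid
}
```

A caller therefore sizes both buffers to the value it passes in and reads back only the first numTopicsMax entries after the call.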
Usage Examples
// Typical ML.NET LDA training workflow via P/Invoke
// Step 1: Create engine with hyperparameters
var engine = CreateEngine(
numTopic: 100, // K=100 topics
numVocab: 50000, // V=50000 vocabulary
alphaSum: 100.0f, // will be adjusted by SetAlphaSum
beta: 0.01f, // symmetric Dirichlet prior
numIter: 200, // 200 Gibbs sampling iterations
likelihoodInterval: 10, // log-likelihood every 10 iters
numThread: 4, // 4 worker threads
mhstep: 4, // 4 MH steps per token
maxDocToken: 512 // max tokens per document
);
// Step 2: Allocate data memory
AllocateDataMemory(engine, numDocuments, corpusSize);
// Step 3: Feed in documents
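// FeedInData takes int* arguments (see the declaration above), so managed
// arrays must be pinned inside an unsafe context before the call.
foreach (var doc in documents)
{
fixed (int* pIds = doc.TermIds, pFreqs = doc.TermFreqs)
FeedInData(engine, pIds, pFreqs, doc.NumTerms, vocabSize);
}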
// Step 4: Set alpha based on average document length
SetAlphaSum(engine, avgDocLength);
// Step 5: Initialize and train
InitializeBeforeTrain(engine);
Train(engine, null);
// Step 6: Extract trained model
long memSize = 0, aliasMemSize = 0; // ref arguments must be assigned before the call
GetModelStat(engine, ref memSize, ref aliasMemSize);
// Step 7: Single-document inference
int numTopics = 100; // in: capacity of topicBuf/probBuf; out: number of topics written
TestOneDoc(engine, termIds, termFreqs, termNum,
topicBuf, probBuf, ref numTopics, numBurnIter: 100, reset: true);
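// DestroyEngine is declared with IntPtr, so it cannot take the SafeLdaEngineHandle directly.
// Disposing the handle releases the native engine (assuming its ReleaseHandle calls DestroyEngine).
engine.Dispose();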
Internal Architecture
Key Private Members
| Member | Type | Description |
|---|---|---|
| K_ | int32_t | Number of topics |
| V_ | int32_t | Vocabulary size |
| beta_ / beta_sum_ | float | Word-topic Dirichlet prior and its sum (beta * V) |
| alpha_sum_ | float | Document-topic Dirichlet prior sum |
| global_word_topic_table_ | vector<hybrid_map> | V-length array of word-topic count rows |
| global_alias_k_v_ | vector<hybrid_alias_map> | V-length array of per-word alias tables |
| global_summary_row_ | vector<int64_t> | K-length topic count summary (sum over all words) |
| data_block_ | unique_ptr<LDADataBlock> | Holds all document-term data |
| model_block_ | unique_ptr<LDAModelBlock> | Manages word-topic memory layout |
| samplers_ | unique_ptr<unique_ptr<LightDocSampler>[]> | Per-thread Gibbs samplers |
| process_barrier_ | unique_ptr<SimpleBarrier> | Thread synchronization barrier |
| samplerQueue_ | unique_ptr<CBlockedIntQueue> | Thread-safe queue for TestOneDoc concurrency |
| document_buffer_ | int32_t** | Per-thread document encoding buffers |
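The per-word alias tables held in global_alias_k_v_ allow a topic to be drawn in constant time once the table is built. The sketch below shows the textbook (Vose) alias construction in floating point. It is not the engine's implementation, whose storage layout and arithmetic may differ; `AliasTable`, `BuildAliasTable`, and `SampleAlias` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias table: bucket i yields i with probability prob[i],
// otherwise alias[i].
struct AliasTable {
    std::vector<double> prob;
    std::vector<int> alias;
};

// Builds the table in O(n) from non-negative weights with a positive sum.
inline AliasTable BuildAliasTable(const std::vector<double>& weights) {
    const std::size_t n = weights.size();
    assert(n > 0);
    double total = 0.0;
    for (double w : weights) total += w;
    assert(total > 0.0);

    AliasTable table{std::vector<double>(n, 1.0), std::vector<int>(n, 0)};
    std::vector<double> scaled(n);
    std::vector<int> small, large;
    for (std::size_t i = 0; i < n; ++i) {
        table.alias[i] = static_cast<int>(i);
        scaled[i] = weights[i] * static_cast<double>(n) / total;
        (scaled[i] < 1.0 ? small : large).push_back(static_cast<int>(i));
    }
    // Pair each under-full bucket with an over-full one.
    while (!small.empty() && !large.empty()) {
        const int s = small.back(); small.pop_back();
        const int l = large.back(); large.pop_back();
        table.prob[s] = scaled[s];
        table.alias[s] = l;
        scaled[l] = (scaled[l] + scaled[s]) - 1.0;
        (scaled[l] < 1.0 ? small : large).push_back(l);
    }
    // Buckets left over keep prob = 1.0 and alias = self.
    return table;
}

// Draws one outcome from two uniform variates in [0, 1).
inline int SampleAlias(const AliasTable& table, double u_bucket, double u_accept) {
    const std::size_t n = table.prob.size();
    std::size_t i = static_cast<std::size_t>(u_bucket * static_cast<double>(n));
    if (i >= n) i = n - 1;
    return u_accept < table.prob[i] ? static_cast<int>(i) : table.alias[i];
}
```

Building costs O(K) per word, which is why the tables are rebuilt once per iteration and then reused for every token of that word during sampling.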
Threading Model
Training and testing use a fork-join pattern: the main thread spawns N worker threads, each of which processes a disjoint partition of the documents, and then joins them. Workers meet at SimpleBarrier wait points between phases. The global word-topic table is updated in a partitioned manner: each thread owns a range of words (word_range_for_each_thread_), and word-topic deltas are sharded by word ID modulo num_threads, so no two threads write the same word row. Updates to the global summary row are serialized with a mutex.
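The fork-join structure described above can be sketched with standard threads. This is a simplified illustration, not the engine's code: `SketchBarrier` is a hypothetical stand-in for SimpleBarrier, documents are assigned to threads by stride (an assumption), and the "sampling" step merely counts a fixed topic per document.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for SimpleBarrier: reusable barrier for a fixed thread count.
class SketchBarrier {
public:
    explicit SketchBarrier(int count) : threshold_(count) { assert(count > 0); }

    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::uint64_t my_generation = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;  // open the barrier for this round
            released_.notify_all();
        } else {
            released_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    const int threshold_;
    int waiting_ = 0;
    std::uint64_t generation_ = 0;
};

// Each worker accumulates local topic counts over its documents, merges them
// into the shared summary under a mutex, then waits at the barrier.
inline std::vector<int64_t> RunForkJoin(int num_threads, int num_topics,
                                        int num_iterations,
                                        const std::vector<int32_t>& doc_topics) {
    std::vector<int64_t> summary(static_cast<std::size_t>(num_topics), 0);
    std::mutex summary_mutex;
    SketchBarrier barrier(num_threads);

    auto worker = [&](int tid) {
        for (int iter = 0; iter < num_iterations; ++iter) {
            std::vector<int64_t> local(summary.size(), 0);
            for (std::size_t d = static_cast<std::size_t>(tid);
                 d < doc_topics.size(); d += static_cast<std::size_t>(num_threads)) {
                assert(doc_topics[d] >= 0 && doc_topics[d] < num_topics);
                ++local[static_cast<std::size_t>(doc_topics[d])];
            }
            {
                std::lock_guard<std::mutex> lock(summary_mutex);
                for (std::size_t k = 0; k < summary.size(); ++k) summary[k] += local[k];
            }
            barrier.Wait();  // all workers finish the iteration together
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return summary;
}
```

Accumulating into a thread-local vector first keeps the mutex-protected section short: each worker takes the lock once per iteration rather than once per token.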
Related Pages
- Principle:Dotnet_Machinelearning_Latent_Dirichlet_Allocation
- Principle:Dotnet_Machinelearning_Alias_Method_Sampling
- Implementation:Dotnet_Machinelearning_LdaDocumentSampler
- Implementation:Dotnet_Machinelearning_LdaDataBlock
- Implementation:Dotnet_Machinelearning_LdaModelBlock
- Implementation:Dotnet_Machinelearning_LdaHybridMap
- Implementation:Dotnet_Machinelearning_LdaHybridAliasMap
- Implementation:Dotnet_Machinelearning_AliasMultinomialRng
- Environment:Dotnet_Machinelearning_Native_Build_Toolchain