Implementation:Dotnet Machinelearning LdaModelBlock

From Leeroopedia


Knowledge Sources
Domains Topic_Modeling, Memory_Management, Data_Structures
Last Updated 2026-02-09 12:00 GMT

Overview

LDAModelBlock manages the memory layout and allocation for the LDA word-topic count matrix and its corresponding alias tables, using a hybrid dense/sparse representation per word based on term frequency.

Description

The LDAModelBlock class (in the lda namespace) is responsible for allocating and organizing the two large memory blocks that hold all word-topic data:

  • mem_block_: A single contiguous int32_t array storing all word-topic count data. Each word gets a slice of this block, sized according to whether it uses dense or sparse storage.
  • alias_mem_block_: A separate contiguous int32_t array storing all alias table data. Similarly partitioned per word.

Per-word metadata (WordEntry struct):

struct WordEntry {
    int32_t word_id_;         // Word index in vocabulary
    int64_t offset_;          // Start offset in mem_block_
    int64_t end_offset_;      // End offset in mem_block_
    int32_t capacity_;        // Dense: K, Sparse: power-of-2 hash table size
    int32_t is_dense_;        // 1 = dense array, 0 = sparse hash table

    int32_t tf;               // Term frequency (used for sizing decisions)
    int64_t alias_offset_;    // Start offset in alias_mem_block_
    int64_t alias_end_offset_;// End offset in alias_mem_block_
    int32_t alias_capacity_;  // Dense: K, Sparse: tf
    int32_t is_alias_dense_;  // 1 = dense alias, 0 = sparse alias
};

Sizing decisions (K = num_topics_, load_factor_ = 2; sparse_factor_ = 5 is also declared but does not appear in the rules listed here):

  • Word-topic row:
    • If tf >= num_topics_ / (2 * load_factor_), i.e. tf >= K / 4 with integer division: dense, capacity = K, row_size = K.
    • If 0 < tf < K / 4: sparse, capacity = upper_bound(load_factor_ * tf) (next power of 2), row_size = 2 * capacity (keys + values).
    • If tf == 0: capacity = 0, row_size = 0.
  • Alias table row:
    • If tf >= (num_topics_ * 2) / 3, i.e. tf >= 2K / 3 with integer division: dense alias, capacity = K, row_size = 2*K (K pairs of [alias, boundary]).
    • If 0 < tf < 2K / 3: sparse alias, capacity = tf, row_size = 3*tf (tf pairs of [alias, boundary] + tf indirection indices).
    • If tf == 0: capacity = 0, row_size = 0.
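The sizing rules can be sketched as two small functions. This is a minimal illustration, not the library's code: next_pow2, RowSize, word_topic_row, and alias_row are hypothetical names, and next_pow2 is only an assumed reading of upper_bound() as "smallest power of two that is >= x". The cases are checked in the order the rules are listed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for upper_bound(): smallest power of two >= x.
inline int32_t next_pow2(int32_t x) {
    assert(x >= 1);
    int32_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Result of a sizing decision; row_size is counted in int32_t slots.
struct RowSize {
    bool is_dense;
    int32_t capacity;
    int32_t row_size;
};

constexpr int32_t kLoadFactor = 2;  // mirrors load_factor_

// Word-topic row: dense once tf reaches K / (2 * load_factor).
inline RowSize word_topic_row(int32_t tf, int32_t num_topics) {
    assert(tf >= 0 && num_topics > 0);
    if (tf >= num_topics / (2 * kLoadFactor)) {
        return {true, num_topics, num_topics};
    }
    if (tf > 0) {
        const int32_t cap = next_pow2(kLoadFactor * tf);
        return {false, cap, 2 * cap};  // keys + values
    }
    return {false, 0, 0};
}

// Alias row: dense once tf reaches (2 * K) / 3.
inline RowSize alias_row(int32_t tf, int32_t num_topics) {
    assert(tf >= 0 && num_topics > 0);
    if (tf >= (num_topics * 2) / 3) {
        return {true, num_topics, 2 * num_topics};  // K [alias, boundary] pairs
    }
    if (tf > 0) {
        return {false, tf, 3 * tf};  // tf pairs + tf indirection indices
    }
    return {false, 0, 0};
}
```

With K = 1000, a word with tf = 100 gets a sparse word-topic row of capacity 256 (512 slots) and a sparse alias row of 300 slots, while a word with tf = 250 already gets a dense word-topic row of 1000 slots.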

Three Init overloads:

  1. Init(num_vocabs, num_topics): Allocates only the dictionary; actual memory blocks are allocated later by InitModelBlockByTFS() after term frequencies are computed.
  2. Init(num_vocabs, num_topics, nonzero_num): Allocates memory blocks sized by total nonzero count. mem_block_size = 2 * upper_bound(2 * nonzero_num), alias_mem_block_size = 3 * nonzero_num.
  3. Init(num_vocabs, num_topics, mem_block_size, alias_mem_block_size): Allocates memory blocks with explicit sizes (used when restoring a saved model).
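The size formulas of the second overload can be written out as follows. This is a sketch under assumptions: next_pow2_64, BlockSizes, and sizes_from_nonzero are hypothetical names, next_pow2_64 is an assumed reading of upper_bound(), and returning zero sizes for nonzero_num == 0 is a choice made here, not something the source states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for upper_bound(): smallest power of two >= x.
inline int64_t next_pow2_64(int64_t x) {
    assert(x >= 1);
    int64_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Sizes are counted in int32_t slots, not bytes.
struct BlockSizes {
    int64_t mem_block_size;
    int64_t alias_mem_block_size;
};

// mem_block_size = 2 * upper_bound(2 * nonzero_num)
// alias_mem_block_size = 3 * nonzero_num
inline BlockSizes sizes_from_nonzero(int64_t nonzero_num) {
    assert(nonzero_num >= 0);
    if (nonzero_num == 0) return {0, 0};  // assumption: empty model needs no storage
    return {2 * next_pow2_64(2 * nonzero_num), 3 * nonzero_num};
}
```

For example, 1000 nonzero entries give a word-topic block of 4096 slots (2 * 2048) and an alias block of 3000 slots.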

InitFromDataBlock(): Scans all documents in a LDADataBlock to compute per-word term frequencies, then calls InitModelBlockByTFS(false) to allocate with hybrid storage.

SetWordInfo(): Sets per-word metadata for the model loading path. Computes dense/sparse classification and offsets incrementally, allowing the caller to set word-topic data afterward.
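The incremental offset bookkeeping can be illustrated with a pair of running cursors. This is a simplified sketch, not the class itself: Slot and OffsetCursor are hypothetical names, and the row sizes are passed in directly rather than derived from nonzero_num and fullSparse.

```cpp
#include <cassert>
#include <cstdint>

// Offsets assigned to one word, mirroring the four offset fields of WordEntry.
struct Slot {
    int64_t offset;
    int64_t end_offset;
    int64_t alias_offset;
    int64_t alias_end_offset;
};

// Hands out consecutive, non-overlapping slices of both memory blocks.
class OffsetCursor {
public:
    Slot place(int64_t row_size, int64_t alias_row_size) {
        assert(row_size >= 0 && alias_row_size >= 0);
        Slot s{offset_, offset_ + row_size,
               alias_offset_, alias_offset_ + alias_row_size};
        offset_ = s.end_offset;              // next word starts where this one ends
        alias_offset_ = s.alias_end_offset;
        return s;
    }

private:
    int64_t offset_ = 0;        // current write offset in the word-topic block
    int64_t alias_offset_ = 0;  // current write offset in the alias block
};
```

Because each call only advances the cursors, words must be registered in the same order in which their slices should appear in the blocks.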

GetModelStat(): Computes the memory sizes required to serialize the current model. It calls CountNonZero() to scan all word-topic rows, then GetModelSizeByTFS() with fullSparse = true, since testing always uses the sparse representation.
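A rough sketch of this two-step computation follows. It assumes that fullSparse = true applies the sparse sizing rules to every word with a nonzero count, and it reads rows from plain dense vectors for simplicity; next_pow2, count_nonzero, ModelStat, and full_sparse_stat are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for upper_bound(): smallest power of two >= x.
inline int64_t next_pow2(int64_t x) {
    assert(x >= 1);
    int64_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Step 1: number of topics with a nonzero count, per word.
inline std::vector<int32_t> count_nonzero(
        const std::vector<std::vector<int32_t>>& rows) {
    std::vector<int32_t> nz;
    nz.reserve(rows.size());
    for (const auto& row : rows) {
        int32_t n = 0;
        for (int32_t c : row) {
            if (c != 0) ++n;
        }
        nz.push_back(n);
    }
    return nz;
}

struct ModelStat {
    int64_t mem_block_size;
    int64_t alias_mem_block_size;
};

// Step 2: sum the sparse row sizes over all words (load factor 2).
inline ModelStat full_sparse_stat(const std::vector<int32_t>& nz) {
    ModelStat s{0, 0};
    for (int32_t n : nz) {
        assert(n >= 0);
        if (n == 0) continue;  // empty rows take no space
        s.mem_block_size += 2 * next_pow2(2 * static_cast<int64_t>(n));
        s.alias_mem_block_size += 3 * static_cast<int64_t>(n);
    }
    return s;
}
```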

Row access:

  • get_row(word_id, external_buf): Returns a hybrid_map view of the word-topic row at mem_block_ + dict_[word_id].offset_.
  • get_alias_row(word_id): Returns a hybrid_alias_map view at alias_mem_block_ + dict_[word_id].alias_offset_.
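Both accessors return non-owning views: a pointer into the shared block plus the metadata needed to interpret it. The sketch below shows that idea only; RowView, Entry, and row_view are hypothetical names, and the real hybrid_map and hybrid_alias_map constructors are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Subset of WordEntry needed to locate a row.
struct Entry {
    int64_t offset;
    int32_t capacity;
    bool is_dense;
};

// Non-owning view of one word's slice of the block.
struct RowView {
    int32_t* data;
    int32_t capacity;
    bool is_dense;
};

// The view starts at base + offset; no data is copied.
inline RowView row_view(int32_t* mem_block, const Entry& e) {
    assert(mem_block != nullptr && e.offset >= 0);
    return {mem_block + e.offset, e.capacity, e.is_dense};
}
```

Writes through the view land directly in the underlying block, which is why the views stay valid only as long as the block is not reallocated or cleared.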

Usage

Created and owned by LdaEngine. During training, InitFromDataBlock() or Init() allocates memory, then get_row() provides hybrid_map views to the global_word_topic_table_. During model save/load, GetModelStat() and SetWordInfo() manage serialization.

Code Reference

Source Location

Signature

namespace lda {
    struct WordEntry {
        int32_t word_id_;
        int64_t offset_;
        int64_t end_offset_;
        int32_t capacity_;
        int32_t is_dense_;
        int32_t tf;
        int64_t alias_offset_;
        int64_t alias_end_offset_;
        int32_t alias_capacity_;
        int32_t is_alias_dense_;
    };

    class LDAModelBlock {
    public:
        LDAModelBlock();
        ~LDAModelBlock();

        void Init(int32_t num_vocabs, int32_t num_topics);
        void Init(int32_t num_vocabs, int32_t num_topics, int64_t nonzero_num);
        void Init(int32_t num_vocabs, int32_t num_topics,
                  int64_t mem_block_size, int64_t alias_mem_block_size);

        void InitFromDataBlock(const LDADataBlock& data_block,
                               int32_t num_vocabs, int32_t num_topics);

        void SetWordInfo(int word_id, int32_t nonzero_num, bool fullSparse);
        void GetModelStat(int64_t& mem_block_size, int64_t& alias_mem_block_size);
        void Clear();

        hybrid_map get_row(int word_id, int32_t* external_buf);
        hybrid_alias_map get_alias_row(int word_id);

    private:
        void CountNonZero(std::vector<int32_t>& tfs);
        void InitModelBlockByTFS(bool fullSparse);
        void GetModelSizeByTFS(bool fullSparse, std::vector<int32_t>& tfs,
                               int64_t& mem_block_size, int64_t& alias_mem_block_size);

        int32_t num_vocabs_;
        int32_t num_topics_;
        WordEntry* dict_;           // V entries
        int32_t* mem_block_;        // word-topic count storage
        size_t mem_block_size_;
        int32_t* alias_mem_block_;  // alias table storage
        size_t alias_mem_block_size_;
        int64_t offset_;            // current write offset for SetWordInfo
        int64_t alias_offset_;      // current alias write offset for SetWordInfo

        const int32_t load_factor_ = 2;   // hash table load factor
        const int32_t sparse_factor_ = 5; // unused (legacy)
    };
}

Import

// LDAModelBlock is internal. Model data is accessed through exported C functions:
[DllImport("LdaNative")]
private static extern void AllocateModelMemory(
    SafeLdaEngineHandle engine, int numTopic, int numVocab,
    long tableSize, long aliasTableSize);

[DllImport("LdaNative")]
private static extern void GetModelStat(
    SafeLdaEngineHandle engine, ref long memBlockSize, ref long aliasMemBlockSize);

[DllImport("LdaNative")]
private static extern void GetWordTopic(
    SafeLdaEngineHandle engine, int wordId, int* pTopic, int* pProb, ref int length);

[DllImport("LdaNative")]
private static extern void SetWordTopic(
    SafeLdaEngineHandle engine, int wordId, int* pTopic, int* pProb, int length);

I/O Contract

Inputs

Name | Type | Required | Description
num_vocabs | int32_t | Yes | Vocabulary size V
num_topics | int32_t | Yes | Number of topics K
nonzero_num | int64_t | Yes (Init overload 2) | Total nonzero entries across all word-topic rows
mem_block_size | int64_t | Yes (Init overload 3) | Explicit word-topic memory size (from saved model)
alias_mem_block_size | int64_t | Yes (Init overload 3) | Explicit alias table memory size (from saved model)
data_block | const LDADataBlock& | Yes (InitFromDataBlock) | Corpus data for computing term frequencies
word_id | int | Yes (SetWordInfo/get_row) | 0-based word index
nonzero_num | int32_t | Yes (SetWordInfo) | Number of topics with nonzero count for this word
fullSparse | bool | Yes (SetWordInfo) | If true, forces sparse storage for all words

Outputs

Name | Type | Description
get_row() | hybrid_map | View of the word-topic count data for a specific word
get_alias_row() | hybrid_alias_map | View of the alias table for a specific word
GetModelStat() | int64_t&, int64_t& | Required memory sizes for model serialization

Memory Layout Diagram

mem_block_:
+--------------------------------------------+
| Word 0 (dense, K entries)                  |
+--------------------------------------------+
| Word 1 (sparse, 2*cap entries: keys|vals)  |
+--------------------------------------------+
| Word 2 (dense, K entries)                  |
+--------------------------------------------+
| ...                                        |
+--------------------------------------------+
| Word V-1                                   |
+--------------------------------------------+

alias_mem_block_:
+--------------------------------------------+
| Word 0 alias (dense, 2*K entries: kv pairs)|
+--------------------------------------------+
| Word 1 alias (sparse, 3*tf entries: kv|idx)|
+--------------------------------------------+
| ...                                        |
+--------------------------------------------+

Usage Examples

// Training path: initialize from data
model_block_->InitFromDataBlock(*data_block_, V_, K_);
for (int i = 0; i < V_; ++i) {
    global_word_topic_table_[i] = model_block_->get_row(i, nullptr);
    global_alias_k_v_[i] = model_block_->get_alias_row(i);
}

// Model save path:
int64_t memSize, aliasSize;
model_block_->GetModelStat(memSize, aliasSize);
// memSize and aliasSize are saved to disk

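// Model load path:
// savedMemSize and savedAliasSize are the values written by the save path.
model_block_->Init(V_, K_, savedMemSize, savedAliasSize);
for (int w = 0; w < V_; ++w) {
    // fullSparse = true forces sparse storage for every word.
    model_block_->SetWordInfo(w, nonzero_count[w], true);
    global_word_topic_table_[w] = model_block_->get_row(w, nullptr);
    // topics and probs are assumed to hold the nonzero (topic, count) pairs of word w.
    for (int i = 0; i < nonzero_count[w]; ++i) {
        global_word_topic_table_[w].inc(topics[i], probs[i]);
    }
}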
