Implementation:Dotnet Machinelearning LdaDataBlock

From Leeroopedia


Knowledge Sources
Domains Topic_Modeling, Data_Management, NLP
Last Updated 2026-02-09 12:00 GMT

Overview

LDADataBlock manages the corpus-level document-term data using a flat buffer with offset-based indexing, providing efficient storage and retrieval of individual documents for multi-threaded LDA training and inference.

Description

The LDADataBlock class (in the lda namespace) serves as the in-memory representation of the entire document corpus. It stores all documents in a single contiguous int32_t buffer (documents_buffer_) with an offset array (offset_buffer_) that enables O(1) access to any document by index.

Memory layout:

  • documents_buffer_: A flat array of size corpus_size_. Each document is stored as a sequence of (word_id, topic_id) pairs, preceded by one unused int32_t (legacy header). So for a document with N total tokens, the storage is: [header][word0][topic0][word1][topic1]...[wordN-1][topicN-1], consuming 2*N + 1 int32_t values.
  • offset_buffer_: An array of size num_documents_ + 1. offset_buffer_[i] gives the starting position of document i in documents_buffer_, measured in int32_t elements (not bytes). offset_buffer_[num_documents_] marks the end of the last document.
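
To make the layout concrete, the following is a minimal sketch of the two buffers and the offset arithmetic. FlatCorpus, SlotCount, TokenCount, and MakeExampleCorpus are hypothetical names introduced for illustration, and std::vector stands in for the raw pre-allocated arrays the class is described as using; this is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two buffers; not the real LDADataBlock.
struct FlatCorpus {
    std::vector<int32_t> documents;  // [header][word][topic][word][topic]...
    std::vector<int64_t> offsets;    // offsets[i] = first element of document i

    // Number of int32_t slots occupied by document i (header included).
    int64_t SlotCount(int32_t i) const {
        assert(i >= 0 && static_cast<std::size_t>(i) + 1 < offsets.size());
        return offsets[i + 1] - offsets[i];
    }

    // Number of tokens in document i: drop the header, then 2 slots per token.
    int64_t TokenCount(int32_t i) const {
        return (SlotCount(i) - 1) / 2;
    }
};

// Builds a two-document corpus by hand: doc 0 has 2 tokens, doc 1 has 1 token.
inline FlatCorpus MakeExampleCorpus() {
    FlatCorpus c;
    c.documents = {0, 7, 0, 9, 0,   // doc 0: header, (word 7, topic 0), (word 9, topic 0)
                   0, 4, 0};        // doc 1: header, (word 4, topic 0)
    c.offsets = {0, 5, 8};          // num_documents + 1 entries
    return c;
}
```

Because every document boundary is stored explicitly, locating a document costs two array reads regardless of corpus size.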

Data ingestion:

  • Allocate(num_document, corpus_size): Pre-allocates both buffers. Called once before feeding data.
  • Add(term_id, term_freq, term_num): Adds a sparse document. For each term, the term frequency determines how many (word_id, 0) pairs are written (topic initialized to 0). The offset is recorded for the next document.
  • AddDense(term_freq, term_num): Adds a dense document where term IDs are implicit (0 to term_num-1). Term frequencies determine repetition count.
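
The dense path can be sketched as follows. AppendDense is a hypothetical helper, and it grows std::vector buffers on demand, whereas the class is described as writing into memory pre-allocated by Allocate; treat it as an illustration of the encoding only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends one dense document to a flat buffer and
// returns the number of int32_t slots written (2 * tokens + 1).
inline int AppendDense(std::vector<int32_t>& buffer,
                       std::vector<int64_t>& offsets,
                       const int32_t* term_freq, int32_t term_num) {
    assert(term_freq != nullptr && term_num >= 0);
    if (offsets.empty()) offsets.push_back(0);    // start of the first document
    const std::size_t start = buffer.size();
    buffer.push_back(0);                          // unused header slot
    for (int32_t term = 0; term < term_num; ++term) {
        for (int32_t j = 0; j < term_freq[term]; ++j) {
            buffer.push_back(term);               // implicit term ID = array position
            buffer.push_back(0);                  // topic starts at 0
        }
    }
    // Record where the next document will begin.
    offsets.push_back(static_cast<int64_t>(buffer.size()));
    return static_cast<int>(buffer.size() - start);
}
```

A term with frequency 0 contributes no pairs, so a dense document with many zero entries occupies the same space as its sparse equivalent.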

Document retrieval:

  • GetOneDoc(index): Returns a shared_ptr<LDADocument> wrapping a view into the documents_buffer_ slice for the given document index. The LDADocument provides Word(i), Topic(i), SetTopic(i, t), size(), and get_cursor() methods.

Thread partitioning:

  • Begin(thread_id) and End(thread_id) divide documents evenly across num_threads_ threads. The last thread gets any remainder: thread_id = num_threads_-1 handles documents from (num_threads_-1)*chunk to num_documents_.
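
The partitioning rule above can be sketched as two free functions. PartitionBegin and PartitionEnd are hypothetical names, and the use of integer division for the chunk size is an assumption consistent with the description rather than a quote of the source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical free functions mirroring the described Begin/End behaviour.
inline int32_t PartitionBegin(int32_t thread_id, int32_t num_threads,
                              int32_t num_documents) {
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    const int32_t chunk = num_documents / num_threads;  // integer division
    return thread_id * chunk;
}

inline int32_t PartitionEnd(int32_t thread_id, int32_t num_threads,
                            int32_t num_documents) {
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    const int32_t chunk = num_documents / num_threads;
    // The last thread absorbs the remainder left over by integer division.
    if (thread_id == num_threads - 1) return num_documents;
    return (thread_id + 1) * chunk;
}
```

With 10 documents and 3 threads, this yields the half-open ranges [0, 3), [3, 6), and [6, 10).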

Corpus size calculation: The caller (managed code) must pre-compute the total corpus_size as the sum of (2 * sum_of_term_frequencies + 1) across all documents, accounting for the per-document header and the word-topic pair encoding.
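
As a worked example of this arithmetic, the sketch below computes the required buffer size from per-document term frequencies. DocumentSlots and CorpusSlots are hypothetical helpers; the real caller is managed C# code, and C++ is used here only to match the rest of the page.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: slots needed for one document = 2 * token count + 1.
inline int64_t DocumentSlots(const std::vector<int32_t>& term_freq) {
    int64_t tokens = 0;
    for (int32_t f : term_freq) {
        assert(f >= 0);  // frequencies are counts, so never negative
        tokens += f;
    }
    return 2 * tokens + 1;  // word-topic pairs plus the header slot
}

// Hypothetical helper: total documents_buffer_ size for the whole corpus.
inline int64_t CorpusSlots(const std::vector<std::vector<int32_t>>& corpus) {
    int64_t total = 0;
    for (const auto& doc : corpus) total += DocumentSlots(doc);
    return total;
}
```

For a corpus of two documents with frequencies {2, 3} and {1}, the result is (2*5 + 1) + (2*1 + 1) = 14 int32_t elements.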

Usage

LDADataBlock is created internally by LdaEngine and is not instantiated directly by callers. The managed C# code calls AllocateDataMemory once, then FeedInData or FeedInDataDense for each document. During training and testing, each thread processes its partition of documents by calling GetOneDoc in a loop.

Code Reference

Source Location

Signature

namespace lda {
    class LDADataBlock {
    public:
        explicit LDADataBlock(int32_t num_threads);
        ~LDADataBlock();

        void Clear();
        void Allocate(const int32_t num_document, const int64_t corpus_size);

        int Add(int32_t* term_id, int32_t* term_freq, int32_t term_num);
        int AddDense(int32_t* term_freq, int32_t term_num);
        std::shared_ptr<LDADocument> GetOneDoc(int32_t index) const;

        int32_t num_documents() const;
        int32_t Begin(int32_t thread_id) const;
        int32_t End(int32_t thread_id) const;

    private:
        int32_t num_threads_;
        bool has_read_;
        int32_t index_document_;
        int64_t used_size_;
        int32_t num_documents_;
        size_t corpus_size_;
        int64_t* offset_buffer_;
        int32_t* documents_buffer_;
    };
}

Import

// LDADataBlock is an internal C++ class. Data is fed via exported C functions:
[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
    SafeLdaEngineHandle engine, int numDocument, long corpusSize);

[DllImport("LdaNative")]
private static extern int FeedInData(
    SafeLdaEngineHandle engine, int* termId, int* termFreq,
    int termNum, int vocabSize);

[DllImport("LdaNative")]
private static extern int FeedInDataDense(
    SafeLdaEngineHandle engine, int* termFreq,
    int termNum, int vocabSize);

I/O Contract

Inputs

Name | Type | Required | Description
num_document | int32_t | Yes (Allocate) | Total number of documents in the corpus
corpus_size | int64_t | Yes (Allocate) | Total size of documents_buffer_ in int32_t elements
term_id | int32_t* | Yes (Add) | Array of unique term IDs for the document
term_freq | int32_t* | Yes | Array of frequencies for each term
term_num | int32_t | Yes | Number of unique terms in the document
index | int32_t | Yes (GetOneDoc) | 0-based document index
thread_id | int32_t | Yes (Begin/End) | 0-based thread index for partitioning

Outputs

Name | Type | Description
Add/AddDense return | int | Data length consumed (2 * total_token_count + 1)
GetOneDoc return | shared_ptr<LDADocument> | Document view with Word(), Topic(), SetTopic(), size() methods
num_documents() | int32_t | Total number of documents stored
Begin(tid) | int32_t | First document index for thread tid
End(tid) | int32_t | One-past-last document index for thread tid

Data Encoding Detail

Each document is encoded as follows in documents_buffer_:

// For a document with terms: {term_a: freq=2, term_b: freq=3}
// Layout: [unused_header] [term_a] [topic_0] [term_a] [topic_0]
//         [term_b] [topic_0] [term_b] [topic_0] [term_b] [topic_0]
// Total length = 1 + 2*5 = 11 int32_t values

// Add method:
int64_t idx = offset_buffer_[index_document_] + 1;  // skip header
for (int i = 0; i < term_num; ++i) {
    for (int j = 0; j < term_freq[i]; ++j) {
        documents_buffer_[idx++] = term_id[i];  // word
        documents_buffer_[idx++] = 0;           // topic (initialized to 0)
    }
}

The LDADocument wrapper interprets this layout. With start = offset_buffer_[index] and end = offset_buffer_[index + 1], it provides:

  • Word(i): Returns documents_buffer_[start + 1 + 2*i]
  • Topic(i): Returns documents_buffer_[start + 1 + 2*i + 1]
  • SetTopic(i, t): Sets documents_buffer_[start + 1 + 2*i + 1] = t
  • size(): Returns (end - start - 1) / 2
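
The accessor arithmetic can be sketched with a small non-owning view. DocView is a hypothetical type written for illustration; the real LDADocument may differ in construction and members (for example, it also exposes get_cursor()).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view over one document's slice of the flat buffer.
// It does not own the memory; writes go straight to the shared buffer.
class DocView {
public:
    DocView(int32_t* begin, int32_t* end) : begin_(begin), end_(end) {
        assert(begin != nullptr && end - begin >= 1);  // at least the header
    }

    // Token count: drop the header, then 2 slots per (word, topic) pair.
    int32_t size() const {
        return static_cast<int32_t>((end_ - begin_ - 1) / 2);
    }
    int32_t Word(int32_t i) const { return begin_[1 + 2 * i]; }
    int32_t Topic(int32_t i) const { return begin_[1 + 2 * i + 1]; }
    void SetTopic(int32_t i, int32_t t) { begin_[1 + 2 * i + 1] = t; }

private:
    int32_t* begin_;  // points at the header slot
    int32_t* end_;    // one past the last topic slot
};
```

Because the view aliases the buffer, a topic assignment made by one sampling pass is visible to any later view of the same document without copying.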

Usage Examples

// From C# managed code:
// 1. Allocate data memory for the corpus
AllocateDataMemory(engine, numDocument: 10000, corpusSize: 5000000);

// 2. Feed each document
foreach (var doc in corpus) {
    fixed (int* pTermId = doc.TermIds, pTermFreq = doc.TermFreqs) {
        FeedInData(engine, pTermId, pTermFreq, doc.NumUniqueTerms, vocabSize);
    }
}

// 3. Initialize and train (data block is used internally)
InitializeBeforeTrain(engine);
Train(engine, null);
