Implementation:Dotnet Machinelearning LdaDataBlock

Knowledge Sources	Dotnet_Machinelearning
Domains	Topic_Modeling, Data_Management, NLP
Last Updated	2026-02-09 12:00 GMT

Overview

LDADataBlock manages the corpus-level document-term data using a flat buffer with offset-based indexing, providing efficient storage and retrieval of individual documents for multi-threaded LDA training and inference.

Description

The LDADataBlock class (in the lda namespace) serves as the in-memory representation of the entire document corpus. It stores all documents in a single contiguous int32_t buffer (documents_buffer_) with an offset array (offset_buffer_) that enables O(1) access to any document by index.

Memory layout:

documents_buffer_: A flat array of size corpus_size_. Each document is stored as a sequence of (word_id, topic_id) pairs, preceded by one unused int32_t (legacy header). So for a document with N total tokens, the storage is: [header][word0][topic0][word1][topic1]...[wordN-1][topicN-1], consuming 2*N + 1 int32_t values.
offset_buffer_: An array of size num_documents_ + 1. offset_buffer_[i] gives the starting element index (in int32_t units, not bytes) of document i in documents_buffer_. offset_buffer_[num_documents_] marks the end of the last document.
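
The layout above can be modeled with standard containers. The following is a minimal sketch, not the library's code: FlatCorpus, AppendDocument, and SlotCount are hypothetical names, and std::vector stands in for the raw buffers that the real class manages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two buffers described above.
struct FlatCorpus {
    std::vector<int32_t> documents;   // plays the role of documents_buffer_
    std::vector<int64_t> offsets{0};  // plays the role of offset_buffer_

    // Appends one document given its expanded token list (one word id per token).
    void AppendDocument(const std::vector<int32_t>& words) {
        documents.push_back(0);       // unused header slot
        for (int32_t w : words) {
            documents.push_back(w);   // word id
            documents.push_back(0);   // topic id, initialized to 0
        }
        // The end of this document is the start of the next one.
        offsets.push_back(static_cast<int64_t>(documents.size()));
    }

    // Number of int32_t slots occupied by document i: 2 * N + 1.
    int64_t SlotCount(int32_t i) const {
        assert(i >= 0 && static_cast<std::size_t>(i) + 1 < offsets.size());
        return offsets[i + 1] - offsets[i];
    }
};
```

Because every document's start index is stored, locating document i is a single array lookup regardless of how long the preceding documents are.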

Data ingestion:

Allocate(num_document, corpus_size): Pre-allocates both buffers. Called once before feeding data.
Add(term_id, term_freq, term_num): Adds a sparse document. For each term i, term_freq[i] pairs of (term_id[i], 0) are written, so every topic starts at 0. The end offset is then recorded as the start of the next document.
AddDense(term_freq, term_num): Adds a dense document where term IDs are implicit (0 to term_num-1). Term frequencies determine repetition count.
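
To illustrate the dense case, the sketch below expands a frequency vector into the same [header][word][topic]... encoding. EncodeDense is a hypothetical helper written for this page. It returns a fresh vector rather than writing into documents_buffer_, so it shows only the encoding rule, not how the real method does its bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: term ids are implicit and equal to the array position.
std::vector<int32_t> EncodeDense(const std::vector<int32_t>& term_freq) {
    std::vector<int32_t> out;
    out.push_back(0);  // unused header slot
    for (std::size_t term = 0; term < term_freq.size(); ++term) {
        assert(term_freq[term] >= 0);
        // A term with frequency f contributes f (word, topic) pairs.
        for (int32_t j = 0; j < term_freq[term]; ++j) {
            out.push_back(static_cast<int32_t>(term));  // implicit word id
            out.push_back(0);                           // topic initialized to 0
        }
    }
    return out;
}
```

Terms with a frequency of zero contribute nothing, so the encoded length still equals 2 * total_token_count + 1.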

Document retrieval:

GetOneDoc(index): Returns a shared_ptr<LDADocument> wrapping a view into the documents_buffer_ slice for the given document index. The LDADocument provides Word(i), Topic(i), SetTopic(i, t), size(), and get_cursor() methods.

Thread partitioning:

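Begin(thread_id) and End(thread_id) divide documents across num_threads_ threads in contiguous chunks of num_documents_ / num_threads_ documents (integer division). Thread t handles the half-open range [Begin(t), End(t)). The last thread also takes any remainder: thread_id = num_threads_-1 handles documents from (num_threads_-1)*chunk to num_documents_.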
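
The partitioning rule can be sketched as two free functions. PartitionBegin and PartitionEnd are hypothetical names, and the sketch assumes the chunk size is num_documents / num_threads with integer division, which the description implies but does not state outright.

```cpp
#include <cassert>
#include <cstdint>

// First document index handled by thread_id.
int32_t PartitionBegin(int32_t thread_id, int32_t num_documents, int32_t num_threads) {
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    const int32_t chunk = num_documents / num_threads;  // integer division
    return thread_id * chunk;
}

// One-past-last document index handled by thread_id.
int32_t PartitionEnd(int32_t thread_id, int32_t num_documents, int32_t num_threads) {
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    if (thread_id == num_threads - 1) {
        return num_documents;  // last thread absorbs the remainder
    }
    const int32_t chunk = num_documents / num_threads;
    return (thread_id + 1) * chunk;
}
```

With 10 documents and 3 threads, this yields the ranges [0, 3), [3, 6), and [6, 10), so the last thread processes one extra document.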

Corpus size calculation: The caller (managed code) must pre-compute the total corpus_size as the sum of (2 * sum_of_term_frequencies + 1) across all documents, accounting for the per-document header and the word-topic pair encoding.
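
As a worked illustration of this formula, the sketch below computes the corpus size from per-document term frequencies. ComputeCorpusSize is a hypothetical helper. In practice the managed caller performs this computation, and the sketch is written in C++ only to match the rest of this page.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each inner vector holds the term frequencies of one document.
int64_t ComputeCorpusSize(const std::vector<std::vector<int32_t>>& doc_term_freqs) {
    int64_t total = 0;
    for (const auto& freqs : doc_term_freqs) {
        int64_t tokens = 0;
        for (int32_t f : freqs) {
            assert(f >= 0);
            tokens += f;          // total token count of this document
        }
        total += 2 * tokens + 1;  // word-topic pairs plus one header slot
    }
    return total;
}
```

For two documents with frequencies {2, 3} and {1}, the result is (2*5 + 1) + (2*1 + 1) = 14 int32_t values.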

Usage

Created internally by LdaEngine. The managed C# code calls AllocateDataMemory then FeedInData/FeedInDataDense for each document. During training/testing, each thread processes its partition of documents by calling GetOneDoc in a loop.

Code Reference

Source Location

Repository: Dotnet_Machinelearning
File: src/Native/LdaNative/data_block.cpp (127 lines)
File: src/Native/LdaNative/data_block.h (70 lines)

Signature

namespace lda {
    class LDADataBlock {
    public:
        explicit LDADataBlock(int32_t num_threads);
        ~LDADataBlock();

        void Clear();
        void Allocate(const int32_t num_document, const int64_t corpus_size);

        int Add(int32_t* term_id, int32_t* term_freq, int32_t term_num);
        int AddDense(int32_t* term_freq, int32_t term_num);
        std::shared_ptr<LDADocument> GetOneDoc(int32_t index) const;

        int32_t num_documents() const;
        int32_t Begin(int32_t thread_id) const;
        int32_t End(int32_t thread_id) const;

    private:
        int32_t num_threads_;
        bool has_read_;
        int32_t index_document_;
        int64_t used_size_;
        int32_t num_documents_;
        size_t corpus_size_;
        int64_t* offset_buffer_;
        int32_t* documents_buffer_;
    };
}

Import

// LDADataBlock is an internal C++ class. Data is fed via exported C functions:
[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
    SafeLdaEngineHandle engine, int numDocument, long corpusSize);

[DllImport("LdaNative")]
private static extern int FeedInData(
    SafeLdaEngineHandle engine, int* termId, int* termFreq,
    int termNum, int vocabSize);

[DllImport("LdaNative")]
private static extern int FeedInDataDense(
    SafeLdaEngineHandle engine, int* termFreq,
    int termNum, int vocabSize);

I/O Contract

Inputs

Name	Type	Required	Description
num_document	int32_t	Yes (Allocate)	Total number of documents in the corpus
corpus_size	int64_t	Yes (Allocate)	Total size of documents_buffer_ in int32_t elements
term_id	int32_t*	Yes (Add)	Array of unique term IDs for the document
term_freq	int32_t*	Yes	Array of frequencies for each term
term_num	int32_t	Yes	Number of unique terms in the document
index	int32_t	Yes (GetOneDoc)	0-based document index
thread_id	int32_t	Yes (Begin/End)	0-based thread index for partitioning

Outputs

Name	Type	Description
Add/AddDense return	int	Data length consumed (2 * total_token_count + 1)
GetOneDoc return	shared_ptr<LDADocument>	Document view with Word(), Topic(), SetTopic(), size() methods
num_documents()	int32_t	Total number of documents stored
Begin(tid)	int32_t	First document index for thread tid
End(tid)	int32_t	One-past-last document index for thread tid

Data Encoding Detail

Each document is encoded as follows in documents_buffer_:

// For a document with terms: {term_a: freq=2, term_b: freq=3}
// Layout: [unused_header] [term_a] [topic_0] [term_a] [topic_0]
//         [term_b] [topic_0] [term_b] [topic_0] [term_b] [topic_0]
// Total length = 1 + 2*5 = 11 int32_t values

// Add method:
int64_t idx = offset_buffer_[index_document_] + 1;  // skip header
for (int i = 0; i < term_num; ++i) {
    for (int j = 0; j < term_freq[i]; ++j) {
        documents_buffer_[idx++] = term_id[i];  // word
        documents_buffer_[idx++] = 0;           // topic (initialized to 0)
    }
}

The LDADocument wrapper interprets this layout, providing:

Word(i): Returns documents_buffer_[start + 1 + 2*i]
Topic(i): Returns documents_buffer_[start + 1 + 2*i + 1]
SetTopic(i, t): Sets documents_buffer_[start + 1 + 2*i + 1] = t
size(): Returns (end - start - 1) / 2
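
The accessor arithmetic above can be sketched as a small view class over a slice of the buffer. DocView is a hypothetical type for illustration; the real LDADocument may be constructed differently and exposes further members such as get_cursor().

```cpp
#include <cassert>
#include <cstdint>

// Non-owning view over one document's slice: [header][word0][topic0]...
class DocView {
public:
    DocView(int32_t* begin, int32_t* end) : begin_(begin), end_(end) {
        assert(begin != nullptr && end - begin >= 1);
    }
    // Token count: drop the header slot, then two slots per token.
    int32_t size() const { return static_cast<int32_t>((end_ - begin_ - 1) / 2); }
    int32_t Word(int32_t i) const { return begin_[1 + 2 * i]; }
    int32_t Topic(int32_t i) const { return begin_[1 + 2 * i + 1]; }
    // Writes through to the underlying buffer.
    void SetTopic(int32_t i, int32_t t) { begin_[1 + 2 * i + 1] = t; }

private:
    int32_t* begin_;  // points at the header slot
    int32_t* end_;    // one past the last topic slot
};
```

Because the view does not copy, a topic assignment made through SetTopic is immediately visible in the shared buffer.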

Usage Examples

// From C# managed code:
// 1. Allocate data memory for the corpus.
//    corpusSize must equal the sum of (2 * token_count + 1) over all documents.
AllocateDataMemory(engine, numDocument: 10000, corpusSize: 5000000);

// 2. Feed each document (the fixed statement requires an unsafe context)
foreach (var doc in corpus) {
    fixed (int* pTermId = doc.TermIds, pTermFreq = doc.TermFreqs) {
        FeedInData(engine, pTermId, pTermFreq, doc.NumUniqueTerms, vocabSize);
    }
}

// 3. Initialize and train (data block is used internally)
InitializeBeforeTrain(engine);
Train(engine, null);
