Implementation:Dotnet Machinelearning LdaDataBlock
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, Data_Management, NLP |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
LDADataBlock manages the corpus-level document-term data using a flat buffer with offset-based indexing, providing efficient storage and retrieval of individual documents for multi-threaded LDA training and inference.
Description
The LDADataBlock class (in the lda namespace) serves as the in-memory representation of the entire document corpus. It stores all documents in a single contiguous int32_t buffer (documents_buffer_) with an offset array (offset_buffer_) that enables O(1) access to any document by index.
Memory layout:
- documents_buffer_: A flat array of size corpus_size_. Each document is stored as a sequence of (word_id, topic_id) pairs, preceded by one unused int32_t (legacy header). So for a document with N total tokens, the storage is: [header][word0][topic0][word1][topic1]...[wordN-1][topicN-1], consuming 2*N + 1 int32_t values.
- offset_buffer_: An array of size num_documents_ + 1. offset_buffer_[i] gives the starting index of document i in documents_buffer_, counted in int32_t elements rather than bytes. offset_buffer_[num_documents_] marks the end of the last document.
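The relationship between the two buffers can be sketched as follows. This is a minimal illustration, not code from LdaNative: `BuildOffsets` is a hypothetical helper that derives the offsets implied by the layout from per-document token counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LdaNative): given the token count of each
// document, compute the offsets implied by the layout described above.
// Offsets are counted in int32_t elements of documents_buffer_.
std::vector<int64_t> BuildOffsets(const std::vector<int64_t>& token_counts) {
    std::vector<int64_t> offsets(token_counts.size() + 1, 0);
    for (std::size_t i = 0; i < token_counts.size(); ++i) {
        // One header slot plus one (word, topic) pair per token.
        offsets[i + 1] = offsets[i] + 2 * token_counts[i] + 1;
    }
    return offsets;
}
```

Under this sketch, document i occupies the half-open range [offsets[i], offsets[i + 1]), and an empty document still takes one slot for its header.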
Data ingestion:
- Allocate(num_document, corpus_size): Pre-allocates both buffers. Called once before feeding data.
- Add(term_id, term_freq, term_num): Adds a sparse document. Each term i contributes term_freq[i] pairs of the form (term_id[i], 0), so every token starts with topic 0. The document's end position is then stored in offset_buffer_ as the start of the next document.
- AddDense(term_freq, term_num): Adds a dense document where term IDs are implicit (0 to term_num-1). Term frequencies determine repetition count.
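The dense variant can be illustrated with a small stand-in. This is a sketch under stated assumptions: `AppendDense` is a hypothetical function, and it appends to a growing std::vector, whereas the class described here writes into a buffer pre-allocated by Allocate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for AddDense (assumption: a growing vector replaces
// the pre-allocated documents_buffer_). Appends one dense document and
// returns the number of int32_t slots it used.
int64_t AppendDense(std::vector<int32_t>& buffer,
                    const std::vector<int32_t>& term_freq) {
    const std::size_t start = buffer.size();
    buffer.push_back(0);  // unused header slot
    for (std::size_t term = 0; term < term_freq.size(); ++term) {
        for (int32_t j = 0; j < term_freq[term]; ++j) {
            buffer.push_back(static_cast<int32_t>(term));  // word id is the array position
            buffer.push_back(0);                           // topic starts at 0
        }
    }
    return static_cast<int64_t>(buffer.size() - start);
}
```

A term with frequency 0 writes nothing, so the dense form yields the same encoding as the sparse form would for the same document.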
Document retrieval:
- GetOneDoc(index): Returns a shared_ptr<LDADocument> wrapping a view into the documents_buffer_ slice for the given document index. The LDADocument provides Word(i), Topic(i), SetTopic(i, t), size(), and get_cursor() methods.
Thread partitioning:
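- Begin(thread_id) and End(thread_id) split the documents into contiguous ranges of chunk = num_documents_ / num_threads_ documents each (integer division), so thread t handles documents t*chunk up to, but not including, (t+1)*chunk. The last thread also takes any remainder: thread_id = num_threads_-1 handles documents from (num_threads_-1)*chunk to num_documents_.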
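The partitioning rule can be sketched with two free functions. These are hypothetical (`BeginIndex` and `EndIndex` are illustrative names), and the sketch assumes the per-thread chunk is the integer quotient of the document count and the thread count.

```cpp
#include <cstdint>

// Hypothetical free functions mirroring the partitioning rule described above.
// Assumption: chunk = num_documents / num_threads using integer division.
int32_t BeginIndex(int32_t thread_id, int32_t num_threads, int32_t num_documents) {
    const int32_t chunk = num_documents / num_threads;
    return thread_id * chunk;
}

int32_t EndIndex(int32_t thread_id, int32_t num_threads, int32_t num_documents) {
    // The last thread absorbs the remainder.
    if (thread_id == num_threads - 1) {
        return num_documents;
    }
    const int32_t chunk = num_documents / num_threads;
    return (thread_id + 1) * chunk;
}
```

With 10 documents and 3 threads, this gives the ranges [0, 3), [3, 6), and [6, 10). The ranges are disjoint and cover every document, so threads never touch each other's documents.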
Corpus size calculation: The caller (managed code) must pre-compute the total corpus_size as the sum of (2 * sum_of_term_frequencies + 1) across all documents, accounting for the per-document header and the word-topic pair encoding.
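The arithmetic can be sketched as below. The real caller is managed C# code; this C++ version only illustrates the formula, and `CorpusSize` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper illustrating the corpus_size formula: each inner vector
// holds the term frequencies of one document.
int64_t CorpusSize(const std::vector<std::vector<int32_t>>& term_freqs_per_doc) {
    int64_t total = 0;
    for (const auto& freqs : term_freqs_per_doc) {
        int64_t tokens = 0;
        for (int32_t f : freqs) {
            tokens += f;
        }
        total += 2 * tokens + 1;  // word/topic pairs plus the header slot
    }
    return total;
}
```

For example, three documents with term frequencies {2, 3}, {} and {1} need 11 + 1 + 3 = 15 int32_t slots.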
Usage
Created internally by LdaEngine. The managed C# code calls AllocateDataMemory then FeedInData/FeedInDataDense for each document. During training/testing, each thread processes its partition of documents by calling GetOneDoc in a loop.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/LdaNative/data_block.cpp (127 lines)
- File: src/Native/LdaNative/data_block.h (70 lines)
Signature
namespace lda {
    class LDADataBlock {
    public:
        explicit LDADataBlock(int32_t num_threads);
        ~LDADataBlock();
        void Clear();
        void Allocate(const int32_t num_document, const int64_t corpus_size);
        int Add(int32_t* term_id, int32_t* term_freq, int32_t term_num);
        int AddDense(int32_t* term_freq, int32_t term_num);
        std::shared_ptr<LDADocument> GetOneDoc(int32_t index) const;
        int32_t num_documents() const;
        int32_t Begin(int32_t thread_id) const;
        int32_t End(int32_t thread_id) const;
    private:
        int32_t num_threads_;
        bool has_read_;
        int32_t index_document_;
        int64_t used_size_;
        int32_t num_documents_;
        size_t corpus_size_;
        int64_t* offset_buffer_;
        int32_t* documents_buffer_;
    };
}
Import
// LDADataBlock is an internal C++ class. Data is fed via exported C functions:
[DllImport("LdaNative")]
private static extern void AllocateDataMemory(
    SafeLdaEngineHandle engine, int numDocument, long corpusSize);

[DllImport("LdaNative")]
private static extern int FeedInData(
    SafeLdaEngineHandle engine, int* termId, int* termFreq,
    int termNum, int vocabSize);

[DllImport("LdaNative")]
private static extern int FeedInDataDense(
    SafeLdaEngineHandle engine, int* termFreq,
    int termNum, int vocabSize);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| num_document | int32_t | Yes (Allocate) | Total number of documents in the corpus |
| corpus_size | int64_t | Yes (Allocate) | Total size of documents_buffer_ in int32_t elements |
| term_id | int32_t* | Yes (Add) | Array of unique term IDs for the document |
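| term_freq | int32_t* | Yes (Add/AddDense) | Array of frequencies for each term |
| term_num | int32_t | Yes (Add/AddDense) | Number of entries in term_freq: the unique terms of the document for Add, or the implicit term IDs 0 to term_num-1 for AddDense |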
| index | int32_t | Yes (GetOneDoc) | 0-based document index |
| thread_id | int32_t | Yes (Begin/End) | 0-based thread index for partitioning |
Outputs
| Name | Type | Description |
|---|---|---|
| Add/AddDense return | int | Data length consumed, in int32_t elements (2 * total_token_count + 1) |
| GetOneDoc return | shared_ptr<LDADocument> | Document view with Word(), Topic(), SetTopic(), size() methods |
| num_documents() | int32_t | Total number of documents stored |
| Begin(tid) | int32_t | First document index for thread tid |
| End(tid) | int32_t | One-past-last document index for thread tid |
Data Encoding Detail
Each document is encoded as follows in documents_buffer_:
// For a document with terms: {term_a: freq=2, term_b: freq=3}
// Layout: [unused_header] [term_a] [topic_0] [term_a] [topic_0]
//         [term_b] [topic_0] [term_b] [topic_0] [term_b] [topic_0]
// Total length = 1 + 2*5 = 11 int32_t values

// Add method:
int64_t idx = offset_buffer_[index_document_] + 1; // skip header
for (int i = 0; i < term_num; ++i) {
    for (int j = 0; j < term_freq[i]; ++j) {
        documents_buffer_[idx++] = term_id[i]; // word
        documents_buffer_[idx++] = 0;          // topic (initialized to 0)
    }
}
The LDADocument wrapper interprets this layout. With start = offset_buffer_[index] and end = offset_buffer_[index + 1], it provides:
- Word(i): Returns documents_buffer_[start + 1 + 2*i]
- Topic(i): Returns documents_buffer_[start + 1 + 2*i + 1]
- SetTopic(i, t): Sets documents_buffer_[start + 1 + 2*i + 1] = t
- size(): Returns (end - start - 1) / 2
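The accessor arithmetic can be tried out with a small view type. This is a minimal sketch: `DocView` is a hypothetical class that mirrors the formulas above, and the real LDADocument may differ in members and naming.

```cpp
#include <cstdint>

// Hypothetical view over one document's slice of the flat buffer.
// `begin` points at the header slot, `end` one past the last topic slot.
class DocView {
public:
    DocView(int32_t* begin, int32_t* end) : begin_(begin), end_(end) {}

    // Number of tokens: drop the header, then count (word, topic) pairs.
    int32_t size() const { return static_cast<int32_t>((end_ - begin_ - 1) / 2); }
    int32_t Word(int32_t i) const { return begin_[1 + 2 * i]; }
    int32_t Topic(int32_t i) const { return begin_[1 + 2 * i + 1]; }
    void SetTopic(int32_t i, int32_t t) { begin_[1 + 2 * i + 1] = t; }

private:
    int32_t* begin_;
    int32_t* end_;
};
```

Because the view holds pointers into the shared buffer rather than a copy, SetTopic writes straight back into the buffer, so topic assignments remain in place for later passes over the same document.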
Usage Examples
// From C# managed code (the fixed statement and int* parameters require an unsafe context):
// 1. Allocate data memory for the corpus.
//    corpusSize must equal the sum of (2 * token_count + 1) over all documents;
//    the values below are placeholders.
AllocateDataMemory(engine, numDocuments: 10000, corpusSize: 5000000);

// 2. Feed each document
foreach (var doc in corpus) {
    fixed (int* pTermId = doc.TermIds, pTermFreq = doc.TermFreqs) {
        FeedInData(engine, pTermId, pTermFreq, doc.NumUniqueTerms, vocabSize);
    }
}

// 3. Initialize and train (data block is used internally)
InitializeBeforeTrain(engine);
Train(engine, null);