Implementation:Dotnet Machinelearning LdaHybridMap
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, Data_Structures, Hash_Tables |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
hybrid_map is an adaptive dense/sparse hash table that stores word-topic counts for a single word in the LDA model, switching between a dense array and a quadratic-probing hash table based on the word's term frequency.
Description
The hybrid_map class (in the lda namespace) provides a unified interface for accessing topic counts n(w,k) for a single word w across all K topics. The global word-topic matrix in LdaEngine is stored as a vector<hybrid_map> of size V (vocabulary size), where each element represents one row.
Two storage modes:
Dense mode (is_dense_ = 1):
- The memory_ pointer points to a contiguous array of K int32_t values, where memory_[k] = n(w,k).
- inc(key, delta): Simply does memory_[key] += delta.
- operator[](key): Returns memory_[key] (or 0 if capacity is 0).
- Used for "hot" words where tf >= num_topics_ / (2 * load_factor_), meaning the word appears in enough topics that a dense representation is more efficient.
Sparse mode (is_dense_ = 0):
- The memory_ pointer points to a flat block of 2 * capacity_ int32_t values, split into key_ (first half) and value_ (second half).
- Keys are stored as internal_key = topic_id + 1 to distinguish from the empty_key_ (0). The deleted_key_ is -1.
- Quadratic probing: Defined by the macro JUMP_(key, num_probes) = num_probes. The probe sequence is: idx = key % capacity_, then idx = (idx + probe_count) & (capacity_ - 1).
- inc(key, delta): Finds the key's position; if found, updates the value and deletes the entry if it becomes zero (setting key to deleted_key_). If deleted keys exceed 5% of capacity (num_deleted_key_ * 20 > capacity_), triggers rehashing(). If not found, inserts a new entry.
- Automatic rehashing: When too many tombstones accumulate, rehashing() copies the table to external_rehash_buf_, clears memory_, and re-inserts all live entries. This buffer is allocated per-thread to avoid contention.
- sorted_rehashing(): A more expensive variant using std::map to re-insert entries in sorted key order, used during model loading.
- Capacity requirement: capacity_ must be a power of 2 (for the bitwise AND masking in probing), and should be at least 2x the expected number of entries (load_factor_ = 2).
Memory ownership: The hybrid_map does not own its memory. The backing store is allocated by LDAModelBlock as a single large mem_block_ array. Each hybrid_map gets a slice via pointer arithmetic: mem_block_ + dict_[word_id].offset_.
Usage
Every word in the vocabulary has one hybrid_map in the global_word_topic_table_. During Gibbs sampling, the sampler calls inc(topic, +1) and inc(topic, -1) to update counts. The operator[] is used to query counts for MH acceptance ratio computation. The build_alias_table in hybrid_alias_map iterates over a hybrid_map's entries to compute proportions.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/LdaNative/hybrid_map.cpp (142 lines)
- File: src/Native/LdaNative/hybrid_map.h (238 lines)
Signature
#define JUMP_(key, num_probes) ( num_probes ) // Quadratic probing
#define ILLEGAL_BUCKET -1
namespace lda {
class hybrid_map {
friend class hybrid_alias_map;
public:
hybrid_map();
hybrid_map(int32_t* memory, int32_t is_dense, int32_t capacity,
int32_t num_deleted_key, int32_t* external_rehash_buf);
hybrid_map(const hybrid_map& other);
hybrid_map& operator=(const hybrid_map& other);
void clear();
std::string DumpString() const;
void sorted_rehashing();
int32_t nonzero_num() const;
bool is_dense() const;
int32_t capacity() const;
int32_t* memory() const;
int32_t* key() const;
int32_t* value() const;
void rehashing();
void inc(int32_t key, int32_t delta);
int32_t operator[](int32_t key) const;
private:
std::pair<int32_t, int32_t> find_position(const int32_t key) const;
int32_t* memory_;
int32_t is_dense_;
int32_t* key_;
int32_t* value_;
int32_t capacity_;
int32_t empty_key_; // = 0
int32_t deleted_key_; // = -1
int32_t num_deleted_key_;
int32_t* external_rehash_buf_;
};
}
Import
// hybrid_map is an internal C++ class; not directly exposed via P/Invoke.
// The managed code accesses word-topic data through LdaEngine's
// GetWordTopic/SetWordTopic exported functions.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| memory | int32_t* | Yes | Pointer to backing memory (owned by LDAModelBlock) |
| is_dense | int32_t | Yes | 1 for dense array mode, 0 for sparse hash table mode |
| capacity | int32_t | Yes | Dense: array length (K). Sparse: hash table size (power of 2, >= 2 * expected entries) |
| num_deleted_key | int32_t | No | Initial count of tombstone entries (typically 0) |
| external_rehash_buf | int32_t* | No | Per-thread buffer of size 2*capacity for rehashing (sparse mode only) |
| key (inc) | int32_t | Yes | Topic ID to increment (0-based) |
| delta (inc) | int32_t | Yes | Count change (+1 or -1 typically) |
Outputs
| Name | Type | Description |
|---|---|---|
| operator[] return | int32_t | Count n(w,k) for the queried topic k, or 0 if not present |
| nonzero_num() | int32_t | Number of topics with non-zero counts for this word |
| is_dense() | bool | Whether this row uses dense storage |
| DumpString() | string | Human-readable representation of all nonzero entries as "topic:count" pairs |
Probing Algorithm
The find_position() method implements quadratic probing with tombstone handling:
// Returns pair<found_pos, insert_pos>:
// found_pos != ILLEGAL_BUCKET => key exists at found_pos
// found_pos == ILLEGAL_BUCKET => key not found, can insert at insert_pos
std::pair<int32_t, int32_t> find_position(const int32_t key) const {
int num_probes = 0;
int32_t idx = key % capacity_;
int32_t insert_pos = ILLEGAL_BUCKET;
while (1) {
if (key_[idx] == empty_key_) { // empty slot
return {ILLEGAL_BUCKET, insert_pos != ILLEGAL_BUCKET ? insert_pos : idx};
}
else if (key_[idx] == deleted_key_) { // tombstone
if (insert_pos == ILLEGAL_BUCKET) insert_pos = idx;
}
else if (key_[idx] == key) { // found
return {idx, ILLEGAL_BUCKET};
}
++num_probes;
idx = (idx + JUMP_(key, num_probes)) & (capacity_ - 1);
}
}
Usage Examples
// In LightDocSampler -- increment/decrement word-topic counts:
word_topic_table_[word].inc(old_topic, -1); // decrement old topic
word_topic_table_[word].inc(new_topic, +1); // increment new topic
// Query count for acceptance ratio:
int32_t n_kw = word_topic_table_[word][topic];
// Check if word has any topic assignments:
if (word_topic_table_[word].capacity() == 0) {
// word not in model
}
Related Pages
- Principle:Dotnet_Machinelearning_Hybrid_Dense_Sparse_Storage
- Implementation:Dotnet_Machinelearning_LdaEngine
- Implementation:Dotnet_Machinelearning_LdaDocumentSampler
- Implementation:Dotnet_Machinelearning_LdaHybridAliasMap
- Implementation:Dotnet_Machinelearning_LdaModelBlock
- Environment:Dotnet_Machinelearning_Native_Build_Toolchain