Implementation:Dotnet Machinelearning LdaLightHashMap
| Knowledge Sources | |
|---|---|
| Domains | Topic_Modeling, Data_Structures, Hash_Tables |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
light_hash_map is a lightweight, fixed-capacity hash table with quadratic probing, used to store local document-topic counts during LDA Gibbs sampling and inference.
Description
The light_hash_map class (in the lda namespace) provides a minimal hash table optimized for the specific use case of tracking per-document topic counts n(d,k) during LDA inference. It borrows design ideas from Google's dense_hash_map but with significant simplifications:
Design constraints:
- Fixed capacity: The table never resizes. Capacity must be set at construction and should be at least twice the maximum number of entries to maintain a low load factor.
- Power-of-2 capacity: Required for efficient modular arithmetic using bitwise AND.
- int32_t key-value pairs only: No generic type support; both keys and values are 32-bit integers.
- Optional memory ownership: Can allocate its own memory (own_memory_ = true) or use an external memory block via set_memory().
- Fixed sentinel values: empty_key_ = 0, deleted_key_ = -2. Keys are stored as internal_key = topic_id + 1 to avoid collision with the empty sentinel.
Operations:
- inc(key, delta): Finds the key's position via quadratic probing. If found, adds delta to the value; if the value becomes zero, the key is marked as deleted (tombstone). If not found, inserts the key with delta as the initial value.
- operator[](key): Returns the value if the key exists, or 0 otherwise.
- has(key): Returns true if the key is present in the table.
- clear(): Zeros out all memory (both key and value arrays).
- sort(): Not a traditional sort -- provides sorted access (used for output formatting).
- size(): Linear scan counting entries with key > 0. This is O(capacity), not O(1).
Probing: Uses the same JUMP_ macro as hybrid_map for quadratic probing: idx = (idx + num_probes) & (capacity_ - 1).
Memory layout: The mem_block_ of size 2 * capacity_ is split into two halves: the first half is key_ and the second half is value_, each of size capacity_.
Primary use case: Each LightDocSampler has a doc_topic_counter_ of type light_hash_map with capacity 1024. At the start of each document (DocInit()), it is cleared and populated from the document's topic assignments via LDADocument::GetDocTopicCounter(). During sampling, it tracks real-time topic count changes via inc(old_topic, -1) and inc(new_topic, +1).
Usage
Used internally by LightDocSampler as the document-topic count table (doc_topic_counter_). Also used in LdaEngine::DumpDocTopicTable() for output. Not exposed to managed code.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/LdaNative/light_hash_map.h (189 lines)
Signature
namespace lda {
class light_hash_map {
public:
light_hash_map();
light_hash_map(int32_t* mem_block, int32_t capacity);
light_hash_map(int32_t capacity);
~light_hash_map();
void clear();
void set_memory(int32_t* mem_block);
void sort();
int32_t capacity() const;
int32_t size() const; // O(capacity) scan
int32_t* key() const;
int32_t* value() const;
bool has(int32_t key) const;
void inc(int32_t key, int32_t delta);
int32_t operator[](int32_t key);
private:
std::pair<int32_t, int32_t> find_position(const int32_t key) const;
bool own_memory_;
int32_t capacity_; // must be power of 2
int32_t* mem_block_; // size = 2 * capacity_
int32_t* key_; // mem_block_[0..capacity_-1]
int32_t* value_; // mem_block_[capacity_..2*capacity_-1]
int32_t empty_key_; // = 0
int32_t deleted_key_; // = -2
};
}
Import
// light_hash_map is an internal C++ class; not directly exposed via P/Invoke.
// Document-topic counts are accessed via LdaEngine's GetDocTopic exported function.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| capacity | int32_t | Yes | Hash table capacity. Must be a power of 2. Should be >= 2x expected max entries. |
| mem_block | int32_t* | No | External memory block of size 2*capacity. If not provided, memory is self-allocated. |
| key (inc) | int32_t | Yes | Topic ID (0-based) to increment or query |
| delta (inc) | int32_t | Yes | Count change to apply (typically +1 or -1) |
Outputs
| Name | Type | Description |
|---|---|---|
| operator[] return | int32_t | Count for the queried topic, or 0 if not present |
| has() return | bool | Whether the topic exists in the table |
| size() | int32_t | Number of non-empty, non-deleted entries (computed via linear scan) |
| key() | int32_t* | Pointer to the key array (internal keys = topic_id + 1) |
| value() | int32_t* | Pointer to the value array (topic counts) |
Key Differences from hybrid_map
| Aspect | light_hash_map | hybrid_map |
|---|---|---|
| Modes | Sparse only | Dense or sparse (adaptive) |
| Memory ownership | Optional (own_memory_ flag) | Never owns (managed by LDAModelBlock) |
| Rehashing | None (no tombstone cleanup) | Automatic rehashing when tombstones > 5% capacity |
| Deleted key | -2 | -1 |
| Use case | Document-topic counts (per-doc, rebuilt each doc) | Word-topic counts (persistent across iterations) |
| External rehash buffer | Not needed | Required for sparse mode |
| Copy semantics | Deleted (no copy) | Copyable (shallow copy of pointers) |
Usage Examples
// In LightDocSampler::DocInit -- rebuild doc-topic counter:
doc_topic_counter_.clear();
doc->GetDocTopicCounter(doc_topic_counter_);
// During sampling -- update counts:
doc_topic_counter_.inc(old_topic, -1);
doc_topic_counter_.inc(new_topic, +1);
// Query count for MH acceptance ratio:
float n_td_alpha = doc_topic_counter_[t] + alpha_;
// In GetDocTopic -- iterate over non-zero entries:
int32_t capacity = doc_topic_counter_.capacity();
int32_t* key = doc_topic_counter_.key();
int32_t* value = doc_topic_counter_.value();
for (int i = 0; i < capacity; ++i) {
if (key[i] > 0) {
// topic = key[i] - 1, count = value[i]
}
}