Implementation:Dotnet Machinelearning LdaLightHashMap

Knowledge Sources	Dotnet_Machinelearning
Domains	Topic_Modeling, Data_Structures, Hash_Tables
Last Updated	2026-02-09 12:00 GMT

Overview

light_hash_map is a lightweight, fixed-capacity hash table with quadratic probing, used to store local document-topic counts during LDA Gibbs sampling and inference.

Description

The light_hash_map class (in the lda namespace) provides a minimal hash table optimized for the specific use case of tracking per-document topic counts n(d,k) during LDA inference. It borrows design ideas from Google's dense_hash_map but with significant simplifications:

Design constraints:

Fixed capacity: The table never resizes. Capacity must be set at construction and should be at least twice the maximum number of entries to maintain a low load factor.
Power-of-2 capacity: Required for efficient modular arithmetic using bitwise AND.
int32_t key-value pairs only: No generic type support; both keys and values are 32-bit integers.
Optional memory ownership: Can allocate its own memory (own_memory_ = true) or use an external memory block via set_memory().
Fixed sentinel values: empty_key_ = 0, deleted_key_ = -2. Keys are stored as internal_key = topic_id + 1 to avoid collision with the empty sentinel.

Operations:

inc(key, delta): Finds the key's position via quadratic probing. If found, adds delta to the value; if the value becomes zero, the key is marked as deleted (tombstone). If not found, inserts the key with delta as the initial value.
operator[](key): Returns the value if the key exists, or 0 otherwise.
has(key): Returns true if the key is present in the table.
clear(): Zeros out all memory (both key and value arrays).
sort(): Not a traditional sort -- provides sorted access (used for output formatting).
size(): Linear scan counting entries with key > 0. This is O(capacity), not O(1).

Probing: Uses the same JUMP_ macro as hybrid_map for quadratic probing: idx = (idx + num_probes) & (capacity_ - 1).

Memory layout: The mem_block_ of size 2 * capacity_ is split into two halves: the first half is key_ and the second half is value_, each of size capacity_.

Primary use case: Each LightDocSampler has a doc_topic_counter_ of type light_hash_map with capacity 1024. At the start of each document (DocInit()), it is cleared and populated from the document's topic assignments via LDADocument::GetDocTopicCounter(). During sampling, it tracks real-time topic count changes via inc(old_topic, -1) and inc(new_topic, +1).

Usage

Used internally by LightDocSampler as the document-topic count table (doc_topic_counter_). Also used in LdaEngine::DumpDocTopicTable() for output. Not exposed to managed code.

Code Reference

Source Location

Repository: Dotnet_Machinelearning
File: src/Native/LdaNative/light_hash_map.h (189 lines)

Signature

namespace lda {
    class light_hash_map {
    public:
        light_hash_map();
        light_hash_map(int32_t* mem_block, int32_t capacity);
        light_hash_map(int32_t capacity);
        ~light_hash_map();

        void clear();
        void set_memory(int32_t* mem_block);
        void sort();

        int32_t capacity() const;
        int32_t size() const;           // O(capacity) scan
        int32_t* key() const;
        int32_t* value() const;
        bool has(int32_t key) const;

        void inc(int32_t key, int32_t delta);
        int32_t operator[](int32_t key);

    private:
        std::pair<int32_t, int32_t> find_position(const int32_t key) const;

        bool own_memory_;
        int32_t capacity_;       // must be power of 2
        int32_t* mem_block_;     // size = 2 * capacity_
        int32_t* key_;           // mem_block_[0..capacity_-1]
        int32_t* value_;         // mem_block_[capacity_..2*capacity_-1]
        int32_t empty_key_;      // = 0
        int32_t deleted_key_;    // = -2
    };
}

Import

// light_hash_map is an internal C++ class; not directly exposed via P/Invoke.
// Document-topic counts are accessed via LdaEngine's GetDocTopic exported function.

I/O Contract

Inputs

Name	Type	Required	Description
capacity	int32_t	Yes	Hash table capacity. Must be a power of 2. Should be >= 2x expected max entries.
mem_block	int32_t*	No	External memory block of size 2*capacity. If not provided, memory is self-allocated.
key (inc)	int32_t	Yes	Topic ID (0-based) to increment or query
delta (inc)	int32_t	Yes	Count change to apply (typically +1 or -1)

Outputs

Name	Type	Description
operator[] return	int32_t	Count for the queried topic, or 0 if not present
has() return	bool	Whether the topic exists in the table
size()	int32_t	Number of non-empty, non-deleted entries (computed via linear scan)
key()	int32_t*	Pointer to the key array (internal keys = topic_id + 1)
value()	int32_t*	Pointer to the value array (topic counts)

Key Differences from hybrid_map

Aspect	light_hash_map	hybrid_map
Modes	Sparse only	Dense or sparse (adaptive)
Memory ownership	Optional (own_memory_ flag)	Never owns (managed by LDAModelBlock)
Rehashing	None (no tombstone cleanup)	Automatic rehashing when tombstones > 5% capacity
Deleted key	-2	-1
Use case	Document-topic counts (per-doc, rebuilt each doc)	Word-topic counts (persistent across iterations)
External rehash buffer	Not needed	Required for sparse mode
Copy semantics	Deleted (no copy)	Copyable (shallow copy of pointers)

Usage Examples

// In LightDocSampler::DocInit -- rebuild doc-topic counter:
doc_topic_counter_.clear();
doc->GetDocTopicCounter(doc_topic_counter_);

// During sampling -- update counts:
doc_topic_counter_.inc(old_topic, -1);
doc_topic_counter_.inc(new_topic, +1);

// Query count for MH acceptance ratio:
float n_td_alpha = doc_topic_counter_[t] + alpha_;

// In GetDocTopic -- iterate over non-zero entries:
int32_t capacity = doc_topic_counter_.capacity();
int32_t* key = doc_topic_counter_.key();
int32_t* value = doc_topic_counter_.value();
for (int i = 0; i < capacity; ++i) {
    if (key[i] > 0) {
        // topic = key[i] - 1, count = value[i]
    }
}

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment