Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Dotnet Machinelearning LdaLightHashMap

From Leeroopedia


Knowledge Sources
Domains Topic_Modeling, Data_Structures, Hash_Tables
Last Updated 2026-02-09 12:00 GMT

Overview

light_hash_map is a lightweight, fixed-capacity hash table with quadratic probing, used to store local document-topic counts during LDA Gibbs sampling and inference.

Description

The light_hash_map class (in the lda namespace) provides a minimal hash table optimized for the specific use case of tracking per-document topic counts n(d,k) during LDA inference. It borrows design ideas from Google's dense_hash_map but with significant simplifications:

Design constraints:

  • Fixed capacity: The table never resizes. Capacity must be set at construction and should be at least twice the maximum number of entries to maintain a low load factor.
  • Power-of-2 capacity: Required for efficient modular arithmetic using bitwise AND.
  • int32_t key-value pairs only: No generic type support; both keys and values are 32-bit integers.
  • Optional memory ownership: Can allocate its own memory (own_memory_ = true) or use an external memory block via set_memory().
  • Fixed sentinel values: empty_key_ = 0, deleted_key_ = -2. Keys are stored as internal_key = topic_id + 1 to avoid collision with the empty sentinel.

Operations:

  • inc(key, delta): Finds the key's position via quadratic probing. If found, adds delta to the value; if the value becomes zero, the key is marked as deleted (tombstone). If not found, inserts the key with delta as the initial value.
  • operator[](key): Returns the value if the key exists, or 0 otherwise.
  • has(key): Returns true if the key is present in the table.
  • clear(): Zeros out all memory (both key and value arrays).
  • sort(): Not a traditional sort -- provides sorted access (used for output formatting).
  • size(): Linear scan counting entries with key > 0. This is O(capacity), not O(1).

Probing: Uses the same JUMP_ macro as hybrid_map for quadratic probing: idx = (idx + num_probes) & (capacity_ - 1).

Memory layout: The mem_block_ of size 2 * capacity_ is split into two halves: the first half is key_ and the second half is value_, each of size capacity_.

Primary use case: Each LightDocSampler has a doc_topic_counter_ of type light_hash_map with capacity 1024. At the start of each document (DocInit()), it is cleared and populated from the document's topic assignments via LDADocument::GetDocTopicCounter(). During sampling, it tracks real-time topic count changes via inc(old_topic, -1) and inc(new_topic, +1).

Usage

Used internally by LightDocSampler as the document-topic count table (doc_topic_counter_). Also used in LdaEngine::DumpDocTopicTable() for output. Not exposed to managed code.

Code Reference

Source Location

Signature

namespace lda {
    class light_hash_map {
    public:
        light_hash_map();
        light_hash_map(int32_t* mem_block, int32_t capacity);
        light_hash_map(int32_t capacity);
        ~light_hash_map();

        void clear();
        void set_memory(int32_t* mem_block);
        void sort();

        int32_t capacity() const;
        int32_t size() const;           // O(capacity) scan
        int32_t* key() const;
        int32_t* value() const;
        bool has(int32_t key) const;

        void inc(int32_t key, int32_t delta);
        int32_t operator[](int32_t key);

    private:
        std::pair<int32_t, int32_t> find_position(const int32_t key) const;

        bool own_memory_;
        int32_t capacity_;       // must be power of 2
        int32_t* mem_block_;     // size = 2 * capacity_
        int32_t* key_;           // mem_block_[0..capacity_-1]
        int32_t* value_;         // mem_block_[capacity_..2*capacity_-1]
        int32_t empty_key_;      // = 0
        int32_t deleted_key_;    // = -2
    };
}

Import

// light_hash_map is an internal C++ class; not directly exposed via P/Invoke.
// Document-topic counts are accessed via LdaEngine's GetDocTopic exported function.

I/O Contract

Inputs

Name Type Required Description
capacity int32_t Yes Hash table capacity. Must be a power of 2. Should be >= 2x expected max entries.
mem_block int32_t* No External memory block of size 2*capacity. If not provided, memory is self-allocated.
key (inc) int32_t Yes Topic ID (0-based) to increment or query
delta (inc) int32_t Yes Count change to apply (typically +1 or -1)

Outputs

Name Type Description
operator[] return int32_t Count for the queried topic, or 0 if not present
has() return bool Whether the topic exists in the table
size() int32_t Number of non-empty, non-deleted entries (computed via linear scan)
key() int32_t* Pointer to the key array (internal keys = topic_id + 1)
value() int32_t* Pointer to the value array (topic counts)

Key Differences from hybrid_map

Aspect light_hash_map hybrid_map
Modes Sparse only Dense or sparse (adaptive)
Memory ownership Optional (own_memory_ flag) Never owns (managed by LDAModelBlock)
Rehashing None (no tombstone cleanup) Automatic rehashing when tombstones > 5% capacity
Deleted key -2 -1
Use case Document-topic counts (per-doc, rebuilt each doc) Word-topic counts (persistent across iterations)
External rehash buffer Not needed Required for sparse mode
Copy semantics Deleted (no copy) Copyable (shallow copy of pointers)

Usage Examples

// In LightDocSampler::DocInit -- rebuild doc-topic counter:
doc_topic_counter_.clear();
doc->GetDocTopicCounter(doc_topic_counter_);

// During sampling -- update counts:
doc_topic_counter_.inc(old_topic, -1);
doc_topic_counter_.inc(new_topic, +1);

// Query count for MH acceptance ratio:
float n_td_alpha = doc_topic_counter_[t] + alpha_;

// In GetDocTopic -- iterate over non-zero entries:
int32_t capacity = doc_topic_counter_.capacity();
int32_t* key = doc_topic_counter_.key();
int32_t* value = doc_topic_counter_.value();
for (int i = 0; i < capacity; ++i) {
    if (key[i] > 0) {
        // topic = key[i] - 1, count = value[i]
    }
}

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment