Implementation:Dotnet Machinelearning LdaHybridMap

Knowledge Sources	Dotnet_Machinelearning
Domains	Topic_Modeling, Data_Structures, Hash_Tables
Last Updated	2026-02-09 12:00 GMT

Overview

hybrid_map is an adaptive dense/sparse hash table that stores word-topic counts for a single word in the LDA model, switching between a dense array and a quadratic-probing hash table based on the word's term frequency.

Description

The hybrid_map class (in the lda namespace) provides a unified interface for accessing topic counts n(w,k) for a single word w across all K topics. The global word-topic matrix in LdaEngine is stored as a vector<hybrid_map> of size V (vocabulary size), where each element represents one row.

Two storage modes:

Dense mode (is_dense_ = 1):

The memory_ pointer points to a contiguous array of K int32_t values, where memory_[k] = n(w,k).
inc(key, delta): Simply does memory_[key] += delta.
operator[](key): Returns memory_[key] (or 0 if capacity is 0).
Used for "hot" words where tf >= num_topics_ / (2 * load_factor_), meaning the word appears in enough topics that a dense representation is more efficient.

Sparse mode (is_dense_ = 0):

The memory_ pointer points to a flat block of 2 * capacity_ int32_t values, split into key_ (first half) and value_ (second half).
Keys are stored as internal_key = topic_id + 1 to distinguish from the empty_key_ (0). The deleted_key_ is -1.
Quadratic probing: Defined by the macro JUMP_(key, num_probes) = num_probes. The probe sequence is: idx = key % capacity_, then idx = (idx + probe_count) & (capacity_ - 1).
inc(key, delta): Finds the key's position; if found, updates the value and deletes the entry if it becomes zero (setting key to deleted_key_). If deleted keys exceed 5% of capacity (num_deleted_key_ * 20 > capacity_), triggers rehashing(). If not found, inserts a new entry.
Automatic rehashing: When too many tombstones accumulate, rehashing() copies the table to external_rehash_buf_, clears memory_, and re-inserts all live entries. This buffer is allocated per-thread to avoid contention.
sorted_rehashing(): A more expensive variant using std::map to re-insert entries in sorted key order, used during model loading.
Capacity requirement: capacity_ must be a power of 2 (for the bitwise AND masking in probing), and should be at least 2x the expected number of entries (load_factor_ = 2).

Memory ownership: The hybrid_map does not own its memory. The backing store is allocated by LDAModelBlock as a single large mem_block_ array. Each hybrid_map gets a slice via pointer arithmetic: mem_block_ + dict_[word_id].offset_.

Usage

Every word in the vocabulary has one hybrid_map in the global_word_topic_table_. During Gibbs sampling, the sampler calls inc(topic, +1) and inc(topic, -1) to update counts. The operator[] is used to query counts for MH acceptance ratio computation. The build_alias_table in hybrid_alias_map iterates over a hybrid_map's entries to compute proportions.

Code Reference

Source Location

Repository: Dotnet_Machinelearning
File: src/Native/LdaNative/hybrid_map.cpp (142 lines)
File: src/Native/LdaNative/hybrid_map.h (238 lines)

Signature

#define JUMP_(key, num_probes)    ( num_probes )  // Quadratic probing
#define ILLEGAL_BUCKET -1

namespace lda {
    class hybrid_map {
        friend class hybrid_alias_map;
    public:
        hybrid_map();
        hybrid_map(int32_t* memory, int32_t is_dense, int32_t capacity,
                   int32_t num_deleted_key, int32_t* external_rehash_buf);
        hybrid_map(const hybrid_map& other);
        hybrid_map& operator=(const hybrid_map& other);

        void clear();
        std::string DumpString() const;
        void sorted_rehashing();

        int32_t nonzero_num() const;
        bool is_dense() const;
        int32_t capacity() const;
        int32_t* memory() const;
        int32_t* key() const;
        int32_t* value() const;

        void rehashing();
        void inc(int32_t key, int32_t delta);
        int32_t operator[](int32_t key) const;

    private:
        std::pair<int32_t, int32_t> find_position(const int32_t key) const;

        int32_t* memory_;
        int32_t is_dense_;
        int32_t* key_;
        int32_t* value_;
        int32_t capacity_;
        int32_t empty_key_;      // = 0
        int32_t deleted_key_;    // = -1
        int32_t num_deleted_key_;
        int32_t* external_rehash_buf_;
    };
}

Import

// hybrid_map is an internal C++ class; not directly exposed via P/Invoke.
// The managed code accesses word-topic data through LdaEngine's
// GetWordTopic/SetWordTopic exported functions.

I/O Contract

Inputs

Name	Type	Required	Description
memory	int32_t*	Yes	Pointer to backing memory (owned by LDAModelBlock)
is_dense	int32_t	Yes	1 for dense array mode, 0 for sparse hash table mode
capacity	int32_t	Yes	Dense: array length (K). Sparse: hash table size (power of 2, >= 2 * expected entries)
num_deleted_key	int32_t	No	Initial count of tombstone entries (typically 0)
external_rehash_buf	int32_t*	No	Per-thread buffer of size 2*capacity for rehashing (sparse mode only)
key (inc)	int32_t	Yes	Topic ID to increment (0-based)
delta (inc)	int32_t	Yes	Count change (+1 or -1 typically)

Outputs

Name	Type	Description
operator[] return	int32_t	Count n(w,k) for the queried topic k, or 0 if not present
nonzero_num()	int32_t	Number of topics with non-zero counts for this word
is_dense()	bool	Whether this row uses dense storage
DumpString()	string	Human-readable representation of all nonzero entries as "topic:count" pairs

Probing Algorithm

The find_position() method implements quadratic probing with tombstone handling:

// Returns pair<found_pos, insert_pos>:
//   found_pos != ILLEGAL_BUCKET  =>  key exists at found_pos
//   found_pos == ILLEGAL_BUCKET  =>  key not found, can insert at insert_pos

std::pair<int32_t, int32_t> find_position(const int32_t key) const {
    int num_probes = 0;
    int32_t idx = key % capacity_;
    int32_t insert_pos = ILLEGAL_BUCKET;
    while (1) {
        if (key_[idx] == empty_key_) {        // empty slot
            return {ILLEGAL_BUCKET, insert_pos != ILLEGAL_BUCKET ? insert_pos : idx};
        }
        else if (key_[idx] == deleted_key_) { // tombstone
            if (insert_pos == ILLEGAL_BUCKET) insert_pos = idx;
        }
        else if (key_[idx] == key) {          // found
            return {idx, ILLEGAL_BUCKET};
        }
        ++num_probes;
        idx = (idx + JUMP_(key, num_probes)) & (capacity_ - 1);
    }
}

Usage Examples

// In LightDocSampler -- increment/decrement word-topic counts:
word_topic_table_[word].inc(old_topic, -1);  // decrement old topic
word_topic_table_[word].inc(new_topic, +1);  // increment new topic

// Query count for acceptance ratio:
int32_t n_kw = word_topic_table_[word][topic];

// Check if word has any topic assignments:
if (word_topic_table_[word].capacity() == 0) {
    // word not in model
}

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment