Principle:Dotnet Machinelearning Hybrid Dense Sparse Storage

Knowledge Sources	Dotnet_Machinelearning LightLDA: Big Topic Models on Modest Computer Clusters
Domains	Data_Structures, Topic_Modeling, Memory_Optimization
Last Updated	2026-02-09 12:00 GMT

Overview

Hybrid dense/sparse storage is an adaptive data structure strategy that selects between a dense array and a sparse hash table representation for each row of a matrix based on the row's fill ratio, optimizing both memory usage and access time for word-topic count matrices in topic models.

Description

In LDA topic modeling, the word-topic count matrix n(w, k) has V rows (one per word) and K columns (one per topic). The distribution of nonzero entries across rows is highly skewed:

High-frequency words (e.g., "the", "and") appear in many topics and have dense rows where most of the K entries are nonzero.
Low-frequency words (e.g., rare technical terms) appear in very few topics and have extremely sparse rows where only a handful of the K entries are nonzero.

Using a uniform storage strategy -- either all-dense or all-sparse -- leads to suboptimal performance:

All-dense: Wastes memory on sparse rows. For V=100K, K=1000, this means 400MB even when most entries are zero.
All-sparse: Incurs hash table overhead (probing, tombstone management) on dense rows where simple array indexing would be faster.

The hybrid approach used in ML.NET's LDA implementation makes a per-row decision at initialization time based on the word's term frequency (tf), which is a strong predictor of row density.

Usage

Use hybrid dense/sparse storage when:

The matrix has rows with widely varying numbers of nonzero entries
Both memory efficiency and access speed are important
The density of each row can be predicted or measured at initialization time
The matrix is accessed with both random lookups (operator[]) and sequential scans (for alias table building)

Theoretical Basis

Density Threshold Decision

For each word w with term frequency tf(w), the storage mode is selected as:

Algorithm: Hybrid Storage Mode Selection

Parameters:
  K = number of topics
  load_factor = 2 (hash table load factor)
  hot_thresh = K / (2 * load_factor)  = K / 4

For word-topic storage:
  If tf(w) >= hot_thresh:
    Mode = DENSE
    Capacity = K
    Memory = K * sizeof(int32_t)          // direct array
  Else if tf(w) > 0:
    Mode = SPARSE
    Capacity = next_power_of_2(load_factor * tf(w))
    Memory = 2 * Capacity * sizeof(int32_t)  // keys + values
  Else:
    Mode = DENSE (empty)
    Capacity = 0, Memory = 0

For alias table storage:
  alias_hot_thresh = (K * 2) / 3

  If tf(w) >= alias_hot_thresh:
    Mode = DENSE_ALIAS
    Capacity = K
    Memory = 2 * K * sizeof(int32_t)     // K pairs of (alias, boundary)
  Else if tf(w) > 0:
    Mode = SPARSE_ALIAS
    Capacity = tf(w)
    Memory = 3 * tf(w) * sizeof(int32_t) // pairs + indirection index
  Else:
    Capacity = 0, Memory = 0

Memory Analysis

Storage Mode	Memory per Row	Access Time	Best For
Dense array	K * 4 bytes	O(1) direct index	tf >= K/4 (many topics)
Sparse hash table	2 * next_pow2(2tf) 4 bytes	O(1) amortized (probing)	tf < K/4 (few topics)
All-dense baseline	K * 4 bytes always	O(1)	Reference (wasteful for sparse rows)
All-sparse baseline	2 * next_pow2(2tf) 4 bytes always	O(1) amortized	Reference (slow for dense rows)

For a typical scenario with K=1000 topics and a Zipfian word frequency distribution:

Top 5% of words (high tf): use dense storage, ~4KB per row
Bottom 95% of words (low tf): use sparse storage, ~16-256 bytes per row depending on tf
Memory savings: Often 5-10x compared to all-dense storage

Dense Mode Operations

// Dense: memory_[key] is the count for topic key
void inc(int32_t key, int32_t delta) {
    memory_[key] += delta;  // Direct array access
}
int32_t operator[](int32_t key) const {
    return memory_[key];    // O(1) lookup
}

Sparse Mode Operations

// Sparse: open-address hash table with quadratic probing
// Keys stored as (topic_id + 1) to distinguish from empty_key (0)
// Deleted entries marked with deleted_key (-1)

void inc(int32_t key, int32_t delta) {
    int32_t internal_key = key + 1;
    auto pos = find_position(internal_key);
    if (pos.first != ILLEGAL_BUCKET) {
        value_[pos.first] += delta;
        if (value_[pos.first] == 0) {
            key_[pos.first] = deleted_key_;
            ++num_deleted_key_;
            if (num_deleted_key_ * 20 > capacity_) {
                rehashing();  // Compact when tombstones > 5%
            }
        }
    } else {
        key_[pos.second] = internal_key;
        value_[pos.second] = delta;
    }
}

Quadratic Probing

The sparse hash table uses quadratic probing with the sequence:

${idx}_{i + 1} = ({idx}_{0} + 1 + 2 + \dots + i) mod C = ({idx}_{0} + \frac{i (i + 1)}{2}) mod C$

where C is the capacity (power of 2). This is implemented as:

#define JUMP_(key, num_probes)    ( num_probes )
// idx = (idx + JUMP_(key, num_probes)) & (capacity_ - 1)

Quadratic probing reduces clustering compared to linear probing while maintaining cache-friendly access patterns.

Tombstone Management

When a topic count decrements to zero, the entry becomes a tombstone (deleted_key_ = -1 or -2). Tombstones must be preserved for correct probing (they indicate the probe sequence continues), but too many tombstones degrade performance.

hybrid_map: Triggers rehashing() when tombstones exceed 5% of capacity (num_deleted_key_ * 20 > capacity_). Rehashing copies all live entries through an external buffer and re-inserts them into a clean table.

light_hash_map: Does not rehash (the table is cleared and rebuilt for each new document, so tombstones never accumulate significantly).

Unified Memory Allocation

A key optimization is that all hybrid_map and hybrid_alias_map instances share a single memory block allocated by LDAModelBlock. Each word's row is a view (pointer + offset) into this block. This provides:

Single allocation: One large malloc instead of V small ones, reducing heap fragmentation.
Cache locality: Frequently accessed words' data can be close in memory.
Simple serialization: The entire model can be saved/loaded as two memory blocks plus a dictionary of offsets.

LDAModelBlock memory layout:

mem_block_ (single allocation):
[Word0: K ints (dense)] [Word1: 2*cap ints (sparse)] [Word2: K ints (dense)] ...

alias_mem_block_ (single allocation):
[Word0: 2*K ints (dense)] [Word1: 3*tf ints (sparse)] [Word2: 2*K ints (dense)] ...

dict_ (WordEntry array):
[{offset=0, cap=K, is_dense=1}, {offset=K, cap=pow2, is_dense=0}, ...]

Training vs. Testing Storage Strategy

Phase	Storage Strategy	Rationale
Training	Hybrid (dense for hot words, sparse for cold)	Maximizes sampling speed for frequently encountered words while saving memory
Testing/Inference	Fully sparse (fullSparse=true)	After training, counts are known and typically sparse; saves memory when loading a trained model
Model serialization	Based on actual nonzero counts (CountNonZero + GetModelSizeByTFS)	Accurate sizing for the persisted model

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment