Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Dotnet Machinelearning Hybrid Dense Sparse Storage

From Leeroopedia


Knowledge Sources
Domains Data_Structures, Topic_Modeling, Memory_Optimization
Last Updated 2026-02-09 12:00 GMT

Overview

Hybrid dense/sparse storage is an adaptive data structure strategy that selects between a dense array and a sparse hash table representation for each row of a matrix based on the row's fill ratio, optimizing both memory usage and access time for word-topic count matrices in topic models.

Description

In LDA topic modeling, the word-topic count matrix n(w, k) has V rows (one per word) and K columns (one per topic). The distribution of nonzero entries across rows is highly skewed:

  • High-frequency words (e.g., "the", "and") appear in many topics and have dense rows where most of the K entries are nonzero.
  • Low-frequency words (e.g., rare technical terms) appear in very few topics and have extremely sparse rows where only a handful of the K entries are nonzero.

Using a uniform storage strategy -- either all-dense or all-sparse -- leads to suboptimal performance:

  • All-dense: Wastes memory on sparse rows. For V=100K, K=1000, this means 400MB even when most entries are zero.
  • All-sparse: Incurs hash table overhead (probing, tombstone management) on dense rows where simple array indexing would be faster.

The hybrid approach used in ML.NET's LDA implementation makes a per-row decision at initialization time based on the word's term frequency (tf), which is a strong predictor of row density.

Usage

Use hybrid dense/sparse storage when:

  • The matrix has rows with widely varying numbers of nonzero entries
  • Both memory efficiency and access speed are important
  • The density of each row can be predicted or measured at initialization time
  • The matrix is accessed with both random lookups (operator[]) and sequential scans (for alias table building)

Theoretical Basis

Density Threshold Decision

For each word w with term frequency tf(w), the storage mode is selected as:

Algorithm: Hybrid Storage Mode Selection

Parameters:
  K = number of topics
  load_factor = 2 (hash table load factor)
  hot_thresh = K / (2 * load_factor)  = K / 4

For word-topic storage:
  If tf(w) >= hot_thresh:
    Mode = DENSE
    Capacity = K
    Memory = K * sizeof(int32_t)          // direct array
  Else if tf(w) > 0:
    Mode = SPARSE
    Capacity = next_power_of_2(load_factor * tf(w))
    Memory = 2 * Capacity * sizeof(int32_t)  // keys + values
  Else:
    Mode = DENSE (empty)
    Capacity = 0, Memory = 0

For alias table storage:
  alias_hot_thresh = (K * 2) / 3

  If tf(w) >= alias_hot_thresh:
    Mode = DENSE_ALIAS
    Capacity = K
    Memory = 2 * K * sizeof(int32_t)     // K pairs of (alias, boundary)
  Else if tf(w) > 0:
    Mode = SPARSE_ALIAS
    Capacity = tf(w)
    Memory = 3 * tf(w) * sizeof(int32_t) // pairs + indirection index
  Else:
    Capacity = 0, Memory = 0

Memory Analysis

Storage Mode Memory per Row Access Time Best For
Dense array K * 4 bytes O(1) direct index tf >= K/4 (many topics)
Sparse hash table 2 * next_pow2(2*tf) * 4 bytes O(1) amortized (probing) tf < K/4 (few topics)
All-dense baseline K * 4 bytes always O(1) Reference (wasteful for sparse rows)
All-sparse baseline 2 * next_pow2(2*tf) * 4 bytes always O(1) amortized Reference (slow for dense rows)

For a typical scenario with K=1000 topics and a Zipfian word frequency distribution:

  • Top 5% of words (high tf): use dense storage, ~4KB per row
  • Bottom 95% of words (low tf): use sparse storage, ~16-256 bytes per row depending on tf
  • Memory savings: Often 5-10x compared to all-dense storage

Dense Mode Operations

// Dense: memory_[key] is the count for topic key
void inc(int32_t key, int32_t delta) {
    memory_[key] += delta;  // Direct array access
}
int32_t operator[](int32_t key) const {
    return memory_[key];    // O(1) lookup
}

Sparse Mode Operations

// Sparse: open-address hash table with quadratic probing
// Keys stored as (topic_id + 1) to distinguish from empty_key (0)
// Deleted entries marked with deleted_key (-1)

void inc(int32_t key, int32_t delta) {
    int32_t internal_key = key + 1;
    auto pos = find_position(internal_key);
    if (pos.first != ILLEGAL_BUCKET) {
        value_[pos.first] += delta;
        if (value_[pos.first] == 0) {
            key_[pos.first] = deleted_key_;
            ++num_deleted_key_;
            if (num_deleted_key_ * 20 > capacity_) {
                rehashing();  // Compact when tombstones > 5%
            }
        }
    } else {
        key_[pos.second] = internal_key;
        value_[pos.second] = delta;
    }
}

Quadratic Probing

The sparse hash table uses quadratic probing with the sequence:

idxi+1=(idx0+1+2++i)modC=(idx0+i(i+1)2)modC

where C is the capacity (power of 2). This is implemented as:

#define JUMP_(key, num_probes)    ( num_probes )
// idx = (idx + JUMP_(key, num_probes)) & (capacity_ - 1)

Quadratic probing reduces clustering compared to linear probing while maintaining cache-friendly access patterns.

Tombstone Management

When a topic count decrements to zero, the entry becomes a tombstone (deleted_key_ = -1 or -2). Tombstones must be preserved for correct probing (they indicate the probe sequence continues), but too many tombstones degrade performance.

hybrid_map: Triggers rehashing() when tombstones exceed 5% of capacity (num_deleted_key_ * 20 > capacity_). Rehashing copies all live entries through an external buffer and re-inserts them into a clean table.

light_hash_map: Does not rehash (the table is cleared and rebuilt for each new document, so tombstones never accumulate significantly).

Unified Memory Allocation

A key optimization is that all hybrid_map and hybrid_alias_map instances share a single memory block allocated by LDAModelBlock. Each word's row is a view (pointer + offset) into this block. This provides:

  • Single allocation: One large malloc instead of V small ones, reducing heap fragmentation.
  • Cache locality: Frequently accessed words' data can be close in memory.
  • Simple serialization: The entire model can be saved/loaded as two memory blocks plus a dictionary of offsets.
LDAModelBlock memory layout:

mem_block_ (single allocation):
[Word0: K ints (dense)] [Word1: 2*cap ints (sparse)] [Word2: K ints (dense)] ...

alias_mem_block_ (single allocation):
[Word0: 2*K ints (dense)] [Word1: 3*tf ints (sparse)] [Word2: 2*K ints (dense)] ...

dict_ (WordEntry array):
[{offset=0, cap=K, is_dense=1}, {offset=K, cap=pow2, is_dense=0}, ...]

Training vs. Testing Storage Strategy

Phase Storage Strategy Rationale
Training Hybrid (dense for hot words, sparse for cold) Maximizes sampling speed for frequently encountered words while saving memory
Testing/Inference Fully sparse (fullSparse=true) After training, counts are known and typically sparse; saves memory when loading a trained model
Model serialization Based on actual nonzero counts (CountNonZero + GetModelSizeByTFS) Accurate sizing for the persisted model

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment