Principle:Dotnet Machinelearning Hybrid Dense Sparse Storage
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Topic_Modeling, Memory_Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Hybrid dense/sparse storage is an adaptive data structure strategy that selects between a dense array and a sparse hash table representation for each row of a matrix based on the row's fill ratio, optimizing both memory usage and access time for word-topic count matrices in topic models.
Description
In LDA topic modeling, the word-topic count matrix n(w, k) has V rows (one per word) and K columns (one per topic). The distribution of nonzero entries across rows is highly skewed:
- High-frequency words (e.g., "the", "and") appear in many topics and have dense rows where most of the K entries are nonzero.
- Low-frequency words (e.g., rare technical terms) appear in very few topics and have extremely sparse rows where only a handful of the K entries are nonzero.
Using a uniform storage strategy -- either all-dense or all-sparse -- leads to suboptimal performance:
- All-dense: Wastes memory on sparse rows. For V=100K, K=1000, this means 400MB even when most entries are zero.
- All-sparse: Incurs hash table overhead (probing, tombstone management) on dense rows where simple array indexing would be faster.
The hybrid approach used in ML.NET's LDA implementation makes a per-row decision at initialization time based on the word's term frequency (tf), which is a strong predictor of row density.
Usage
Use hybrid dense/sparse storage when:
- The matrix has rows with widely varying numbers of nonzero entries
- Both memory efficiency and access speed are important
- The density of each row can be predicted or measured at initialization time
- The matrix is accessed with both random lookups (operator[]) and sequential scans (for alias table building)
Theoretical Basis
Density Threshold Decision
For each word w with term frequency tf(w), the storage mode is selected as:
Algorithm: Hybrid Storage Mode Selection
Parameters:
K = number of topics
load_factor = 2 (hash table load factor)
hot_thresh = K / (2 * load_factor) = K / 4
For word-topic storage:
If tf(w) >= hot_thresh:
Mode = DENSE
Capacity = K
Memory = K * sizeof(int32_t) // direct array
Else if tf(w) > 0:
Mode = SPARSE
Capacity = next_power_of_2(load_factor * tf(w))
Memory = 2 * Capacity * sizeof(int32_t) // keys + values
Else:
Mode = DENSE (empty)
Capacity = 0, Memory = 0
For alias table storage:
alias_hot_thresh = (K * 2) / 3
If tf(w) >= alias_hot_thresh:
Mode = DENSE_ALIAS
Capacity = K
Memory = 2 * K * sizeof(int32_t) // K pairs of (alias, boundary)
Else if tf(w) > 0:
Mode = SPARSE_ALIAS
Capacity = tf(w)
Memory = 3 * tf(w) * sizeof(int32_t) // pairs + indirection index
Else:
Capacity = 0, Memory = 0
Memory Analysis
| Storage Mode | Memory per Row | Access Time | Best For |
|---|---|---|---|
| Dense array | K * 4 bytes | O(1) direct index | tf >= K/4 (many topics) |
| Sparse hash table | 2 * next_pow2(2*tf) * 4 bytes | O(1) amortized (probing) | tf < K/4 (few topics) |
| All-dense baseline | K * 4 bytes always | O(1) | Reference (wasteful for sparse rows) |
| All-sparse baseline | 2 * next_pow2(2*tf) * 4 bytes always | O(1) amortized | Reference (slow for dense rows) |
For a typical scenario with K=1000 topics and a Zipfian word frequency distribution:
- Top 5% of words (high tf): use dense storage, ~4KB per row
- Bottom 95% of words (low tf): use sparse storage, ~16-256 bytes per row depending on tf
- Memory savings: Often 5-10x compared to all-dense storage
Dense Mode Operations
// Dense: memory_[key] is the count for topic key
void inc(int32_t key, int32_t delta) {
memory_[key] += delta; // Direct array access
}
int32_t operator[](int32_t key) const {
return memory_[key]; // O(1) lookup
}
Sparse Mode Operations
// Sparse: open-address hash table with quadratic probing
// Keys stored as (topic_id + 1) to distinguish from empty_key (0)
// Deleted entries marked with deleted_key (-1)
void inc(int32_t key, int32_t delta) {
int32_t internal_key = key + 1;
auto pos = find_position(internal_key);
if (pos.first != ILLEGAL_BUCKET) {
value_[pos.first] += delta;
if (value_[pos.first] == 0) {
key_[pos.first] = deleted_key_;
++num_deleted_key_;
if (num_deleted_key_ * 20 > capacity_) {
rehashing(); // Compact when tombstones > 5%
}
}
} else {
key_[pos.second] = internal_key;
value_[pos.second] = delta;
}
}
Quadratic Probing
The sparse hash table uses quadratic probing with the sequence:
where C is the capacity (power of 2). This is implemented as:
#define JUMP_(key, num_probes) ( num_probes )
// idx = (idx + JUMP_(key, num_probes)) & (capacity_ - 1)
Quadratic probing reduces clustering compared to linear probing while maintaining cache-friendly access patterns.
Tombstone Management
When a topic count decrements to zero, the entry becomes a tombstone (deleted_key_ = -1 or -2). Tombstones must be preserved for correct probing (they indicate the probe sequence continues), but too many tombstones degrade performance.
hybrid_map: Triggers rehashing() when tombstones exceed 5% of capacity (num_deleted_key_ * 20 > capacity_). Rehashing copies all live entries through an external buffer and re-inserts them into a clean table.
light_hash_map: Does not rehash (the table is cleared and rebuilt for each new document, so tombstones never accumulate significantly).
Unified Memory Allocation
A key optimization is that all hybrid_map and hybrid_alias_map instances share a single memory block allocated by LDAModelBlock. Each word's row is a view (pointer + offset) into this block. This provides:
- Single allocation: One large malloc instead of V small ones, reducing heap fragmentation.
- Cache locality: Frequently accessed words' data can be close in memory.
- Simple serialization: The entire model can be saved/loaded as two memory blocks plus a dictionary of offsets.
LDAModelBlock memory layout:
mem_block_ (single allocation):
[Word0: K ints (dense)] [Word1: 2*cap ints (sparse)] [Word2: K ints (dense)] ...
alias_mem_block_ (single allocation):
[Word0: 2*K ints (dense)] [Word1: 3*tf ints (sparse)] [Word2: 2*K ints (dense)] ...
dict_ (WordEntry array):
[{offset=0, cap=K, is_dense=1}, {offset=K, cap=pow2, is_dense=0}, ...]
Training vs. Testing Storage Strategy
| Phase | Storage Strategy | Rationale |
|---|---|---|
| Training | Hybrid (dense for hot words, sparse for cold) | Maximizes sampling speed for frequently encountered words while saving memory |
| Testing/Inference | Fully sparse (fullSparse=true) | After training, counts are known and typically sparse; saves memory when loading a trained model |
| Model serialization | Based on actual nonzero counts (CountNonZero + GetModelSizeByTFS) | Accurate sizing for the persisted model |
Related Pages
- Implementation:Dotnet_Machinelearning_LdaHybridMap
- Implementation:Dotnet_Machinelearning_LdaHybridAliasMap
- Implementation:Dotnet_Machinelearning_LdaModelBlock
- Implementation:Dotnet_Machinelearning_LdaLightHashMap
- Principle:Dotnet_Machinelearning_Latent_Dirichlet_Allocation
- Principle:Dotnet_Machinelearning_Alias_Method_Sampling