Heuristic:Vespa engine Vespa KStemmer Dictionary Loading
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Memory optimization strategy for KStemmer dictionary using static initialization, 8-file word list split, shared default entries, and selective caching of non-exception words.
Description
The KStemmer (Krovetz Stemmer) implementation loads its entire English word dictionary into a CharArrayMap during class loading via static initialization. The dictionary is split across 8 separate data files (KStemData1-8.java) generated from an external "head_word_list.txt". This design keeps individual class file sizes manageable. The stemmer uses a shared defaultEntry instance for most words and only caches non-exception dictionary entries to avoid stale results from special-case words.
Usage
Apply this heuristic when working with the KStemmer implementation or when designing similar dictionary-backed NLP components. Understanding the loading pattern is critical for: diagnosing startup performance, memory profiling, and extending the stemmer with additional languages or dictionaries.
The Insight (Rule of Thumb)
- Action 1: The dictionary is split into 8 Java files (KStemData1-8), each holding static String arrays.
- Value: The initial CharArrayMap capacity is 1000 entries; the map grows dynamically as words are loaded.
- Action 2: Use a shared defaultEntry instance (non-exception, null root) for most dictionary words to save memory.
- Action 3: Only cache non-exception dictionary entries in matchedEntry; exception words bypass caching.
- Action 4: Very short and very long words are not stemmed: stem() returns early when k <= 1 or k >= MaxWordLen - 1 (49).
- Trade-off: Eager static initialization loads all 8 data files at class load time, which adds latency on first use but leaves no initialization cost for later calls.
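Actions 1 and 2 and the trade-off can be illustrated with a minimal sketch. It is not the Vespa code: a plain HashMap stands in for CharArrayMap, two tiny holder classes stand in for KStemData1-8, and every name and word in it (ChunkedDictionarySketch, Part1, Part2, lookup, the sample words) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: HashMap replaces CharArrayMap, and Part1/Part2 are
// hypothetical stand-ins for the generated KStemData1..8 classes.
final class ChunkedDictionarySketch {

    static final class Entry {
        final String root;       // null means the word is its own root
        final boolean exception; // true for special-case words
        Entry(String root, boolean exception) {
            this.root = root;
            this.exception = exception;
        }
    }

    // Each holder class carries one slice of the word list.
    static final class Part1 { static final String[] WORDS = {"apple", "banana"}; }
    static final class Part2 { static final String[] WORDS = {"cherry", "damson"}; }

    // One shared instance for every ordinary word (non-exception, null root).
    static final Entry DEFAULT_ENTRY = new Entry(null, false);

    // Built once during class initialization, before any lookup can run.
    private static final Map<String, Entry> DICT = load();

    private static Map<String, Entry> load() {
        Map<String, Entry> d = new HashMap<>(1000);
        // Illustrative exception word: it gets its own entry object.
        d.put("aide", new Entry("aide", true));
        for (String[] part : new String[][] {Part1.WORDS, Part2.WORDS}) {
            for (String w : part) {
                d.putIfAbsent(w, DEFAULT_ENTRY); // no per-word allocation
            }
        }
        return d;
    }

    static Entry lookup(String w) { return DICT.get(w); }

    static int size() { return DICT.size(); }

    public static void main(String[] args) {
        // Ordinary words from different parts share the same Entry object.
        System.out.println(lookup("apple") == lookup("cherry"));
        System.out.println(lookup("aide").exception);
    }
}
```

In this sketch the first call to lookup pays for loading every part, and all later calls are plain map reads, which mirrors the first-use latency trade-off described above.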
Reasoning
The 8-file split keeps each Java class file within JVM class-file limits (for example, the 64 KB bytecode limit per method, which also applies to the static initializer that populates a large array) and improves compilation and IDE performance. Static initialization gives thread-safe, one-time loading without explicit synchronization, because the JVM runs a class initializer exactly once. The shared defaultEntry avoids allocating a separate object for every non-exception dictionary word. Exception words are never cached in matchedEntry, which prevents stale results when the same KStemmer instance processes multiple words (exception words need a fresh lookup each time because their root may vary by context). The word-length guard avoids spending CPU on trivially short strings or implausibly long strings that have no valid stems.
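The selective-caching rule can be sketched in isolation. The following is a simplified model, not the Vespa implementation: it uses a HashMap and String keys instead of CharArrayMap and a character buffer, and the names SelectiveCacheSketch, startWord, and mapLookups, as well as the sample words, are hypothetical. It assumes the cached entry is cleared before each new word.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: models a per-instance cache that stores non-exception
// dictionary hits and always re-reads the map for exception words.
final class SelectiveCacheSketch {

    static final class Entry {
        final String root;
        final boolean exception;
        Entry(String root, boolean exception) {
            this.root = root;
            this.exception = exception;
        }
    }

    private static final Map<String, Entry> DICT = new HashMap<>();
    static {
        DICT.put("walk", new Entry(null, false)); // ordinary word (illustrative)
        DICT.put("aide", new Entry("aide", true)); // exception word (illustrative)
    }

    private Entry matchedEntry; // cache for the word currently being processed
    int mapLookups = 0;         // counts real map reads, for demonstration only

    // Hypothetical hook: assumed to run before each new input word.
    void startWord() { matchedEntry = null; }

    Entry wordInDict(String current) {
        if (matchedEntry != null) return matchedEntry; // cached hit, no map read
        mapLookups++;
        Entry e = DICT.get(current);
        if (e != null && !e.exception) {
            matchedEntry = e; // cache only non-exception entries
        }
        return e;
    }

    public static void main(String[] args) {
        SelectiveCacheSketch s = new SelectiveCacheSketch();
        s.startWord();
        s.wordInDict("walk");
        s.wordInDict("walk"); // served from matchedEntry
        s.startWord();
        s.wordInDict("aide");
        s.wordInDict("aide"); // exception word: map is read again
        System.out.println(s.mapLookups);
    }
}
```

In this model a repeated check on an ordinary word costs one map read, while every check on an exception word goes back to the map.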
Code Evidence
Static initialization from KStemmer.java:178:
private static final CharArrayMap<DictEntry> dict_ht = initializeDictHash();
Initial capacity from KStemmer.java:220:
CharArrayMap<DictEntry> d = new CharArrayMap<>(1000, false);
Shared default entry from KStemmer.java:251:
defaultEntry = new DictEntry(null, false);
Selective caching from KStemmer.java:413-420:
private DictEntry wordInDict() {
    if (matchedEntry != null) return matchedEntry;
    DictEntry e = dict_ht.get(word.getArray(), 0, word.length());
    if (e != null && !e.exception) {
        matchedEntry = e; // only cache if it's not an exception.
    }
    return e;
}
Word length constraint from KStemmer.java:15, 1328:
static private final int MaxWordLen = 50;
// In stem():
if ((k <= 1) || (k >= MaxWordLen - 1)) {
    return false; // don't stem
}
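The guard is written in terms of k, and the snippet does not show how k is derived from the input. The sketch below works through the boundary arithmetic under one explicit assumption: that k is the index of the last character, that is, length minus one. If k were the length itself, every boundary would shift by one. The class and method names are hypothetical.

```java
// Sketch only: evaluates the length guard for a given input length.
final class LengthGuardSketch {

    static final int MAX_WORD_LEN = 50; // mirrors MaxWordLen in the snippet

    // ASSUMPTION: k is the index of the last character (len - 1).
    // The quoted snippet does not confirm how k is computed.
    static boolean isStemCandidate(int len) {
        int k = len - 1;
        boolean skip = (k <= 1) || (k >= MAX_WORD_LEN - 1);
        return !skip;
    }

    public static void main(String[] args) {
        for (int len : new int[] {1, 2, 3, 49, 50}) {
            System.out.println(len + " -> " + isStemCandidate(len));
        }
    }
}
```

Under that assumption, inputs of length 3 through 49 pass the guard, while lengths of 2 or less and 50 or more are returned unstemmed.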