Heuristic:Vespa engine Vespa KStemmer Dictionary Loading

Knowledge Sources	Vespa Engine KStemmer.java source analysis
Domains	Optimization, NLP
Last Updated	2026-02-09 00:00 GMT

Overview

How KStemmer loads and stores its dictionary: eager static initialization, a word list split across 8 data files, one shared default entry for ordinary words, and caching restricted to non-exception entries.

Description

The KStemmer (Krovetz stemmer) implementation loads its entire English word dictionary into a CharArrayMap when the class is initialized, via a static initializer. The dictionary is split across 8 data files (KStemData1.java through KStemData8.java) generated from an external "head_word_list.txt", which keeps each class file at a manageable size. The stemmer uses one shared defaultEntry instance for most words, and it caches only non-exception dictionary entries so that special-case words cannot produce stale results.
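
The loading pattern described above can be sketched as follows. This is a minimal illustration, not the Vespa source: it assumes a plain HashMap in place of CharArrayMap, two tiny arrays in place of the 8 data files, and a hypothetical exception word ("aide"). The names DictionaryLoadingSketch, DATA_1, DATA_2, DEFAULT_ENTRY and DICT are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the KStemmer loading pattern; not the Vespa source.
final class DictionaryLoadingSketch {

    // Mirrors the two properties the page describes: a root and an exception flag.
    static final class DictEntry {
        final String root;
        final boolean exception;

        DictEntry(String root, boolean exception) {
            this.root = root;
            this.exception = exception;
        }
    }

    // Stand-ins for the split data files; the real arrays hold the full word list.
    static final String[] DATA_1 = {"aback", "abacus", "abandon"};
    static final String[] DATA_2 = {"baby", "bachelor", "back"};

    // One shared instance for every ordinary (non-exception, null-root) word.
    static final DictEntry DEFAULT_ENTRY = new DictEntry(null, false);

    // Built once, when the JVM initializes this class.
    static final Map<String, DictEntry> DICT = initializeDictHash();

    private static Map<String, DictEntry> initializeDictHash() {
        Map<String, DictEntry> d = new HashMap<>(1000);
        // Hypothetical exception word: it gets its own entry object.
        d.put("aide", new DictEntry(null, true));
        for (String[] part : new String[][] {DATA_1, DATA_2}) {
            for (String w : part) {
                // Ordinary words all point at the same shared entry.
                d.putIfAbsent(w, DEFAULT_ENTRY);
            }
        }
        return d;
    }
}
```

In this sketch every ordinary word maps to the same DEFAULT_ENTRY object, so the per-word cost is one map slot rather than one map slot plus one entry object.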

Usage

Apply this heuristic when working with the KStemmer implementation or when designing similar dictionary-backed NLP components. Understanding the loading pattern is critical for: diagnosing startup performance, memory profiling, and extending the stemmer with additional languages or dictionaries.

The Insight (Rule of Thumb)

Action 1: Dictionary is split into 8 Java files (KStemData1-8) with static String arrays.
Value: Initial CharArrayMap capacity is 1000 entries; grows dynamically as words are loaded.
Action 2: Use a shared defaultEntry instance (non-exception, null root) for most dictionary words to save memory.
Action 3: Only cache non-exception dictionary entries in matchedEntry; exception words bypass caching.
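Action 4: Skip stemming when k <= 1 or k >= 49 (MaxWordLen - 1). The guard tests k rather than the word length directly; if k holds the index of the last character (word length minus 1), this excludes words of 2 or fewer characters and words of 50 or more.
Trade-off: Eager static initialization loads all 8 data files when the KStemmer class is initialized. The first use pays the full loading latency; later calls incur no initialization cost.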

Reasoning

The 8-file split keeps each generated Java class file under JVM class-file limits (for example, a single method, including the static initializer that fills each array, must stay under 64 KB of bytecode) and improves compilation and IDE performance. Static initialization gives thread-safe, one-time loading without explicit synchronization, because the JVM guarantees that a class is initialized exactly once. The shared defaultEntry avoids allocating a separate object for every ordinary dictionary word. Exception entries are never cached, which prevents stale results when the same KStemmer instance processes multiple words (exception words need a fresh lookup each time because their root may vary by context). The length guard (k <= 1 or k >= MaxWordLen - 1) avoids spending CPU on trivially short strings and on implausibly long strings that have no valid stems.
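
The one-time loading guarantee can be observed with a small sketch. It assumes a hypothetical Holder class standing in for KStemmer and a counter that records how often the loader runs; none of these names come from the Vespa source.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demonstration of JVM class initialization running exactly once.
final class StaticInitOnceSketch {

    // Counts how many times the loader below actually executes.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Stand-in for a class with a statically initialized dictionary.
    static final class Holder {
        static final String[] DICT = load();

        private static String[] load() {
            LOADS.incrementAndGet();
            return new String[] {"aback", "abacus"};
        }
    }

    // Touches Holder.DICT from n threads and returns the loader run count.
    static int touchFromThreads(int n) {
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                // First access from any thread triggers class initialization.
                int ignored = Holder.DICT.length;
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return LOADS.get();
    }
}
```

However many threads race to read Holder.DICT, the loader runs once, and no synchronized block or volatile field is needed in user code.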

Code Evidence

Static initialization from KStemmer.java:178:

private static final CharArrayMap<DictEntry> dict_ht = initializeDictHash();

Initial capacity from KStemmer.java:220:

CharArrayMap<DictEntry> d = new CharArrayMap<>(1000, false);

Shared default entry from KStemmer.java:251:

defaultEntry = new DictEntry(null, false);

Selective caching from KStemmer.java:413-420:

private DictEntry wordInDict() {
    if (matchedEntry != null) return matchedEntry;
    DictEntry e = dict_ht.get(word.getArray(), 0, word.length());
    if (e != null && !e.exception) {
        matchedEntry = e; // only cache if it's not an exception.
    }
    return e;
}
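
The effect of this rule can be checked in isolation with a small sketch. It assumes a HashMap in place of CharArrayMap, a hypothetical exception word ("aide"), a lookups counter added purely for observation, and a setWord helper that clears the cache when a new word starts; that reset is an assumption, since the excerpt does not show where matchedEntry is cleared.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the selective-caching rule; not the Vespa source.
final class SelectiveCacheSketch {

    static final class DictEntry {
        final String root;
        final boolean exception;

        DictEntry(String root, boolean exception) {
            this.root = root;
            this.exception = exception;
        }
    }

    private final Map<String, DictEntry> dict = new HashMap<>();
    private DictEntry matchedEntry; // cached hit for the current word
    private String word;
    int lookups;                    // number of real dictionary probes

    SelectiveCacheSketch() {
        dict.put("run", new DictEntry(null, false));
        dict.put("aide", new DictEntry(null, true)); // hypothetical exception word
    }

    // Assumption: the cache is cleared whenever a new word is started.
    void setWord(String w) {
        word = w;
        matchedEntry = null;
    }

    DictEntry wordInDict() {
        if (matchedEntry != null) return matchedEntry;
        lookups++;
        DictEntry e = dict.get(word);
        if (e != null && !e.exception) {
            matchedEntry = e; // cache only non-exception hits
        }
        return e;
    }
}
```

Calling wordInDict() twice for "run" probes the map once, while calling it twice for "aide" probes the map twice, because the exception entry is never stored in matchedEntry.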

Word length constraint from KStemmer.java:15, 1328:

static private final int MaxWordLen = 50;

// In stem():
if ((k <= 1) || (k >= MaxWordLen - 1)) {
    return false; // don't stem
}
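
To see which word lengths pass this guard, the check can be restated in terms of length. The sketch below assumes k is the index of the last character (length minus 1), which the excerpt does not show; LengthGuardSketch, MAX_WORD_LEN and isStemCandidate are names invented for the example.

```java
// Hypothetical restatement of the length guard; not the Vespa source.
final class LengthGuardSketch {

    static final int MAX_WORD_LEN = 50; // mirrors MaxWordLen in the excerpt

    // Returns true when a word of the given length would proceed to stemming.
    static boolean isStemCandidate(int len) {
        int k = len - 1; // assumption: k is the index of the last character
        return !((k <= 1) || (k >= MAX_WORD_LEN - 1));
    }
}
```

Under that assumption, lengths 3 through 49 are stemmed, while lengths of 2 or less and 50 or more are returned unchanged.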
