Implementation:Vespa engine Vespa KStemmer Stem
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by Vespa's linguistics library, for reducing English words to their base forms. It implements the Krovetz (KStem) stemming algorithm, which combines dictionary lookup with morphological suffix-stripping rules to produce readable stems.
Description
The KStemmer class implements the Krovetz stemming algorithm, a lightweight English stemmer developed by Robert Krovetz. The implementation is derived from the Apache Lucene project. Unlike the more aggressive Porter stemmer, KStem prioritizes producing stems that are actual English words by consulting an internal dictionary before and after applying suffix-removal rules.
The stem method is the public entry point. It accepts a string term, converts it to a character array, and delegates to an internal stem(char[], int) method that performs the actual stemming logic. If the word was modified during stemming, the modified form is returned; otherwise, the original term is returned unchanged.
Key internal data structures:
- dict_ht (CharArrayMap<DictEntry>): A hash map containing the stemmer's dictionary. Each entry maps a word to a DictEntry that indicates whether the word is a valid root form and optionally provides an exception mapping.
- word (OpenStringBuilder): A mutable string buffer holding the current word being processed. Suffix-removal operations modify this buffer in place.
- j (int): Index marking the end of the stem within the word buffer.
- k (int): Index marking the end of the entire word within the word buffer.
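To make the roles of the word buffer and the two indices more concrete, the following is a minimal sketch of suffix matching over a mutable buffer. It is not Vespa's code: the class WordBufferSketch and its methods are hypothetical, a plain StringBuilder stands in for OpenStringBuilder, and treating j and k as inclusive indices of the last stem character and last word character is an assumption.

```java
// Simplified sketch of a suffix-matching word buffer; NOT Vespa's actual code.
// All names here are hypothetical and chosen for illustration only.
final class WordBufferSketch {
    private final StringBuilder word = new StringBuilder(); // stand-in for OpenStringBuilder
    private int j; // assumed: index of the last character of the candidate stem
    private int k; // assumed: index of the last character of the whole word

    /** Loads a new term into the buffer and resets both indices to the last character. */
    void load(String term) {
        word.setLength(0);
        word.append(term);
        k = word.length() - 1;
        j = k;
    }

    /** Returns true if the word ends with the suffix; on success j marks the stem end. */
    boolean endsIn(String suffix) {
        int start = k - suffix.length() + 1;
        if (start <= 0) return false; // the suffix must leave a non-empty stem
        for (int i = 0; i < suffix.length(); i++) {
            if (word.charAt(start + i) != suffix.charAt(i)) return false;
        }
        j = start - 1;
        return true;
    }

    /** Truncates the buffer in place to the stem marked by j. */
    void truncateToStem() {
        word.setLength(j + 1);
        k = j;
    }

    String current() { return word.toString(); }
    int stemEnd() { return j; }
    int wordEnd() { return k; }
}
```

In this sketch, loading "walking" and matching the suffix "ing" leaves j pointing at the final "k" of "walk", so truncating to the stem needs no new allocation.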
The internal stemming process follows these steps in order:
- Dictionary lookup: If the word is found in the dictionary as-is, return it.
- Plural handling: Remove plural suffixes (-ies, -es, -s) and check dictionary.
- Past tense handling: Remove past tense suffixes (-ied, -ed) and check dictionary.
- Progressive handling: Remove -ing and check dictionary.
- Derivational suffixes: Try removing suffixes like -tion, -ness, -ment, -able, -ity, -ive, -ize, -al, -ful, -ous, -ence, etc.
- Post-processing: If no rule produced a valid dictionary entry, apply conservative fallback rules.
At each step, the algorithm checks whether the resulting stem exists in the dictionary. This dictionary guard prevents the aggressive over-stemming that afflicts purely rule-based algorithms.
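The dictionary guard can be illustrated with a toy stemmer. The sketch below is a simplified stand-in, not the KStemmer implementation: the class DictionaryGuardSketch, its nine-word dictionary, and its rule table are all hypothetical, it covers only the plural, past tense, and progressive steps, and it omits the derivational and fallback rules. The order in which replacement endings are tried is a choice made for this sketch.

```java
import java.util.Set;

// Toy illustration of dictionary-guarded suffix stripping; NOT Vespa's KStemmer.
final class DictionaryGuardSketch {
    // Tiny hypothetical dictionary; the real stemmer ships a far larger one.
    private static final Set<String> DICT =
            Set.of("policy", "watch", "cat", "walk", "try", "like", "care", "car", "make");

    // Each row: the suffix to strip, followed by replacement endings to try in order.
    private static final String[][] RULES = {
            {"ies", "y"},      // policies -> policy
            {"es", "", "e"},   // watches -> watch
            {"s", ""},         // cats -> cat
            {"ied", "y"},      // tried -> try
            {"ed", "e", ""},   // liked -> like, walked -> walk
            {"ing", "e", ""},  // caring -> care, walking -> walk
    };

    static String stem(String term) {
        // Short words and words already in the dictionary are returned as-is.
        if (term.length() < 3 || DICT.contains(term)) return term;
        for (String[] rule : RULES) {
            String suffix = rule[0];
            if (!term.endsWith(suffix)) continue;
            String base = term.substring(0, term.length() - suffix.length());
            for (int i = 1; i < rule.length; i++) {
                String candidate = base + rule[i];
                // Dictionary guard: accept a candidate only if it is a known word.
                if (DICT.contains(candidate)) return candidate;
            }
        }
        return term; // no guarded candidate found; fallback rules are omitted here
    }

    public static void main(String[] args) {
        for (String w : new String[] {"policies", "walked", "caring", "zzzing"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because a candidate is accepted only when it appears in the dictionary, an unknown word such as "zzzing" is returned unchanged in this sketch rather than being cut down to a non-word.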
Usage
Use KStemmer.stem() during tokenization to reduce English words to their base forms. This method is typically called by the SimpleTokenizer as the final step in token processing, but it can also be used directly for standalone stemming operations.
The KStemmer is appropriate for:
- English-language text processing where readable stems are preferred over maximally conflated forms.
- Search applications where moderate stemming (higher recall than no stemming, but better precision than Porter stemming) is desired.
Code Reference
Source Location
- Repository: Vespa
- File: linguistics/src/main/java/com/yahoo/language/simple/kstem/KStemmer.java
- Lines: 1277-1281
Signature
public String stem(String term)
Class Declaration
public class KStemmer
Package
package com.yahoo.language.simple.kstem;
Key Fields
private final CharArrayMap<DictEntry> dict_ht; // Dictionary hash table
private OpenStringBuilder word; // Current word buffer
private int j; // Stem end index
private int k; // Word end index
Method Body
public String stem(String term) {
    boolean changed = stem(term.toCharArray(), term.length());
    if (!changed) return term;
    return asString();
}
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| term | String | Yes | A single English word to stem. Should be a lowercased, normalized token (no whitespace, no punctuation). Words shorter than 3 characters are typically returned unchanged. |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | String | The stemmed form of the input term. If the term was already in its root form or could not be stemmed, the original term is returned (same object reference). If stemming was applied, a new string containing the stem is returned. |
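Callers that cannot guarantee the input precondition may want to check it before stemming. The following is a minimal caller-side sketch under stated assumptions: the class StemContractSketch and its methods are hypothetical, and a UnaryOperator<String> stands in for the stemmer so that the example is self-contained. The validity check (letters only, no uppercase) is one possible reading of "lowercased, normalized token", not a rule taken from the Vespa source.

```java
import java.util.function.UnaryOperator;

// Caller-side sketch of the input contract; names are hypothetical.
final class StemContractSketch {
    /** Returns true if the token is non-empty and consists only of non-uppercase letters. */
    static boolean isStemmable(String token) {
        if (token.isEmpty()) return false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetter(c) || Character.isUpperCase(c)) return false;
        }
        return true;
    }

    /** Applies the stemmer only to tokens that satisfy the precondition; others pass through. */
    static String stemIfValid(String token, UnaryOperator<String> stemmer) {
        return isStemmable(token) ? stemmer.apply(token) : token;
    }
}
```

With the real class on the classpath, the stand-in could be replaced by a method reference such as stemmer::stem on a KStemmer instance. Since the method returns the same object reference when nothing changed, a caller can also compare the result to the input with == to detect whether stemming was applied.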
Usage Examples
Basic Usage
import com.yahoo.language.simple.kstem.KStemmer;
KStemmer stemmer = new KStemmer();
// Plural forms
System.out.println(stemmer.stem("policies")); // -> "policy"
System.out.println(stemmer.stem("watches")); // -> "watch"
System.out.println(stemmer.stem("cats")); // -> "cat"
// Past tense
System.out.println(stemmer.stem("walked")); // -> "walk"
System.out.println(stemmer.stem("tried")); // -> "try"
System.out.println(stemmer.stem("liked")); // -> "like"
// Progressive
System.out.println(stemmer.stem("walking")); // -> "walk"
System.out.println(stemmer.stem("making")); // -> "make"
System.out.println(stemmer.stem("running")); // -> "run"
Derivational Suffixes
import com.yahoo.language.simple.kstem.KStemmer;
KStemmer stemmer = new KStemmer();
// -tion suffix
System.out.println(stemmer.stem("completion")); // -> "complete"
System.out.println(stemmer.stem("connection")); // -> "connect"
// -ness suffix
System.out.println(stemmer.stem("darkness")); // -> "dark"
System.out.println(stemmer.stem("kindness")); // -> "kind"
// -ment suffix
System.out.println(stemmer.stem("enjoyment")); // -> "enjoy"
// -ive suffix
System.out.println(stemmer.stem("connective")); // -> "connect"
System.out.println(stemmer.stem("effective")); // -> "effect"
Dictionary Guard Behavior
import com.yahoo.language.simple.kstem.KStemmer;
KStemmer stemmer = new KStemmer();
// Dictionary prevents over-stemming:
// "caring" -> "care" (not "car")
System.out.println(stemmer.stem("caring")); // -> "care"
// Words already in root form are returned unchanged
System.out.println(stemmer.stem("connect")); // -> "connect"
System.out.println(stemmer.stem("walk")); // -> "walk"
// Short words are returned unchanged
System.out.println(stemmer.stem("go")); // -> "go"
System.out.println(stemmer.stem("is")); // -> "is"
Integration in Tokenization Pipeline
import com.yahoo.language.Language;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.simple.kstem.KStemmer;
import java.util.Locale;

// Wrapper class added so the snippet compiles; the class name is illustrative.
class TokenPipeline {
    private final SimpleNormalizer normalizer = new SimpleNormalizer();
    private final SimpleTransformer transformer = new SimpleTransformer();
    private final KStemmer stemmer = new KStemmer();

    public String processToken(String rawToken, Language language) {
        // Step 1: Unicode normalization
        String normalized = normalizer.normalize(rawToken);
        // Step 2: Accent removal
        String noAccents = transformer.accentDrop(normalized, language);
        // Step 3: Lowercasing (explicit locale avoids default-locale surprises)
        String lowered = noAccents.toLowerCase(Locale.ROOT);
        // Step 4: Stemming (final step)
        return stemmer.stem(lowered);
    }
}
// Example: processToken("Connections", Language.ENGLISH)
// -> normalize: "Connections"
// -> accentDrop: "Connections"
// -> toLowerCase: "connections"
// -> stem: "connect"