Implementation:Vespa engine Vespa KStemmer Stem

Knowledge Sources
Domains: NLP, Text_Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by Vespa's linguistics library, for reducing English words to their base form. It implements the Krovetz (KStem) stemming algorithm, which combines dictionary lookup with morphological suffix-stripping rules to produce readable stems.

Description

The KStemmer class implements the Krovetz stemming algorithm, a lightweight English stemmer. The implementation was originally derived from the Apache Lucene project. Unlike the more aggressive Porter stemmer, KStem prioritizes stems that are actual English words: it consults an internal dictionary both before and after applying suffix-removal rules.

The stem method is the public entry point. It accepts a string term, converts it to a character array, and delegates to an internal stem(char[], int) method that performs the actual stemming logic. If the word was modified during stemming, the modified form is returned; otherwise, the original term is returned unchanged.

Key internal data structures:

  • dict_ht (CharArrayMap<DictEntry>): A hash map containing the stemmer's dictionary. Each entry maps a word to a DictEntry that indicates whether the word is a valid root form and optionally provides an exception mapping.
  • word (OpenStringBuilder): A mutable string buffer holding the current word being processed. Suffix-removal operations modify this buffer in place.
  • j (int): Index marking the end of the stem within the word buffer.
  • k (int): Index marking the end of the entire word within the word buffer.
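
The roles of these structures can be illustrated with a small, self-contained sketch. It is not the Vespa source: the class StemmerStateSketch, the Entry type, and the two dictionary entries are hypothetical stand-ins, and a plain HashMap replaces CharArrayMap. It only shows how a dictionary entry can either confirm a root or redirect to one, and how j relates to k once a suffix has been recognized.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-ins for the stemmer's state. Names and shapes are assumptions, not Vespa code. */
final class StemmerStateSketch {

    /** Hypothetical analogue of DictEntry: an optional replacement root. */
    static final class Entry {
        final String root; // non-null when the word maps to a different root (exception mapping)

        Entry(String root) {
            this.root = root;
        }
    }

    // Illustrative two-word dictionary; the real dictionary is far larger.
    static final Map<String, Entry> DICT = new HashMap<>();
    static {
        DICT.put("walk", new Entry(null));   // a valid root on its own
        DICT.put("going", new Entry("go"));  // irregular form mapped directly to its root
    }

    /** Returns the root for a dictionary word, or null if the word is unknown. */
    static String lookup(String word) {
        Entry entry = DICT.get(word);
        if (entry == null) return null;
        return entry.root != null ? entry.root : word;
    }

    /** Index of the last stem character when the word ends in the suffix (the role of j), or -1. */
    static int stemEnd(String word, String suffix) {
        if (!word.endsWith(suffix) || word.length() <= suffix.length()) return -1;
        int k = word.length() - 1;   // k: index of the last character of the word
        return k - suffix.length();  // j: index of the last character of the stem
    }
}
```

For "walking" with suffix "ing", k is 6 and stemEnd returns 3, the index of the final "k" in "walk".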

The internal stemming process follows these steps in order:

  1. Dictionary lookup: If the word is found in the dictionary as-is, return it.
  2. Plural handling: Remove plural suffixes (-ies, -es, -s) and check dictionary.
  3. Past tense handling: Remove past tense suffixes (-ied, -ed) and check dictionary.
  4. Progressive handling: Remove -ing and check dictionary.
  5. Derivational suffixes: Try removing suffixes like -tion, -ness, -ment, -able, -ity, -ive, -ize, -al, -ful, -ous, -ence, etc.
  6. Post-processing: If no rule produced a valid dictionary entry, apply conservative fallback rules.

At each step, the algorithm checks whether the resulting stem exists in the dictionary. This dictionary guard prevents the aggressive over-stemming that afflicts purely rule-based algorithms.
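
The dictionary guard can be sketched in a few lines of plain Java. The following is a toy model under stated assumptions, not the KStemmer implementation: the class DictionaryGuardSketch, its nine-word dictionary, and its rule table are hypothetical, and it covers only the plural, past tense, and progressive steps. Doubled consonants (as in "running") and the derivational suffixes are left out.

```java
import java.util.Set;

/** Toy illustration of dictionary-guarded suffix stripping. NOT Vespa's KStemmer. */
final class DictionaryGuardSketch {

    // Hypothetical miniature dictionary; the real stemmer ships a much larger word list.
    private static final Set<String> DICT =
            Set.of("policy", "watch", "cat", "car", "walk", "try", "like", "care", "make");

    // Each rule is {suffix to remove, replacement}. Order matters: plural, past tense, progressive.
    // For -ed and -ing the variant that restores a final "e" is tried first.
    private static final String[][] RULES = {
        {"ies", "y"}, {"es", ""}, {"s", ""},
        {"ied", "y"}, {"ed", "e"}, {"ed", ""},
        {"ing", "e"}, {"ing", ""},
    };

    static String stem(String word) {
        // Very short words and known dictionary words are returned as-is.
        if (word.length() < 3 || DICT.contains(word)) return word;

        for (String[] rule : RULES) {
            String suffix = rule[0];
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length()) + rule[1];
                // Dictionary guard: accept a candidate only if it is a known word.
                if (DICT.contains(candidate)) return candidate;
            }
        }
        return word; // no rule produced a dictionary word
    }
}
```

In this sketch, stem("caring") yields "care" even though "car" is also in the dictionary, because the "e"-restoring candidate is checked first. A word such as "xyzzy" comes back unchanged because no candidate passes the guard.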

Usage

Use KStemmer.stem() during tokenization to reduce English words to their base forms. This method is typically called by the SimpleTokenizer as the final step in token processing, but it can also be used directly for standalone stemming operations.

The KStemmer is appropriate for:

  • English-language text processing where readable stems are preferred over maximally conflated forms.
  • Search applications where moderate stemming (higher recall than no stemming, but better precision than Porter stemming) is desired.

Code Reference

Source Location

  • Repository: Vespa
  • File: linguistics/src/main/java/com/yahoo/language/simple/kstem/KStemmer.java
  • Lines: 1277-1281

Signature

public String stem(String term)

Class Declaration

public class KStemmer

Package

package com.yahoo.language.simple.kstem;

Key Fields

private final CharArrayMap<DictEntry> dict_ht;  // Dictionary hash table
private OpenStringBuilder word;                  // Current word buffer
private int j;                                   // Stem end index
private int k;                                   // Word end index

Method Body

public String stem(String term) {
    boolean changed = stem(term.toCharArray(), term.length());
    if (!changed) return term;
    return asString();
}

I/O Contract

Inputs

Name | Type | Required | Description
term | String | Yes | A single English word to stem. Should be a lowercased, normalized token (no whitespace, no punctuation). Words shorter than 3 characters are typically returned unchanged.

Outputs

Name | Type | Description
(return value) | String | The stemmed form of the input term. If the term was already in its root form or could not be stemmed, the original term is returned (same object reference). If stemming was applied, a new string containing the stem is returned.
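
Callers that cannot guarantee the input contract may want a thin guard in front of the stemmer. The helper below is a sketch under assumptions: StemInputGuard and stemToken are hypothetical names, and the stemmer is passed in as a UnaryOperator<String> so the example compiles without the Vespa library on the classpath.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical helper that enforces the input contract before delegating to a stemmer. */
final class StemInputGuard {

    /**
     * Lowercases the token and stems it only if every character is a letter.
     * Tokens containing whitespace, digits, or punctuation are returned untouched.
     */
    static String stemToken(String token, UnaryOperator<String> stemmer) {
        String lowered = token.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lowered.length(); i++) {
            if (!Character.isLetter(lowered.charAt(i))) return token;
        }
        return stemmer.apply(lowered);
    }
}
```

With a KStemmer instance named stemmer, the call would look like StemInputGuard.stemToken(raw, stemmer::stem).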

Usage Examples

Basic Usage

import com.yahoo.language.simple.kstem.KStemmer;

KStemmer stemmer = new KStemmer();

// Plural forms
System.out.println(stemmer.stem("policies"));    // -> "policy"
System.out.println(stemmer.stem("watches"));     // -> "watch"
System.out.println(stemmer.stem("cats"));        // -> "cat"

// Past tense
System.out.println(stemmer.stem("walked"));      // -> "walk"
System.out.println(stemmer.stem("tried"));       // -> "try"
System.out.println(stemmer.stem("liked"));       // -> "like"

// Progressive
System.out.println(stemmer.stem("walking"));     // -> "walk"
System.out.println(stemmer.stem("making"));      // -> "make"
System.out.println(stemmer.stem("running"));     // -> "run"

Derivational Suffixes

import com.yahoo.language.simple.kstem.KStemmer;

KStemmer stemmer = new KStemmer();

// -tion suffix
System.out.println(stemmer.stem("completion"));  // -> "complete"
System.out.println(stemmer.stem("connection"));  // -> "connect"

// -ness suffix
System.out.println(stemmer.stem("darkness"));    // -> "dark"
System.out.println(stemmer.stem("kindness"));    // -> "kind"

// -ment suffix
System.out.println(stemmer.stem("enjoyment"));   // -> "enjoy"

// -ive suffix
System.out.println(stemmer.stem("connective"));  // -> "connect"
System.out.println(stemmer.stem("effective"));   // -> "effect"

Dictionary Guard Behavior

import com.yahoo.language.simple.kstem.KStemmer;

KStemmer stemmer = new KStemmer();

// Dictionary prevents over-stemming:
// "caring" -> "care" (not "car")
System.out.println(stemmer.stem("caring"));      // -> "care"

// Words already in root form are returned unchanged
System.out.println(stemmer.stem("connect"));     // -> "connect"
System.out.println(stemmer.stem("walk"));        // -> "walk"

// Short words are returned unchanged
System.out.println(stemmer.stem("go"));          // -> "go"
System.out.println(stemmer.stem("is"));          // -> "is"

Integration in Tokenization Pipeline

import com.yahoo.language.Language;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.simple.kstem.KStemmer;
import java.util.Locale;

// Illustrative wrapper class; the name TokenPipeline is not part of Vespa.
public class TokenPipeline {

    private final SimpleNormalizer normalizer = new SimpleNormalizer();
    private final SimpleTransformer transformer = new SimpleTransformer();
    private final KStemmer stemmer = new KStemmer();

    public String processToken(String rawToken, Language language) {
        // Step 1: Unicode normalization
        String normalized = normalizer.normalize(rawToken);

        // Step 2: Accent removal
        String noAccents = transformer.accentDrop(normalized, language);

        // Step 3: Lowercasing (Locale.ROOT keeps the result independent of the JVM default locale)
        String lowered = noAccents.toLowerCase(Locale.ROOT);

        // Step 4: Stemming (final step)
        return stemmer.stem(lowered);
    }
}

// Example: processToken("Connections", Language.ENGLISH)
// -> normalize: "Connections"
// -> accentDrop: "Connections"
// -> toLowerCase: "connections"
// -> stem: "connect"
