Principle:Vespa engine Vespa Stemming

Knowledge Sources	Vespa; Viewing Morphology as an Inference Process (Krovetz 1993); Corpus-Based Stemming Using Co-Occurrence of Word Variants
Domains	NLP, Text_Processing
Last Updated	2026-02-09 00:00 GMT

Overview

Stemming reduces inflected or derived words to a base or root form so that search systems can match different grammatical forms of the same underlying word. The approach described here combines dictionary lookup with morphological suffix-stripping rules.

Description

In natural languages, words appear in many inflected forms. The English word "connect" can appear as "connects", "connected", "connecting", "connection", "connections", "connective", and so on. Without stemming, a search for "connecting" would fail to match a document containing only "connection", even though both words derive from the same root concept.

Stemming addresses this by mapping all inflected and derived forms of a word to a common stem. This increases recall (the fraction of relevant documents retrieved) at the potential cost of some precision (some unrelated words may share a stem).
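The recall/precision trade-off can be illustrated on a toy collection. The sketch below is not Vespa code: the STEMS table stands in for a real stemmer, and the documents and relevance judgments are invented for illustration.

```python
# Hypothetical stem table standing in for a real stemmer.
STEMS = {
    "connecting": "connect",
    "connection": "connect",
    "connected": "connect",
    "connective": "connect",
}

# Invented documents and relevance judgments for the query "connecting".
DOCS = {
    1: "the connection was dropped",
    2: "connecting two routers",
    3: "a connective tissue disorder",
    4: "router firmware update",
}
RELEVANT = {1, 2}


def stem(token):
    """Look the token up in the toy table; unknown tokens pass through."""
    return STEMS.get(token, token)


def search(query, use_stemming):
    """Return ids of documents containing the (optionally stemmed) query term."""
    norm = stem if use_stemming else (lambda t: t)
    q = norm(query)
    return {doc_id for doc_id, text in DOCS.items()
            if q in {norm(t) for t in text.split()}}


def recall(hits):
    return len(hits & RELEVANT) / len(RELEVANT)


def precision(hits):
    return len(hits & RELEVANT) / len(hits) if hits else 0.0


for flag in (False, True):
    hits = search("connecting", flag)
    print(flag, sorted(hits), recall(hits), precision(hits))
```

With these made-up judgments, exact matching finds only document 2, while stemming also retrieves document 1 (a gain in recall) and document 3 (a loss in precision).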

There are several families of stemming algorithms:

Affix-stripping stemmers (e.g., Porter Stemmer) apply a sequence of suffix-removal rules with limited context. They are fast but can be overly aggressive, producing stems that are not real words ("relational" becomes "relat").
Dictionary-based stemmers (e.g., KStem/Krovetz Stemmer) first check whether the word exists in a dictionary. If it does, they return it as-is or return a known base form. If it does not, they apply conservative suffix-stripping rules. This produces more readable stems ("relational" becomes "relation").
Lemmatizers use full morphological analysis and part-of-speech information to return the actual dictionary form (lemma) of a word. They are the most accurate but also the most computationally expensive.

The KStem algorithm (Krovetz Stemmer) takes a middle-ground approach:

Dictionary lookup: Check if the word is already in the dictionary. If so, return it unchanged.
Morphological rules: Apply suffix-stripping rules for common English morphological patterns (plurals, past tense, progressive, -tion, -ness, -ment, -ble, -ity, etc.).
Post-rule dictionary check: After stripping a suffix, check whether the resulting stem is in the dictionary. If not, the stripping may be incorrect, and a different rule may be tried.

This dictionary-guided approach avoids the worst errors of purely rule-based stemmers while remaining efficient enough for real-time search applications.
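The three steps above can be sketched roughly as follows. This is a minimal illustration, not the KStem implementation: the dictionary is tiny, only plural rules are included, the rule order is an assumption, and the function name stem_with_trace is hypothetical. It returns the rejected candidates as well, to make the "try a different rule" behaviour visible.

```python
# Tiny illustrative dictionary; a real stemmer dictionary is far larger.
DICTIONARY = {"cat", "policy", "house", "watch", "news"}

# (suffix, replacement) pairs tried in order. Illustrative only.
PLURAL_RULES = [("ies", "y"), ("es", "e"), ("es", ""), ("s", "")]


def stem_with_trace(word):
    """Return (stem, rejected_candidates) for a lowercase word."""
    rejected = []
    # Step 1: a word already in the dictionary is returned unchanged.
    if word in DICTIONARY:
        return word, rejected
    # Step 2: try each suffix rule in turn.
    for suffix, replacement in PLURAL_RULES:
        if not word.endswith(suffix):
            continue
        candidate = word[: len(word) - len(suffix)] + replacement
        # Step 3: accept the candidate only if the dictionary knows it;
        # otherwise record it and move on to the next rule.
        if candidate in DICTIONARY:
            return candidate, rejected
        rejected.append(candidate)
    # No rule produced a known word: leave the input untouched
    # (a simplification chosen for this sketch).
    return word, rejected


for w in ("news", "policies", "watches", "zorks"):
    print(w, stem_with_trace(w))
```

For "watches" the sketch first rejects the candidate "watche" and then accepts "watch"; "news" is protected by the initial lookup and never reaches the plural rules.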

Usage

Stemming should be applied:

During tokenization: As the final step in the token processing pipeline, after normalization, accent dropping, and case folding.
At both index time and query time: To ensure that query terms are stemmed to the same forms as indexed terms.
For recall-oriented search: When matching as many relevant documents as possible is more important than avoiding false matches.
For English and for morphologically richer languages: The benefit of stemming generally grows with the complexity of a language's inflectional system.
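The first two points can be sketched as a single analysis function shared by indexing and querying. This is an illustration under stated assumptions, not Vespa's pipeline: analyze, build_index, and search are hypothetical names, tokenization is plain whitespace splitting, and the STEMS table stands in for a real stemmer.

```python
import unicodedata
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer.
STEMS = {"cafes": "cafe", "connecting": "connect", "connections": "connect"}


def analyze(text):
    """Normalize, drop accents, fold case, then stem as the final step."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    tokens = no_accents.casefold().split()
    return [STEMS.get(tok, tok) for tok in tokens]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Analyze the query exactly like the documents, then AND the terms."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "Cafés Connecting", 2: "connections"}
index = build_index(docs)
print(sorted(search(index, "CONNECTIONS")))
print(sorted(search(index, "cafes")))
```

Because both sides call the same analyze function, a query term can never be stemmed to a form that the index does not contain for the same surface word.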

Stemming may not be appropriate when:

Exact matching is required: Legal search, medical terminology, or proper nouns where different forms have different meanings.
The language lacks a good stemmer: Stemming quality varies greatly by language; a poor stemmer can degrade search quality.
Subword tokenization is used: Neural models with BPE/WordPiece tokenization implicitly capture much morphological variation and generally do not benefit from explicit stemming.

Theoretical Basis

The KStem algorithm processes English morphology through a series of suffix-specific rules. Each rule has the form:

if word ends with SUFFIX:
    candidate = word with SUFFIX removed (possibly with adjustment)
    if candidate is in dictionary or meets minimum length:
        return candidate
    else:
        try next rule

The major morphological categories handled are:

Plural Forms

"ies"  -> "y"    (e.g., "policies" -> "policy")
"es"   -> ""     (e.g., "watches" -> "watch")
"es"   -> "e"    (e.g., "houses" -> "house")
"s"    -> ""     (e.g., "cats" -> "cat")

Past Tense

"ied"  -> "y"    (e.g., "tried" -> "try")
"ed"   -> ""     (e.g., "walked" -> "walk")
"ed"   -> "e"    (e.g., "liked" -> "like")

Progressive

"ing"  -> ""     (e.g., "walking" -> "walk")
"ing"  -> "e"    (e.g., "making" -> "make")

Derivational Suffixes

"tion"  -> "te"  (e.g., "completion" -> "complete")
"ness"  -> ""    (e.g., "darkness" -> "dark")
"ment"  -> ""    (e.g., "enjoyment" -> "enjoy")
"able"  -> ""    (e.g., "comfortable" -> "comfort")
"ity"   -> ""    (e.g., "complexity" -> "complex")
"ive"   -> ""    (e.g., "connective" -> "connect")

The dictionary serves as a guard against over-stemming. For example, a rule that blindly removes "ing" would stem "caring" to "car". With the dictionary available, the stemmer can confirm that the e-restored candidate "care" is a valid word and return "care" instead.
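The rule tables and the dictionary guard can be combined in a small table-driven sketch. The rule list below simply transcribes the patterns listed in this section. The ordering is an assumption (in particular, the e-restoring variant of "ed" and "ing" is tried before the bare variant), the dictionary is a toy, the minimum-length condition from the rule template is omitted, and guarded_stem and blind_strip are hypothetical names.

```python
# Toy dictionary covering the examples in this section.
DICTIONARY = {
    "policy", "watch", "house", "cat", "try", "walk", "like", "make",
    "complete", "dark", "enjoy", "comfort", "complex", "connect",
    "car", "care",
}

# (suffix, replacement) pairs. A longer suffix is listed before any
# shorter suffix it ends with; the order is otherwise an assumption.
RULES = [
    ("ies", "y"), ("ied", "y"),
    ("tion", "te"), ("ness", ""), ("ment", ""), ("able", ""),
    ("ity", ""), ("ive", ""),
    ("ing", "e"), ("ing", ""),
    ("es", "e"), ("es", ""),
    ("ed", "e"), ("ed", ""),
    ("s", ""),
]


def guarded_stem(word):
    """Apply the first rule whose candidate is a dictionary word."""
    if word in DICTIONARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    return word


def blind_strip(word):
    """Remove the first matching suffix with no dictionary check."""
    for suffix, _ in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


print(blind_strip("caring"), guarded_stem("caring"))
```

Note that "car" is itself a dictionary word, so in this sketch the correct result for "caring" depends on trying the e-restored candidate first, not on the dictionary check alone.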

Key theoretical considerations:

Conflation rate: The number of distinct surface forms mapped to each stem. Higher conflation increases recall but reduces precision.
Stemming errors: Over-stemming merges unrelated words (e.g., "universal" and "university" to "univers"). Under-stemming fails to merge related words (e.g., "absorb" and "absorption").
Dictionary completeness: The quality of a dictionary-based stemmer depends on the coverage of its dictionary. Missing entries lead to fallback to rule-based behavior.
Language specificity: Stemming rules are language-specific. The KStem algorithm is designed for English; other languages require different rule sets.
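The first two considerations can be made measurable. The sketch below assumes a mapping from surface forms to stems (the stems shown are illustrative outputs, not taken from a specific stemmer run) and a hand-labelled mapping from surface forms to concepts. All function names are hypothetical.

```python
from collections import defaultdict


def conflation_classes(stem_of):
    """Group surface forms by the stem they map to."""
    classes = defaultdict(set)
    for surface, stem in stem_of.items():
        classes[stem].add(surface)
    return dict(classes)


def conflation_rate(stem_of):
    """Average number of distinct surface forms per stem."""
    return len(stem_of) / len(conflation_classes(stem_of))


def stemming_errors(stem_of, concept_of):
    """Return (over_stemmed, under_stemmed) word pairs.

    Over-stemmed: same stem, different concept.
    Under-stemmed: same concept, different stem.
    """
    words = sorted(stem_of)
    over, under = [], []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            same_stem = stem_of[a] == stem_of[b]
            same_concept = concept_of[a] == concept_of[b]
            if same_stem and not same_concept:
                over.append((a, b))
            elif same_concept and not same_stem:
                under.append((a, b))
    return over, under


# Illustrative stemmer output and hand-labelled concepts.
stem_of = {"universal": "univers", "university": "univers",
           "absorb": "absorb", "absorption": "absorpt"}
concept_of = {"universal": "universal", "university": "university",
              "absorb": "absorb", "absorption": "absorb"}

print(conflation_rate(stem_of))
print(stemming_errors(stem_of, concept_of))
```

On this toy input the over-stemmed pair is ("universal", "university") and the under-stemmed pair is ("absorb", "absorption"), matching the examples in the list above.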

Related Pages

Implemented By