Principle:Duckdb Duckdb Text Stemming
| Knowledge Sources | |
|---|---|
| Domains | Natural_Language_Processing, Text_Processing |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
An algorithmic process that reduces inflected or derived words to their root form (stem) by removing suffixes according to language-specific rules, enabling morphological normalization of text.
Description
Stemming is a text normalization technique used in information retrieval and natural language processing that reduces words to a common base form called a stem. For example, "running", "runs", and "ran" might all be reduced to the stem "run". The purpose is to group related word forms together so that a search for one form retrieves documents containing any related form.
The Snowball stemming framework, developed by Martin Porter (creator of the original Porter Stemmer), provides a domain-specific language for writing stemming algorithms and includes pre-built stemmers for numerous languages including English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, and many others.
Snowball stemmers operate by applying a series of suffix-stripping rules organized into steps. Each step examines the current suffix of the word and the properties of the remaining stem (such as the number of consonant-vowel sequences, called the "measure" of the stem). Rules are applied in order of decreasing suffix length, and only the first matching rule in each step is applied. This produces a deterministic mapping from word forms to stems.
Unlike lemmatization (which uses vocabulary lookup and morphological analysis to find the true dictionary form), stemming is a heuristic process that may produce stems that are not valid words. However, stemming is much faster and requires no dictionary, making it suitable for indexing large text corpora in real time.
Usage
Text stemming is used in DuckDB's Full-Text Search (FTS) extension. When building a full-text index, DuckDB stems each word in the indexed text using the appropriate language-specific Snowball stemmer. At query time, search terms are also stemmed before matching against the index. This ensures that searching for "running" also matches documents containing "runs" or "ran", significantly improving recall in text search.
Theoretical Basis
Suffix Stripping Rules: Language-specific rules organized in steps:
// English stemmer (simplified Porter algorithm)
// Step 1a: Plural suffixes
"sses" -> "ss" (caresses -> caress)
"ies" -> "i" (ponies -> poni)
"ss" -> "ss" (caress -> caress)
"s" -> "" (cats -> cat)
// Step 1b: Verbal suffixes
"eed" -> "ee" if measure(stem) > 0 (agreed -> agree)
"ed" -> "" if stem contains vowel (plastered -> plaster)
"ing" -> "" if stem contains vowel (motoring -> motor)
Stem Measure: Counting consonant-vowel sequences:
// The "measure" m of a stem is the number of VC sequences
// C = consonant sequence, V = vowel sequence
// Pattern: [C](VC){m}[V]
// Examples:
// "tree" = CVCV -> m = 1
// "trouble" = CCVVCVC -> m = 2
// "oats" = VCCC -> m = 1
// "by" = CV -> m = 0
function measure(stem):
m = 0
in_vowel = false
for each letter in stem:
if is_vowel(letter):
in_vowel = true
else if in_vowel:
m += 1
in_vowel = false
return m
Multi-Step Processing: Rules applied in sequence:
function stem(word):
// Step 1: Remove plurals and past participles
word = apply_step1a(word) // sses -> ss, ies -> i, s -> ""
word = apply_step1b(word) // eed -> ee, ed -> "", ing -> ""
word = apply_step1c(word) // y -> i if stem has vowel
// Step 2: Map double suffixes to single
word = apply_step2(word) // ational -> ate, tional -> tion
// Step 3: Deal with -ic, -full, -ness, etc.
word = apply_step3(word) // icate -> ic, ative -> ""
// Step 4: Remove -ance, -ence, -er, -ic, etc.
word = apply_step4(word) // al -> "", ance -> "", ence -> ""
// Step 5: Clean up
word = apply_step5(word) // e -> "" if m > 1, ll -> l
return word
// Example: "generalizations" ->
// Step 1a: "generalization"
// Step 2: "generalize"
// Step 3: "general"
// Step 4: "gener"