Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Text Stemming

From Leeroopedia


Knowledge Sources
Domains Natural_Language_Processing, Text_Processing
Last Updated 2026-02-07 12:00 GMT

Overview

An algorithmic process that reduces inflected or derived words to their root form (stem) by removing suffixes according to language-specific rules, enabling morphological normalization of text.

Description

Stemming is a text normalization technique used in information retrieval and natural language processing that reduces words to a common base form called a stem. For example, "running", "runs", and "ran" might all be reduced to the stem "run". The purpose is to group related word forms together so that a search for one form retrieves documents containing any related form.

The Snowball stemming framework, developed by Martin Porter (creator of the original Porter Stemmer), provides a domain-specific language for writing stemming algorithms and includes pre-built stemmers for numerous languages including English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, and many others.

Snowball stemmers operate by applying a series of suffix-stripping rules organized into steps. Each step examines the current suffix of the word and the properties of the remaining stem (such as the number of consonant-vowel sequences, called the "measure" of the stem). Rules are applied in order of decreasing suffix length, and only the first matching rule in each step is applied. This produces a deterministic mapping from word forms to stems.

Unlike lemmatization (which uses vocabulary lookup and morphological analysis to find the true dictionary form), stemming is a heuristic process that may produce stems that are not valid words. However, stemming is much faster and requires no dictionary, making it suitable for indexing large text corpora in real time.

Usage

Text stemming is used in DuckDB's Full-Text Search (FTS) extension. When building a full-text index, DuckDB stems each word in the indexed text using the appropriate language-specific Snowball stemmer. At query time, search terms are also stemmed before matching against the index. This ensures that searching for "running" also matches documents containing "runs" or "ran", significantly improving recall in text search.

Theoretical Basis

Suffix Stripping Rules: Language-specific rules organized in steps:

// English stemmer (simplified Porter algorithm)
// Step 1a: Plural suffixes
"sses" -> "ss"    (caresses -> caress)
"ies"  -> "i"     (ponies -> poni)
"ss"   -> "ss"    (caress -> caress)
"s"    -> ""       (cats -> cat)

// Step 1b: Verbal suffixes
"eed"  -> "ee"    if measure(stem) > 0  (agreed -> agree)
"ed"   -> ""      if stem contains vowel (plastered -> plaster)
"ing"  -> ""      if stem contains vowel (motoring -> motor)

Stem Measure: Counting consonant-vowel sequences:

// The "measure" m of a stem is the number of VC sequences
// C = consonant sequence, V = vowel sequence
// Pattern: [C](VC){m}[V]

// Examples:
// "tree"    = CVCV    -> m = 1
// "trouble" = CCVVCVC -> m = 2
// "oats"    = VCCC    -> m = 1
// "by"      = CV      -> m = 0

function measure(stem):
    m = 0
    in_vowel = false
    for each letter in stem:
        if is_vowel(letter):
            in_vowel = true
        else if in_vowel:
            m += 1
            in_vowel = false
    return m

Multi-Step Processing: Rules applied in sequence:

function stem(word):
    // Step 1: Remove plurals and past participles
    word = apply_step1a(word)   // sses -> ss, ies -> i, s -> ""
    word = apply_step1b(word)   // eed -> ee, ed -> "", ing -> ""
    word = apply_step1c(word)   // y -> i if stem has vowel

    // Step 2: Map double suffixes to single
    word = apply_step2(word)    // ational -> ate, tional -> tion

    // Step 3: Deal with -ic, -full, -ness, etc.
    word = apply_step3(word)    // icate -> ic, ative -> ""

    // Step 4: Remove -ance, -ence, -er, -ic, etc.
    word = apply_step4(word)    // al -> "", ance -> "", ence -> ""

    // Step 5: Clean up
    word = apply_step5(word)    // e -> "" if m > 1, ll -> l
    return word

// Example: "generalizations" ->
//   Step 1a: "generalization"
//   Step 2:  "generalize"
//   Step 3:  "general"
//   Step 4:  "gener"

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment