Implementation:Duckdb Duckdb Snowball Stemmer


Knowledge Sources
Domains Text_Processing, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

The Snowball stemmer library provides algorithmic word stemming through 27 registered algorithms (26 languages plus the original Porter stemmer). DuckDB's full-text search (FTS) extension uses it to reduce words to their stems during indexing and querying.

Description

Snowball is a string-processing language designed for creating stemming algorithms. DuckDB embeds the libstemmer C implementation generated by the Snowball compiler, built here as C++ sources. The library is organized into three layers:

  1. Public API (libstemmer.h / libstemmer.cpp) -- A factory interface that creates stemmer objects by algorithm name and character encoding. Callers use sb_stemmer_new to instantiate a stemmer, sb_stemmer_stem to reduce a word to its stem, and sb_stemmer_delete to free resources.
  2. Module Registry (modules.h) -- A static lookup table mapping language names (full names, ISO 639-2, and ISO 639-1 codes) to their corresponding create, close, and stem function pointers. The registry also defines the stemmer_encoding_t enum, which declares several encodings; the stemmer sources listed on this page are all UTF-8 variants (ENC_UTF_8).
  3. Runtime Utilities (runtime/utilities.cpp) -- Low-level helpers for UTF-8 character traversal (skip_utf8), character grouping checks, string replacement, and cursor management within the SN_env execution environment.
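The UTF-8 traversal performed by the runtime utilities layer can be illustrated with a minimal sketch. The function name demo_skip_utf8_forward is hypothetical and is not the library's skip_utf8: it only moves forward, and it assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the library's skip_utf8): advance cursor c by
 * n UTF-8 characters inside the byte range p[0..l).
 * Returns the new cursor, or -1 if fewer than n characters remain. */
int demo_skip_utf8_forward(const unsigned char *p, int c, int l, int n) {
    assert(p != NULL && c >= 0 && c <= l);
    for (; n > 0; n--) {
        if (c >= l) return -1;          /* ran out of input */
        unsigned char b = p[c++];       /* consume the first byte */
        if (b >= 0xC0) {
            /* Lead byte of a multi-byte sequence: also consume the
             * continuation bytes, which all look like 10xxxxxx. */
            while (c < l && (p[c] & 0xC0) == 0x80) c++;
        }
    }
    return c;
}
```

The cursor is a byte offset, so skipping one character may move it by more than one byte. This is why the stemmer API measures word length in bytes rather than characters.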

Each language stemmer is a generated C++ source file (e.g., stem_UTF_8_english.cpp) that implements the create_env, close_env, and stem functions for that language. The internal SN_env struct holds the working string buffer, cursor positions (c, l, lb, bra, ket), and auxiliary integer/string arrays used during rule evaluation.
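To make the role of the cursor fields more concrete, the following sketch marks a suffix as a slice between bra and ket and then deletes it. The demo_env struct and the demo_* functions are hypothetical simplifications written for this page; they are not the library's SN_env or its runtime functions, and the fixed-size buffer is an assumption made to keep the example short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for SN_env. */
struct demo_env {
    unsigned char p[64]; /* working string (not null-terminated) */
    int c;               /* cursor */
    int l;               /* limit: current length of the string */
    int lb;              /* backward limit */
    int bra;             /* start of the marked slice */
    int ket;             /* end of the marked slice (exclusive) */
};

/* Load a word into the environment and reset the cursors. */
void demo_set_current(struct demo_env *z, const char *s) {
    size_t n = strlen(s);
    assert(n <= sizeof z->p);
    memcpy(z->p, s, n);
    z->l = (int)n;
    z->c = 0;
    z->lb = 0;
    z->bra = 0;
    z->ket = (int)n;
}

/* If the string ends with `suffix`, mark it as the slice [bra, ket)
 * and return 1; otherwise leave the environment unchanged and return 0. */
int demo_mark_suffix(struct demo_env *z, const char *suffix) {
    int n = (int)strlen(suffix);
    if (z->l - z->lb < n) return 0;
    if (memcmp(z->p + z->l - n, suffix, (size_t)n) != 0) return 0;
    z->ket = z->l;
    z->bra = z->l - n;
    z->c = z->bra;
    return 1;
}

/* Delete the marked slice, shifting any tail left and shrinking the limit. */
void demo_slice_del(struct demo_env *z) {
    int removed = z->ket - z->bra;
    memmove(z->p + z->bra, z->p + z->ket, (size_t)(z->l - z->ket));
    z->l -= removed;
    if (z->c >= z->ket) z->c -= removed;   /* cursor was after the slice */
    else if (z->c > z->bra) z->c = z->bra; /* cursor was inside the slice */
    z->ket = z->bra;
}
```

Loading "walking", marking the suffix "ing", and deleting the slice leaves the four bytes "walk" in the buffer with the limit at 4.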

Usage

DuckDB uses the Snowball stemmer library in its Full-Text Search (FTS) extension. When creating a full-text index, DuckDB tokenizes text and passes each token through the selected stemmer to produce stems. Because query terms are stemmed the same way, a query for "running" matches documents containing "run", "runs", or "running". Irregular forms such as "ran" are not conflated, because the stemmers apply suffix rules rather than dictionary lookups. The stemmer is selected by name at index creation time from the 27 registered algorithms.

Code Reference

Source Location

Signature

// Public API (libstemmer.h)
typedef unsigned char sb_symbol;

const char **          sb_stemmer_list(void);
struct sb_stemmer *    sb_stemmer_new(const char * algorithm, const char * charenc);
void                   sb_stemmer_delete(struct sb_stemmer * stemmer);
const sb_symbol *      sb_stemmer_stem(struct sb_stemmer * stemmer,
                                       const sb_symbol * word, int size);
int                    sb_stemmer_length(struct sb_stemmer * stemmer);

// Internal runtime (runtime/api.h)
struct SN_env {
    symbol * p;
    int c; int l; int lb; int bra; int ket;
    symbol * * S;
    int * I;
};

struct SN_env * SN_create_env(int S_size, int I_size);
void            SN_close_env(struct SN_env * z, int S_size);
int             SN_set_current(struct SN_env * z, int size, const symbol * s);

// Per-language module interface (e.g., for English)
struct SN_env * english_UTF_8_create_env(void);
void            english_UTF_8_close_env(struct SN_env *);
int             english_UTF_8_stem(struct SN_env *);

// Runtime utilities (runtime/utilities.cpp)
symbol * create_s(void);
void     lose_s(symbol * p);
int      skip_utf8(const symbol * p, int c, int lb, int l, int n);
int      in_grouping_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);
int      in_grouping_b_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);

// Module registry types (modules.h)
typedef enum {
    ENC_UNKNOWN = 0,
    ENC_ISO_8859_1,
    ENC_ISO_8859_2,
    ENC_KOI8_R,
    ENC_UTF_8
} stemmer_encoding_t;

struct stemmer_modules {
    const char * name;
    stemmer_encoding_t enc;
    struct SN_env * (*create)(void);
    void (*close)(struct SN_env *);
    int (*stem)(struct SN_env *);
};
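The registry idea behind struct stemmer_modules can be sketched as a sentinel-terminated table searched by exact name. Everything prefixed with demo_ below is hypothetical: the stem functions are stubs that take no environment argument, and the table holds only a handful of names, so this is an illustration of the lookup pattern rather than the contents of modules.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stubs standing in for generated per-language stem functions. */
static int demo_english_stem(void) { return 1; }
static int demo_german_stem(void)  { return 2; }

struct demo_module {
    const char *name;   /* full name or ISO 639 code */
    int (*stem)(void);  /* entry point for this algorithm */
};

/* Several names share one entry point; a NULL name terminates the table. */
static const struct demo_module demo_modules[] = {
    {"en",      demo_english_stem},
    {"eng",     demo_english_stem},
    {"english", demo_english_stem},
    {"de",      demo_german_stem},
    {"deu",     demo_german_stem},
    {"ger",     demo_german_stem},
    {"german",  demo_german_stem},
    {NULL,      NULL}
};

/* Exact, case-sensitive lookup; returns NULL for unknown names. */
const struct demo_module *demo_find_module(const char *name) {
    assert(name != NULL);
    for (const struct demo_module *m = demo_modules; m->name != NULL; m++) {
        if (strcmp(m->name, name) == 0) return m;
    }
    return NULL;
}
```

Because the comparison uses strcmp, a name such as "English" does not match "english", which is consistent with the case-sensitive behavior described for the algorithm argument.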

Import

#include "libstemmer.h"            // Public API for stemmer creation and usage
#include "../runtime/api.h"        // Internal SN_env environment structures
#include "modules.h"               // Module registry with per-language function pointers

I/O Contract

Inputs

Name | Type | Required | Description
algorithm | const char * | Yes | Language name or ISO 639 code (e.g., "english", "en", "eng"). Case-sensitive, lowercase.
charenc | const char * | No | Character encoding name (e.g., "UTF_8"). Defaults to UTF-8 if NULL.
word | const sb_symbol * | Yes | Pointer to the input word bytes to be stemmed.
size | int | Yes | Length of the input word in bytes.

Outputs

Name | Type | Description
sb_stemmer_new return | struct sb_stemmer * | Opaque handle to a stemmer instance. NULL on failure (unknown algorithm or out of memory).
sb_stemmer_stem return | const sb_symbol * | Pointer to the stemmed word (owned by the stemmer, null-terminated). NULL on out-of-memory error.
sb_stemmer_length return | int | Length in bytes of the most recently stemmed result.
sb_stemmer_list return | const char ** | Null-terminated array of canonical algorithm names (27 algorithms).
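Since the pointer returned by sb_stemmer_stem is owned by the stemmer, it is safest to assume it does not remain valid after the next stem call or after the stemmer is deleted. A caller that needs to keep a stem can copy it first. The helper below is a hypothetical sketch (demo_copy_stem is not part of libstemmer); it takes the pointer and the byte length that the two API calls would provide.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of a stem into a caller-owned,
 * null-terminated buffer. Returns NULL if `stem` is NULL or allocation
 * fails. The caller releases the copy with free(). */
char *demo_copy_stem(const unsigned char *stem, int len) {
    assert(len >= 0);
    if (stem == NULL) return NULL;
    char *copy = (char *)malloc((size_t)len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, stem, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

In real use, the arguments would be the result of sb_stemmer_stem and the value of sb_stemmer_length, read immediately after the stem call.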

Supported Languages

Language | ISO 639-1 | ISO 639-2 | Source File
Arabic | ar | ara | stem_UTF_8_arabic.cpp
Basque | eu | baq, eus | stem_UTF_8_basque.cpp
Catalan | ca | cat | stem_UTF_8_catalan.cpp
Danish | da | dan | stem_UTF_8_danish.cpp
Dutch | nl | dut, nld | stem_UTF_8_dutch.cpp
English | en | eng | stem_UTF_8_english.cpp
Finnish | fi | fin | stem_UTF_8_finnish.cpp
French | fr | fra, fre | stem_UTF_8_french.cpp
German | de | deu, ger | stem_UTF_8_german.cpp
Greek | el | ell, gre | stem_UTF_8_greek.cpp
Hindi | hi | hin | stem_UTF_8_hindi.cpp
Hungarian | hu | hun | stem_UTF_8_hungarian.cpp
Indonesian | id | ind | stem_UTF_8_indonesian.cpp
Irish | ga | gle | stem_UTF_8_irish.cpp
Italian | it | ita | stem_UTF_8_italian.cpp
Lithuanian | lt | lit | stem_UTF_8_lithuanian.cpp
Nepali | ne | nep | stem_UTF_8_nepali.cpp
Norwegian | no | nor | stem_UTF_8_norwegian.cpp (not listed as separate src_c file)
Porter | -- | -- | stem_UTF_8_porter.cpp
Portuguese | pt | por | stem_UTF_8_portuguese.cpp
Romanian | ro | ron, rum | stem_UTF_8_romanian.cpp
Russian | ru | rus | stem_UTF_8_russian.cpp
Serbian | sr | srp | stem_UTF_8_serbian.cpp
Spanish | es | esl, spa | stem_UTF_8_spanish.cpp
Swedish | sv | swe | stem_UTF_8_swedish.cpp (not listed as separate src_c file)
Tamil | ta | tam | stem_UTF_8_tamil.cpp
Turkish | tr | tur | stem_UTF_8_turkish.cpp

Additional stemmer source files (present in the repository but not registered in the module table by canonical name):

  • stem_UTF_8_german2.cpp -- Alternative German stemmer variant
  • stem_UTF_8_kraaij_pohlmann.cpp -- Kraaij-Pohlmann Dutch stemmer variant
  • stem_UTF_8_lovins.cpp -- Lovins stemmer algorithm

Usage Examples

#include <stdio.h>    // printf
#include <string.h>   // strlen
#include "libstemmer.h"

int main(void) {
    // List all available stemming algorithms
    const char ** algorithms = sb_stemmer_list();
    for (int i = 0; algorithms[i] != NULL; i++) {
        printf("Algorithm: %s\n", algorithms[i]);
    }

    // Create an English stemmer (a NULL charenc selects UTF-8)
    struct sb_stemmer * stemmer = sb_stemmer_new("english", NULL);
    if (stemmer == NULL) {
        // Algorithm not found or out of memory
        return 1;
    }

    // Stem a word; the size argument is the length in bytes
    const char * word = "running";
    const sb_symbol * result = sb_stemmer_stem(stemmer,
        (const sb_symbol *)word, (int)strlen(word));
    if (result != NULL) {
        // The result is owned by the stemmer; read its length right away
        int len = sb_stemmer_length(stemmer);
        printf("Stemmed: %.*s\n", len, (const char *)result);
        // Prints "Stemmed: run"
    }

    // Clean up
    sb_stemmer_delete(stemmer);
    return 0;
}
