Implementation:Duckdb Duckdb Snowball Stemmer

Knowledge Sources	Duckdb_Duckdb Snowball
Domains	Text_Processing, Third_Party
Last Updated	2026-02-07 12:00 GMT

Overview

The Snowball Stemmer library provides algorithmic word stemming for 27 languages, used by DuckDB's full-text search (FTS) extension to reduce words to their root forms during indexing and querying.

Description

Snowball is a string-processing language designed for creating stemming algorithms. DuckDB embeds the libstemmer C implementation generated from the Snowball compiler. The library is organized into three layers:

Public API (libstemmer.h / libstemmer.cpp) -- A factory interface that creates stemmer objects by algorithm name and character encoding. Callers use sb_stemmer_new to instantiate a stemmer, sb_stemmer_stem to reduce a word to its stem, and sb_stemmer_delete to free resources.
Module Registry (modules.h) -- A static lookup table mapping language names (full names, ISO 639-2, and ISO 639-1 codes) to their corresponding create, close, and stem function pointers. The registry also defines the stemmer_encoding_t enum (supporting ENC_UTF_8).
Runtime Utilities (runtime/utilities.cpp) -- Low-level helpers for UTF-8 character traversal (skip_utf8), character grouping checks, string replacement, and cursor management within the SN_env execution environment.
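
To make the registry layer more concrete, the following is a minimal sketch of a name-to-function-pointer lookup. It is not the contents of modules.h: the `demo_` names, the stub functions, and the reduced struct (no encoding field) are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the generated per-language entry points. */
typedef struct demo_env { int unused; } demo_env;

static demo_env *demo_english_create(void) { static demo_env e; return &e; }
static void demo_english_close(demo_env *z) { (void)z; }
static int demo_english_stem(demo_env *z) { (void)z; return 1; }

/* One registry row: a name or ISO code plus the three function pointers. */
struct demo_module {
    const char *name;
    demo_env *(*create)(void);
    void (*close)(demo_env *);
    int (*stem)(demo_env *);
};

/* Several aliases share the same functions; a NULL name ends the table. */
static const struct demo_module demo_modules[] = {
    {"en",      demo_english_create, demo_english_close, demo_english_stem},
    {"eng",     demo_english_create, demo_english_close, demo_english_stem},
    {"english", demo_english_create, demo_english_close, demo_english_stem},
    {NULL, NULL, NULL, NULL}
};

/* Linear scan with an exact, case-sensitive comparison. */
const struct demo_module *demo_find_module(const char *algorithm) {
    const struct demo_module *m;
    if (algorithm == NULL) return NULL;
    for (m = demo_modules; m->name != NULL; m++) {
        if (strcmp(m->name, algorithm) == 0) return m;
    }
    return NULL;
}
```

With this layout, `demo_find_module("en")` and `demo_find_module("english")` resolve to rows that carry identical function pointers, while `demo_find_module("English")` returns NULL because the comparison is case-sensitive.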

Each language stemmer is a generated C++ source file (e.g., stem_UTF_8_english.cpp) that implements the create_env, close_env, and stem functions for that language. The internal SN_env struct holds the working string buffer, cursor positions (c, l, lb, bra, ket), and auxiliary integer/string arrays used during rule evaluation.
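
As a rough illustration of how the cursor fields could cooperate during a backward suffix rule, the sketch below uses a simplified, fixed-size stand-in for SN_env. The `mini_env` struct and both functions are hypothetical; the real environment is heap-allocated and the generated stemmers are considerably more involved.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for SN_env: fixed buffer, no S/I arrays. */
struct mini_env {
    unsigned char p[64]; /* working string */
    int c;               /* cursor */
    int l;               /* forward limit (current length) */
    int lb;              /* backward limit */
    int bra;             /* start of the marked slice */
    int ket;             /* end of the marked slice */
};

/* Load a word and place the cursor at its end, as a backward scan would. */
int mini_set_current(struct mini_env *z, const char *word) {
    size_t n = strlen(word);
    if (n >= sizeof z->p) return -1;
    memcpy(z->p, word, n);
    z->l = (int)n;
    z->lb = 0;
    z->c = z->l;
    z->bra = 0;
    z->ket = z->l;
    return 0;
}

/* If the text before the cursor ends with `suffix`, mark it as [bra, ket)
   and delete it. Returns 1 when the suffix was removed, 0 otherwise. */
int mini_strip_suffix(struct mini_env *z, const char *suffix) {
    int n = (int)strlen(suffix);
    if (z->c - z->lb < n) return 0;
    if (memcmp(z->p + z->c - n, suffix, (size_t)n) != 0) return 0;
    z->ket = z->c;   /* slice ends at the cursor */
    z->c -= n;       /* move backward over the suffix */
    z->bra = z->c;   /* slice starts where the cursor stops */
    memmove(z->p + z->bra, z->p + z->ket, (size_t)(z->l - z->ket));
    z->l -= n;       /* the string is now n bytes shorter */
    z->ket = z->bra; /* the marked slice is empty after deletion */
    return 1;
}
```

Loading "walking" and stripping "ing" leaves the four bytes "walk" in the buffer with `l` equal to 4. A single hard-coded suffix is only a stand-in for the rule tables that a generated stemmer evaluates.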

Usage

DuckDB uses the Snowball stemmer library in its Full-Text Search (FTS) extension. When creating a full-text index, DuckDB tokenizes text and passes each token through the appropriate language stemmer to produce root forms. This normalization allows a query for "running" to match documents containing "run" or "runs", because all three reduce to the same stem. Irregular forms such as "ran" are not conflated, since the stemmers apply suffix-stripping rules rather than dictionary lookups. The stemmer is selected by language name at index creation time, from the 27 registered languages.
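
The tokenize-then-stem flow can be sketched as a splitter that hands each token to a callback. This is an ASCII-only illustration, not DuckDB's tokenizer: `demo_tokenize`, `demo_token_fn`, and `demo_sum_lengths` are hypothetical names, and in a real pipeline the callback would be the place to call sb_stemmer_stem.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback: receives one lowercased token and its length. */
typedef void (*demo_token_fn)(const char *token, int len, void *ctx);

/* Split `text` on non-alphanumeric bytes, lowercase each token, and pass it
   to `fn`. Tokens longer than the buffer are truncated. Returns the number
   of tokens seen. */
int demo_tokenize(const char *text, demo_token_fn fn, void *ctx) {
    char buf[64];
    int len = 0, count = 0;
    for (const char *s = text; ; s++) {
        unsigned char ch = (unsigned char)*s;
        if (ch != '\0' && isalnum(ch)) {
            if (len < (int)sizeof buf - 1) buf[len++] = (char)tolower(ch);
        } else {
            if (len > 0) {          /* a token just ended */
                buf[len] = '\0';
                fn(buf, len, ctx);
                count++;
                len = 0;
            }
            if (ch == '\0') break;  /* end of input */
        }
    }
    return count;
}

/* Example callback: adds each token's length to an int passed via ctx. */
void demo_sum_lengths(const char *token, int len, void *ctx) {
    (void)token;
    *(int *)ctx += len;
}
```

Calling `demo_tokenize("Running, runs!", demo_sum_lengths, &total)` visits the tokens "running" and "runs". Passing the token sink as a function pointer keeps the splitter independent of whichever stemmer is chosen at index creation time.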

Code Reference

Source Location

Repository: Duckdb_Duckdb
Files (MANUAL_REVIEW APPROVED):
- third_party/snowball/libstemmer/libstemmer.cpp -- stemmer factory (96 lines)
- third_party/snowball/libstemmer/libstemmer.h -- public API (78 lines)
- third_party/snowball/libstemmer/modules.h -- module registry (182 lines)
Files (AUTO_KEEP):
- third_party/snowball/runtime/utilities.cpp -- runtime utilities (503 lines)
- third_party/snowball/src_c/stem_UTF_8_arabic.cpp (1676 lines)
- third_party/snowball/src_c/stem_UTF_8_basque.cpp (1184 lines)
- third_party/snowball/src_c/stem_UTF_8_catalan.cpp (1449 lines)
- third_party/snowball/src_c/stem_UTF_8_danish.cpp (318 lines)
- third_party/snowball/src_c/stem_UTF_8_dutch.cpp (613 lines)
- third_party/snowball/src_c/stem_UTF_8_english.cpp (1074 lines)
- third_party/snowball/src_c/stem_UTF_8_finnish.cpp (722 lines)
- third_party/snowball/src_c/stem_UTF_8_french.cpp (1262 lines)
- third_party/snowball/src_c/stem_UTF_8_german.cpp (500 lines)
- third_party/snowball/src_c/stem_UTF_8_german2.cpp (541 lines)
- third_party/snowball/src_c/stem_UTF_8_greek.cpp (3718 lines)
- third_party/snowball/src_c/stem_UTF_8_hindi.cpp (332 lines)
- third_party/snowball/src_c/stem_UTF_8_hungarian.cpp (868 lines)
- third_party/snowball/src_c/stem_UTF_8_indonesian.cpp (407 lines)
- third_party/snowball/src_c/stem_UTF_8_irish.cpp (479 lines)
- third_party/snowball/src_c/stem_UTF_8_italian.cpp (1030 lines)
- third_party/snowball/src_c/stem_UTF_8_kraaij_pohlmann.cpp (1591 lines)
- third_party/snowball/src_c/stem_UTF_8_lithuanian.cpp (837 lines)
- third_party/snowball/src_c/stem_UTF_8_lovins.cpp (1718 lines)
- third_party/snowball/src_c/stem_UTF_8_nepali.cpp (421 lines)
- third_party/snowball/src_c/stem_UTF_8_porter.cpp (723 lines)
- third_party/snowball/src_c/stem_UTF_8_portuguese.cpp (967 lines)
- third_party/snowball/src_c/stem_UTF_8_romanian.cpp (971 lines)
- third_party/snowball/src_c/stem_UTF_8_russian.cpp (678 lines)
- third_party/snowball/src_c/stem_UTF_8_serbian.cpp (6543 lines)
- third_party/snowball/src_c/stem_UTF_8_spanish.cpp (1045 lines)
- third_party/snowball/src_c/stem_UTF_8_tamil.cpp (1878 lines)
- third_party/snowball/src_c/stem_UTF_8_turkish.cpp (2096 lines)

Signature

// Public API (libstemmer.h)
typedef unsigned char sb_symbol;

const char **          sb_stemmer_list(void);
struct sb_stemmer *    sb_stemmer_new(const char * algorithm, const char * charenc);
void                   sb_stemmer_delete(struct sb_stemmer * stemmer);
const sb_symbol *      sb_stemmer_stem(struct sb_stemmer * stemmer,
                                       const sb_symbol * word, int size);
int                    sb_stemmer_length(struct sb_stemmer * stemmer);

// Internal runtime (runtime/api.h)
struct SN_env {
    symbol * p;
    int c; int l; int lb; int bra; int ket;
    symbol * * S;
    int * I;
};

struct SN_env * SN_create_env(int S_size, int I_size);
void            SN_close_env(struct SN_env * z, int S_size);
int             SN_set_current(struct SN_env * z, int size, const symbol * s);

// Per-language module interface (e.g., for English)
struct SN_env * english_UTF_8_create_env(void);
void            english_UTF_8_close_env(struct SN_env *);
int             english_UTF_8_stem(struct SN_env *);

// Runtime utilities (runtime/utilities.cpp)
symbol * create_s(void);
void     lose_s(symbol * p);
int      skip_utf8(const symbol * p, int c, int lb, int l, int n);
int      in_grouping_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);
int      in_grouping_b_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);

// Module registry types (modules.h)
typedef enum {
    ENC_UNKNOWN = 0,
    ENC_ISO_8859_1,
    ENC_ISO_8859_2,
    ENC_KOI8_R,
    ENC_UTF_8
} stemmer_encoding_t;

struct stemmer_modules {
    const char * name;
    stemmer_encoding_t enc;
    struct SN_env * (*create)(void);
    void (*close)(struct SN_env *);
    int (*stem)(struct SN_env *);
};
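
The skip_utf8 declaration above takes a buffer, a cursor, limits, and a character count. A forward-only sketch of that idea follows; it is written for illustration under the assumption of well-formed UTF-8 input, uses the hypothetical name `demo_skip_utf8_forward`, and omits the backward direction and the `lb` limit that the real helper also accepts.

```c
/* Hypothetical forward-only helper; not the library's skip_utf8.
   Advances cursor `c` by `n` UTF-8 characters within p[0..l).
   Returns the new cursor, or -1 if fewer than `n` characters remain. */
int demo_skip_utf8_forward(const unsigned char *p, int c, int l, int n) {
    for (; n > 0; n--) {
        if (c >= l) return -1;   /* ran out of input */
        unsigned char b = p[c++];
        if (b >= 0xC0) {         /* lead byte of a multi-byte sequence */
            /* consume continuation bytes of the form 10xxxxxx */
            while (c < l && (p[c] & 0xC0) == 0x80) c++;
        }
    }
    return c;
}
```

For the six bytes of "héllo" (where "é" is encoded as 0xC3 0xA9), skipping two characters from offset 0 moves the cursor to byte offset 3, which is why stemmers working on UTF-8 cannot treat byte offsets and character counts as interchangeable.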

Import

#include "libstemmer.h"            // Public API for stemmer creation and usage
#include "../runtime/api.h"        // Internal SN_env environment structures
#include "modules.h"               // Module registry with per-language function pointers

I/O Contract

Inputs

Name	Type	Required	Description
`algorithm`	`const char *`	Yes	Language name or ISO 639 code (e.g., "english", "en", "eng"). Case-sensitive, lowercase.
`charenc`	`const char *`	No	Character encoding name (e.g., "UTF_8"). Defaults to UTF-8 if NULL.
`word`	`const sb_symbol *`	Yes	Pointer to the input word bytes to be stemmed.
`size`	`int`	Yes	Length of the input word in bytes.

Outputs

Name	Type	Description
`sb_stemmer_new` return	`struct sb_stemmer *`	Opaque handle to a stemmer instance. NULL on failure (unknown algorithm, unsupported encoding, or out of memory).
`sb_stemmer_stem` return	`const sb_symbol *`	Pointer to the stemmed word (owned by the stemmer, null-terminated). NULL on out-of-memory error.
`sb_stemmer_length` return	`int`	Length in bytes of the most recently stemmed result.
`sb_stemmer_list` return	`const char **`	NULL-terminated array of canonical algorithm names (27 languages).
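
Because the pointer returned by sb_stemmer_stem is owned by the stemmer, a caller that needs the stem to outlive the next stemming call would typically copy it first. The helper below is a sketch of that pattern; `demo_copy_stem` is a hypothetical name and is not part of libstemmer.

```c
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes of a stem into a fresh NUL-terminated buffer.
   Returns NULL if the input is NULL, len is negative, or malloc fails.
   The caller releases the copy with free(). */
char *demo_copy_stem(const unsigned char *stem, int len) {
    char *copy;
    if (stem == NULL || len < 0) return NULL;
    copy = malloc((size_t)len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, stem, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

In use, `stem` would be the result of sb_stemmer_stem and `len` the value of sb_stemmer_length, both read before the stemmer is called again or deleted.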

Supported Languages

Language	ISO 639-1	ISO 639-2	Source File
Arabic	ar	ara	`stem_UTF_8_arabic.cpp`
Basque	eu	baq, eus	`stem_UTF_8_basque.cpp`
Catalan	ca	cat	`stem_UTF_8_catalan.cpp`
Danish	da	dan	`stem_UTF_8_danish.cpp`
Dutch	nl	dut, nld	`stem_UTF_8_dutch.cpp`
English	en	eng	`stem_UTF_8_english.cpp`
Finnish	fi	fin	`stem_UTF_8_finnish.cpp`
French	fr	fra, fre	`stem_UTF_8_french.cpp`
German	de	deu, ger	`stem_UTF_8_german.cpp`
Greek	el	ell, gre	`stem_UTF_8_greek.cpp`
Hindi	hi	hin	`stem_UTF_8_hindi.cpp`
Hungarian	hu	hun	`stem_UTF_8_hungarian.cpp`
Indonesian	id	ind	`stem_UTF_8_indonesian.cpp`
Irish	ga	gle	`stem_UTF_8_irish.cpp`
Italian	it	ita	`stem_UTF_8_italian.cpp`
Lithuanian	lt	lit	`stem_UTF_8_lithuanian.cpp`
Nepali	ne	nep	`stem_UTF_8_nepali.cpp`
Norwegian	no	nor	`stem_UTF_8_norwegian.cpp` (not in the Source Location file list above)
Porter	--	--	`stem_UTF_8_porter.cpp`
Portuguese	pt	por	`stem_UTF_8_portuguese.cpp`
Romanian	ro	ron, rum	`stem_UTF_8_romanian.cpp`
Russian	ru	rus	`stem_UTF_8_russian.cpp`
Serbian	sr	srp	`stem_UTF_8_serbian.cpp`
Spanish	es	esl, spa	`stem_UTF_8_spanish.cpp`
Swedish	sv	swe	`stem_UTF_8_swedish.cpp` (not in the Source Location file list above)
Tamil	ta	tam	`stem_UTF_8_tamil.cpp`
Turkish	tr	tur	`stem_UTF_8_turkish.cpp`

Additional stemmer source files (present in the repository but not registered in the module table by canonical name):

stem_UTF_8_german2.cpp -- Alternative German stemmer variant
stem_UTF_8_kraaij_pohlmann.cpp -- Kraaij-Pohlmann Dutch stemmer variant
stem_UTF_8_lovins.cpp -- Lovins stemmer algorithm

Usage Examples

#include <stdio.h>
#include <string.h>

#include "libstemmer.h"

int main(void) {
    // List all available stemming algorithms
    const char ** algorithms = sb_stemmer_list();
    for (int i = 0; algorithms[i] != NULL; i++) {
        printf("Algorithm: %s\n", algorithms[i]);
    }

    // Create an English stemmer (a NULL encoding selects UTF-8)
    struct sb_stemmer * stemmer = sb_stemmer_new("english", NULL);
    if (stemmer == NULL) {
        // Unknown algorithm, unsupported encoding, or out of memory
        return 1;
    }

    // Stem a word; the size argument is the length in bytes
    const char * word = "running";
    const sb_symbol * result = sb_stemmer_stem(stemmer,
        (const sb_symbol *)word, (int)strlen(word));
    if (result != NULL) {
        // The result is owned by the stemmer; do not free it
        int len = sb_stemmer_length(stemmer);
        printf("Stemmed: %.*s\n", len, (const char *)result);
        // Prints "Stemmed: run"
    }

    // Clean up
    sb_stemmer_delete(stemmer);
    return 0;
}

Related Pages

Principle:Duckdb_Duckdb_Text_Stemming
