Implementation:Duckdb Duckdb Snowball Stemmer
| Knowledge Sources | |
|---|---|
| Domains | Text_Processing, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
The Snowball Stemmer library provides algorithmic word stemming for 27 languages, used by DuckDB's full-text search (FTS) extension to reduce words to their root forms during indexing and querying.
Description
Snowball is a string-processing language designed for creating stemming algorithms. DuckDB embeds the libstemmer C implementation generated from the Snowball compiler. The library is organized into three layers:
- Public API (libstemmer.h/libstemmer.cpp) -- A factory interface that creates stemmer objects by algorithm name and character encoding. Callers use sb_stemmer_new to instantiate a stemmer, sb_stemmer_stem to reduce a word to its stem, and sb_stemmer_delete to free resources.
- Module Registry (modules.h) -- A static lookup table mapping language names (full names, ISO 639-2, and ISO 639-1 codes) to their corresponding create, close, and stem function pointers. The registry also defines the stemmer_encoding_t enum (supporting ENC_UTF_8).
- Runtime Utilities (runtime/utilities.cpp) -- Low-level helpers for UTF-8 character traversal (skip_utf8), character grouping checks, string replacement, and cursor management within the SN_env execution environment.
Each language stemmer is a generated C++ source file (e.g., stem_UTF_8_english.cpp) that implements the create_env, close_env, and stem functions for that language. The internal SN_env struct holds the working string buffer, cursor positions (c, l, lb, bra, ket), and auxiliary integer/string arrays used during rule evaluation.
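To make the role of the cursor fields more concrete, here is a small sketch of a suffix being marked and deleted. It is an illustration under simplifying assumptions rather than the generated code: `toy_env` is a hypothetical, fixed-buffer analogue of SN_env (the real struct uses a heap-allocated buffer), and the three helper functions are invented names that only mimic the general pattern of matching backward from the cursor, bracketing the match with bra and ket, and removing the bracketed slice.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, trimmed-down analogue of SN_env. */
struct toy_env {
    unsigned char p[64]; /* working string */
    int c;   /* cursor */
    int l;   /* forward limit (end of the string) */
    int lb;  /* backward limit (start of the string) */
    int bra; /* start of the marked slice */
    int ket; /* end of the marked slice (exclusive) */
};

/* Load a word and reset all cursors. Returns -1 if the word does not fit. */
static int toy_set_current(struct toy_env *z, const char *s) {
    size_t n = strlen(s);
    if (n >= sizeof z->p) return -1;
    memcpy(z->p, s, n);
    z->c = 0; z->lb = 0; z->l = (int)n;
    z->bra = 0; z->ket = (int)n;
    return 0;
}

/* Backward match: if `suffix` ends at the cursor, mark it as [bra, ket)
 * and move the cursor to its start. Returns 1 on a match, 0 otherwise. */
static int toy_mark_suffix(struct toy_env *z, const char *suffix) {
    int n = (int)strlen(suffix);
    if (z->c - z->lb < n) return 0;
    if (memcmp(z->p + z->c - n, suffix, (size_t)n) != 0) return 0;
    z->ket = z->c;
    z->c -= n;
    z->bra = z->c;
    return 1;
}

/* Delete the marked slice, shift the tail left, and fix up l and c. */
static void toy_slice_del(struct toy_env *z) {
    int n = z->ket - z->bra;
    assert(0 <= z->bra && z->bra <= z->ket && z->ket <= z->l);
    memmove(z->p + z->bra, z->p + z->ket, (size_t)(z->l - z->ket));
    z->l -= n;
    if (z->c >= z->ket) z->c -= n;
    else if (z->c > z->bra) z->c = z->bra;
    z->ket = z->bra;
}
```

Loading "walking", moving the cursor to the end (`z.c = z.l`), marking the suffix "ing", and deleting the slice leaves the four bytes "walk" in the buffer with `l` equal to 4.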
Usage
DuckDB uses the Snowball stemmer library in its Full-Text Search (FTS) extension. When creating a full-text index, DuckDB tokenizes text and passes each token through the appropriate language stemmer to produce root forms. This normalization allows a query for "running" to match documents containing "run" or "runs". Irregular forms such as "ran" are not conflated, because the stemmers strip suffixes by rule rather than consulting a dictionary. The stemmer is selected by language name at index creation time, and any of the 27 registered algorithms can be chosen.
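The effect of stemming on matching can be shown with a deliberately crude stand-in. The sketch below is not the Snowball English algorithm and does not reflect how the FTS extension is implemented: `toy_stem` and `toy_same_stem` are hypothetical helpers that strip only "ing" and a trailing "s", and they will mis-stem many real words. The point is only that two tokens match when their stems are equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy suffix stripper (NOT Snowball). Writes a NUL-terminated stem into
 * `out` and returns its length, or -1 if `word` does not fit in `cap`. */
static int toy_stem(const char *word, char *out, size_t cap) {
    size_t n = strlen(word);
    assert(out != NULL);
    if (n + 1 > cap) return -1;
    memcpy(out, word, n + 1);
    if (n > 4 && strcmp(out + n - 3, "ing") == 0) {
        n -= 3;
        /* Undouble any repeated final letter: "runn" -> "run". */
        if (n >= 2 && out[n - 1] == out[n - 2]) n--;
    } else if (n > 2 && out[n - 1] == 's' && out[n - 2] != 's') {
        n--; /* plural or third-person "s" */
    }
    out[n] = '\0';
    return (int)n;
}

/* A query term matches an indexed token when both reduce to the same stem. */
static int toy_same_stem(const char *query, const char *token) {
    char a[64], b[64];
    if (toy_stem(query, a, sizeof a) < 0) return 0;
    if (toy_stem(token, b, sizeof b) < 0) return 0;
    return strcmp(a, b) == 0;
}
```

Under these toy rules, `toy_same_stem("running", "runs")` returns 1 because both sides reduce to "run". An index built on stems therefore stores one entry where an unstemmed index would store several.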
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files (MANUAL_REVIEW APPROVED):
- third_party/snowball/libstemmer/libstemmer.cpp -- stemmer factory (96 lines)
- third_party/snowball/libstemmer/libstemmer.h -- public API (78 lines)
- third_party/snowball/libstemmer/modules.h -- module registry (182 lines)
- Files (AUTO_KEEP):
- third_party/snowball/runtime/utilities.cpp -- runtime utilities (503 lines)
- third_party/snowball/src_c/stem_UTF_8_arabic.cpp (1676 lines)
- third_party/snowball/src_c/stem_UTF_8_basque.cpp (1184 lines)
- third_party/snowball/src_c/stem_UTF_8_catalan.cpp (1449 lines)
- third_party/snowball/src_c/stem_UTF_8_danish.cpp (318 lines)
- third_party/snowball/src_c/stem_UTF_8_dutch.cpp (613 lines)
- third_party/snowball/src_c/stem_UTF_8_english.cpp (1074 lines)
- third_party/snowball/src_c/stem_UTF_8_finnish.cpp (722 lines)
- third_party/snowball/src_c/stem_UTF_8_french.cpp (1262 lines)
- third_party/snowball/src_c/stem_UTF_8_german.cpp (500 lines)
- third_party/snowball/src_c/stem_UTF_8_german2.cpp (541 lines)
- third_party/snowball/src_c/stem_UTF_8_greek.cpp (3718 lines)
- third_party/snowball/src_c/stem_UTF_8_hindi.cpp (332 lines)
- third_party/snowball/src_c/stem_UTF_8_hungarian.cpp (868 lines)
- third_party/snowball/src_c/stem_UTF_8_indonesian.cpp (407 lines)
- third_party/snowball/src_c/stem_UTF_8_irish.cpp (479 lines)
- third_party/snowball/src_c/stem_UTF_8_italian.cpp (1030 lines)
- third_party/snowball/src_c/stem_UTF_8_kraaij_pohlmann.cpp (1591 lines)
- third_party/snowball/src_c/stem_UTF_8_lithuanian.cpp (837 lines)
- third_party/snowball/src_c/stem_UTF_8_lovins.cpp (1718 lines)
- third_party/snowball/src_c/stem_UTF_8_nepali.cpp (421 lines)
- third_party/snowball/src_c/stem_UTF_8_porter.cpp (723 lines)
- third_party/snowball/src_c/stem_UTF_8_portuguese.cpp (967 lines)
- third_party/snowball/src_c/stem_UTF_8_romanian.cpp (971 lines)
- third_party/snowball/src_c/stem_UTF_8_russian.cpp (678 lines)
- third_party/snowball/src_c/stem_UTF_8_serbian.cpp (6543 lines)
- third_party/snowball/src_c/stem_UTF_8_spanish.cpp (1045 lines)
- third_party/snowball/src_c/stem_UTF_8_tamil.cpp (1878 lines)
- third_party/snowball/src_c/stem_UTF_8_turkish.cpp (2096 lines)
Signature
// Public API (libstemmer.h)
typedef unsigned char sb_symbol;
const char ** sb_stemmer_list(void);
struct sb_stemmer * sb_stemmer_new(const char * algorithm, const char * charenc);
void sb_stemmer_delete(struct sb_stemmer * stemmer);
const sb_symbol * sb_stemmer_stem(struct sb_stemmer * stemmer,
const sb_symbol * word, int size);
int sb_stemmer_length(struct sb_stemmer * stemmer);
// Internal runtime (runtime/api.h)
struct SN_env {
symbol * p;
int c; int l; int lb; int bra; int ket;
symbol * * S;
int * I;
};
struct SN_env * SN_create_env(int S_size, int I_size);
void SN_close_env(struct SN_env * z, int S_size);
int SN_set_current(struct SN_env * z, int size, const symbol * s);
// Per-language module interface (e.g., for English)
struct SN_env * english_UTF_8_create_env(void);
void english_UTF_8_close_env(struct SN_env *);
int english_UTF_8_stem(struct SN_env *);
// Runtime utilities (runtime/utilities.cpp)
symbol * create_s(void);
void lose_s(symbol * p);
int skip_utf8(const symbol * p, int c, int lb, int l, int n);
int in_grouping_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);
int in_grouping_b_U(struct SN_env * z, const unsigned char * s, int min, int max, int repeat);
// Module registry types (modules.h)
typedef enum {
ENC_UNKNOWN = 0,
ENC_ISO_8859_1,
ENC_ISO_8859_2,
ENC_KOI8_R,
ENC_UTF_8
} stemmer_encoding_t;
struct stemmer_modules {
const char * name;
stemmer_encoding_t enc;
struct SN_env * (*create)(void);
void (*close)(struct SN_env *);
int (*stem)(struct SN_env *);
};
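Two of the runtime helpers listed above rest on simple byte-level ideas, sketched here in simplified form. The functions are hypothetical `toy_` analogues and assumptions about the general technique, not copies of utilities.cpp: the first advances a byte offset by whole UTF-8 characters (forward direction only, unlike skip_utf8, which also takes a backward limit), and the second tests membership of a code point in a character group stored as a bitmap indexed from `min`.

```c
#include <assert.h>
#include <stddef.h>

/* Advance `n` characters from byte offset `c` without passing limit `l`.
 * Returns the new offset, or -1 if fewer than `n` characters remain. */
static int toy_skip_utf8_forward(const unsigned char *p, int c, int l, int n) {
    assert(p != NULL && n >= 0);
    for (; n > 0; n--) {
        if (c >= l) return -1;
        if (p[c++] >= 0xC0) {
            /* Lead byte of a multi-byte sequence: consume 10xxxxxx bytes. */
            while (c < l && (p[c] & 0xC0) == 0x80) c++;
        }
    }
    return c;
}

/* Bitmap for the group {a, e, i, o, u} with min = 'a' and max = 'u':
 * bit (ch - 'a') is set for each member. */
static const unsigned char toy_vowels[] = { 17, 65, 16 };

/* Returns 1 if `ch` lies in [min, max] and its bit is set in `s`. */
static int toy_in_group(const unsigned char *s, int min, int max, int ch) {
    if (ch < min || ch > max) return 0;
    ch -= min;
    return (s[ch >> 3] & (1 << (ch & 7))) != 0;
}
```

For the six-byte UTF-8 encoding of "naïve", skipping three characters from offset 0 lands on offset 4, since the "ï" occupies two bytes. The bitmap form keeps each group test to one shift and one mask regardless of group size.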
Import
#include "libstemmer.h" // Public API for stemmer creation and usage
#include "../runtime/api.h" // Internal SN_env environment structures
#include "modules.h" // Module registry with per-language function pointers
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| algorithm | const char * | Yes | Language name or ISO 639 code (e.g., "english", "en", "eng"). Case-sensitive, lowercase. |
| charenc | const char * | No | Character encoding name (e.g., "UTF_8"). Defaults to UTF-8 if NULL. |
| word | const sb_symbol * | Yes | Pointer to the input word bytes to be stemmed. |
| size | int | Yes | Length of the input word in bytes. |
Outputs
| Name | Type | Description |
|---|---|---|
| sb_stemmer_new return | struct sb_stemmer * | Opaque handle to a stemmer instance. NULL on failure (unknown algorithm or out of memory). |
| sb_stemmer_stem return | const sb_symbol * | Pointer to the stemmed word (owned by the stemmer, null-terminated). NULL on out-of-memory error. |
| sb_stemmer_length return | int | Length in bytes of the most recently stemmed result. |
| sb_stemmer_list return | const char ** | Null-terminated array of canonical algorithm names (27 languages). |
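Because the pointer returned by sb_stemmer_stem is owned by the stemmer, a caller that needs to keep a stem should copy it out first; the upstream libstemmer header notes that the result becomes invalid when the stemmer is called again or deleted. A minimal sketch of such a copy follows. `toy_copy_stem` is a hypothetical helper, not part of libstemmer; it takes the pointer and the length reported by sb_stemmer_length as plain arguments so that it compiles without the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes of a stemmer-owned result into a freshly allocated,
 * NUL-terminated buffer that the caller must free(). Returns NULL if
 * `stem` is NULL or the allocation fails. */
static char *toy_copy_stem(const unsigned char *stem, int len) {
    char *copy;
    assert(len >= 0);
    if (stem == NULL) return NULL;
    copy = (char *)malloc((size_t)len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, stem, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

A typical call site would pass the result of sb_stemmer_stem together with sb_stemmer_length(stemmer), then stem the next token without worrying about the earlier result being overwritten.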
Supported Languages
| Language | ISO 639-1 | ISO 639-2 | Source File |
|---|---|---|---|
| Arabic | ar | ara | stem_UTF_8_arabic.cpp |
| Basque | eu | baq, eus | stem_UTF_8_basque.cpp |
| Catalan | ca | cat | stem_UTF_8_catalan.cpp |
| Danish | da | dan | stem_UTF_8_danish.cpp |
| Dutch | nl | dut, nld | stem_UTF_8_dutch.cpp |
| English | en | eng | stem_UTF_8_english.cpp |
| Finnish | fi | fin | stem_UTF_8_finnish.cpp |
| French | fr | fra, fre | stem_UTF_8_french.cpp |
| German | de | deu, ger | stem_UTF_8_german.cpp |
| Greek | el | ell, gre | stem_UTF_8_greek.cpp |
| Hindi | hi | hin | stem_UTF_8_hindi.cpp |
| Hungarian | hu | hun | stem_UTF_8_hungarian.cpp |
| Indonesian | id | ind | stem_UTF_8_indonesian.cpp |
| Irish | ga | gle | stem_UTF_8_irish.cpp |
| Italian | it | ita | stem_UTF_8_italian.cpp |
| Lithuanian | lt | lit | stem_UTF_8_lithuanian.cpp |
| Nepali | ne | nep | stem_UTF_8_nepali.cpp |
| Norwegian | no | nor | stem_UTF_8_norwegian.cpp (not listed as separate src_c file) |
| Porter | -- | -- | stem_UTF_8_porter.cpp |
| Portuguese | pt | por | stem_UTF_8_portuguese.cpp |
| Romanian | ro | ron, rum | stem_UTF_8_romanian.cpp |
| Russian | ru | rus | stem_UTF_8_russian.cpp |
| Serbian | sr | srp | stem_UTF_8_serbian.cpp |
| Spanish | es | esl, spa | stem_UTF_8_spanish.cpp |
| Swedish | sv | swe | stem_UTF_8_swedish.cpp (not listed as separate src_c file) |
| Tamil | ta | tam | stem_UTF_8_tamil.cpp |
| Turkish | tr | tur | stem_UTF_8_turkish.cpp |
Additional stemmer source files (present in the repository but not registered in the module table by canonical name):
- stem_UTF_8_german2.cpp -- Alternative German stemmer variant
- stem_UTF_8_kraaij_pohlmann.cpp -- Kraaij-Pohlmann Dutch stemmer variant
- stem_UTF_8_lovins.cpp -- Lovins stemmer algorithm
Usage Examples
#include <stdio.h>
#include <string.h>
#include "libstemmer.h"

int main(void) {
    // List all available stemming algorithms
    const char ** algorithms = sb_stemmer_list();
    for (int i = 0; algorithms[i] != NULL; i++) {
        printf("Algorithm: %s\n", algorithms[i]);
    }
    // Create an English stemmer (UTF-8 encoding by default)
    struct sb_stemmer * stemmer = sb_stemmer_new("english", NULL);
    if (stemmer == NULL) {
        // Algorithm not found or out of memory
        return 1;
    }
    // Stem a word; the size argument is an int byte count
    const char * word = "running";
    const sb_symbol * result = sb_stemmer_stem(stemmer,
        (const sb_symbol *)word, (int)strlen(word));
    if (result != NULL) {
        // The result is owned by the stemmer; do not free or modify it
        int len = sb_stemmer_length(stemmer);
        printf("Stemmed: %.*s\n", len, (const char *)result);
        // Prints "Stemmed: run"
    }
    // Clean up
    sb_stemmer_delete(stemmer);
    return 0;
}