Implementation:Vespa engine Vespa SimpleTokenizer Tokenize
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by Vespa's linguistics library, for breaking text into annotated tokens at character-type boundaries. It integrates normalization, transformation, stemming, and special token recognition into a unified tokenization pipeline.
Description
The SimpleTokenizer class implements the Tokenizer interface and provides Vespa's default tokenization pipeline. It splits input text into tokens at character-type transitions (letter-to-digit, digit-to-symbol, etc.) and applies a chain of processing steps to each token.
The tokenizer maintains references to several collaborating components:
- Normalizer: A Normalizer instance (typically SimpleNormalizer) for Unicode NFKC normalization.
- Transformer: A Transformer instance (typically SimpleTransformer) for accent dropping and case folding.
- KStemmer: A KStemmer instance for English stemming.
- SpecialTokenRegistry: A registry of predefined tokens (URLs, product codes, etc.) that should be recognized as single units rather than being split.
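To make the role of a special token registry concrete, here is a minimal, standalone sketch that uses only the Java standard library. It is not Vespa's SpecialTokenRegistry: the class name SpecialTokenSketch, the longest-match-first rule, and the simplified fallback splitting are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical stand-in, not Vespa's SpecialTokenRegistry. */
final class SpecialTokenSketch {

    /**
     * Emits registered special tokens as single units; everything else is
     * split into runs of letters/digits, and other characters are skipped.
     */
    static List<String> tokenize(String input, List<String> specialTokens) {
        // Assumption: prefer the longest special token when several match.
        List<String> sorted = new ArrayList<>(specialTokens);
        sorted.removeIf(String::isEmpty);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String special = matchAt(input, pos, sorted);
            if (special != null) {            // special token found: keep it whole
                out.add(special);
                pos += special.length();
            } else if (Character.isLetterOrDigit(input.charAt(pos))) {
                int end = pos;                // ordinary word: consume the run
                while (end < input.length() && Character.isLetterOrDigit(input.charAt(end))) end++;
                out.add(input.substring(pos, end));
                pos = end;
            } else {
                pos++;                        // separator: skip
            }
        }
        return out;
    }

    private static String matchAt(String input, int pos, List<String> sorted) {
        for (String token : sorted) {
            if (input.startsWith(token, pos)) return token;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("learn c++ and .net now", List.of("c++", ".net")));
    }
}
```

Without the registry, a name such as "c++" would lose its symbols in this sketch; with it, the name survives as one token.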
The tokenize method is a thin dispatch method that delegates to an internal tokenization engine with a token processor callback. The actual work happens in two phases:
- Token splitting: The input text is scanned character by character. At each transition between character types (as determined by Unicode general category), a new token boundary is created.
- Token processing: Each token is passed through the processToken method, which applies normalization, accent dropping, lowercasing, and stemming based on the LinguisticsParameters settings.
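The splitting phase described above can be sketched in plain Java as follows. This is a simplified illustration, not the SimpleTokenizer source: the class TypeBoundarySplitter, the four-way Kind classification, and the Span holder are hypothetical names, and Vespa's actual token types and boundary rules may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified stand-in for character-type boundary splitting. */
final class TypeBoundarySplitter {

    /** Coarse character classes (hypothetical; Vespa defines its own token types). */
    enum Kind { LETTER, DIGIT, SPACE, OTHER }

    /** A run of same-kind characters plus its start offset in the input. */
    static final class Span {
        final String text;
        final Kind kind;
        final int offset;
        Span(String text, Kind kind, int offset) {
            this.text = text;
            this.kind = kind;
            this.offset = offset;
        }
    }

    static Kind kindOf(int codePoint) {
        if (Character.isLetter(codePoint)) return Kind.LETTER;
        if (Character.isDigit(codePoint)) return Kind.DIGIT;
        if (Character.isWhitespace(codePoint)) return Kind.SPACE;
        return Kind.OTHER;
    }

    /** Scans code point by code point and cuts a new span whenever the kind changes. */
    static List<Span> split(String input) {
        List<Span> spans = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            Kind kind = kindOf(input.codePointAt(start));
            int end = start + Character.charCount(input.codePointAt(start));
            while (end < input.length() && kindOf(input.codePointAt(end)) == kind) {
                end += Character.charCount(input.codePointAt(end));
            }
            spans.add(new Span(input.substring(start, end), kind, start));
            start = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : split("abc123 !")) {
            System.out.println(s.offset + " " + s.kind + " '" + s.text + "'");
        }
    }
}
```

Each span keeps its start offset, mirroring the positional information that the real Token objects carry.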
The LinguisticsParameters object controls which processing steps are applied:
- Whether stemming is enabled.
- Whether accent removal is enabled.
- The target language for language-specific processing.
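As a rough illustration of how such flags can gate the processing steps, the sketch below applies NFKC normalization, optional accent removal, lowercasing, and optional stemming using only the Java standard library. The class TokenProcessingSketch and its method signature are assumptions made for this example; the stemmer is passed in as a function because the real KStemmer algorithm is not reproduced here.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative only: flag-controlled token processing, not Vespa's processToken. */
final class TokenProcessingSketch {

    /**
     * @param token         the raw token text
     * @param removeAccents whether to strip diacritical marks
     * @param stem          whether to apply the supplied stemmer
     * @param stemmer       caller-supplied stemming function (hypothetical stand-in for KStemmer)
     */
    static String process(String token, boolean removeAccents, boolean stem,
                          UnaryOperator<String> stemmer) {
        // Step 1: Unicode NFKC normalization (e.g. the ligature U+FB01 becomes "fi").
        String result = Normalizer.normalize(token, Normalizer.Form.NFKC);

        // Step 2: optional accent removal: decompose, then drop combining marks.
        if (removeAccents) {
            result = Normalizer.normalize(result, Normalizer.Form.NFD)
                               .replaceAll("\\p{M}+", "");
        }

        // Step 3: lowercasing, locale-independent.
        result = result.toLowerCase(Locale.ROOT);

        // Step 4: optional stemming.
        return stem ? stemmer.apply(result) : result;
    }

    public static void main(String[] args) {
        // A deliberately naive suffix stripper, used only to exercise the flag.
        UnaryOperator<String> naive = s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
        System.out.println(process("Caf\u00e9s", true, true, naive));
    }
}
```

The language parameter is left out of the sketch; in a real pipeline it would select language-specific behavior such as the choice of stemmer.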
Usage
Use SimpleTokenizer.tokenize() for converting document or query text into terms for indexing or matching. This is the central component of Vespa's text processing pipeline and is invoked automatically by the indexing and search infrastructure.
Direct usage is appropriate when:
- Building custom text processing workflows outside the standard Vespa indexing pipeline.
- Testing tokenization behavior on specific inputs.
- Integrating Vespa's tokenization into external tools or analysis scripts.
Code Reference
Source Location
- Repository: Vespa
- File: linguistics/src/main/java/com/yahoo/language/simple/SimpleTokenizer.java
- Lines: 54-57
Signature
@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters)
Class Declaration
public class SimpleTokenizer implements Tokenizer
Package
package com.yahoo.language.simple;
Key Fields
private final Normalizer normalizer;
private final Transformer transformer;
private final KStemmer stemmer;
private final SpecialTokenRegistry specialTokenRegistry;
Method Body
@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters) {
return tokenize(input, token -> processToken(token, parameters));
}
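The delegation pattern in this method body (a public overload that binds the parameters into a callback and hands off to an internal engine) can be sketched in isolation. Everything below is hypothetical: DispatchSketch, its Params holder with a single lowercase flag, and the whitespace-based engine are stand-ins chosen so the example runs without Vespa on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative only: the "thin dispatch plus callback" shape, not Vespa code. */
final class DispatchSketch {

    /** Hypothetical stand-in for a parameters object; only one flag is modeled. */
    static final class Params {
        final boolean lowercase;
        Params(boolean lowercase) { this.lowercase = lowercase; }
    }

    /** Entry point: captures the parameters in a lambda and delegates. */
    static List<String> tokenize(String input, Params params) {
        return tokenize(input, token -> processToken(token, params));
    }

    /** Internal engine: splits on whitespace and applies the callback to each token. */
    private static List<String> tokenize(String input, UnaryOperator<String> processor) {
        List<String> out = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            if (!raw.isEmpty()) out.add(processor.apply(raw));
        }
        return out;
    }

    /** Per-token processing, driven entirely by the captured parameters. */
    private static String processToken(String token, Params params) {
        return params.lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The Quick Fox", new Params(true)));
    }
}
```

Keeping the engine independent of the parameters object means the same splitting code can be reused with a different per-token callback.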
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | String | Yes | The text to tokenize. Can be any length; the tokenizer processes it character by character. |
| parameters | LinguisticsParameters | Yes | Controls tokenization behavior including: the target language, whether stemming is enabled, whether accent removal is enabled, and the stemming mode. |
LinguisticsParameters Fields
| Field | Type | Description |
|---|---|---|
| language | Language | The language of the input text, used for language-specific stemming and transformation. |
| stemMode | StemMode | Controls stemming behavior: NONE, DEFAULT, ALL, SHORTEST, or BEST. |
| removeAccents | boolean | Whether to strip diacritical marks from tokens. |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | Iterable<Token> | A lazy iterable of tokens. Each Token carries: the original string, the processed (normalized/stemmed) string, the token type (ALPHABETIC, NUMERIC, SYMBOL, SPACE, etc.), and positional information (offset and length in the original input). |
Usage Examples
Basic Usage
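import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();
// Constructor arguments follow the fields documented above; check the
// LinguisticsParameters class in your Vespa version for the exact signature.
LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.DEFAULT,
    true  // removeAccents
);
Iterable<Token> tokens = tokenizer.tokenize("The quick brown foxes jumped!", params);
for (Token token : tokens) {
    // Print the original text, the processed text, and the token type.
    System.out.println(token.getOrig() + " -> " + token.getTokenString()
        + " [" + token.getType() + "]");
}
// Expected output (whitespace tokens, if emitted, are omitted here):
// The -> the [ALPHABETIC]
// quick -> quick [ALPHABETIC]
// brown -> brown [ALPHABETIC]
// foxes -> fox [ALPHABETIC] (stemmed)
// jumped -> jump [ALPHABETIC] (stemmed)
// ! -> ! [PUNCTUATION]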
Without Stemming
import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();
LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.NONE,
    false  // do not remove accents
);
Iterable<Token> tokens = tokenizer.tokenize("café résumé", params);
for (Token token : tokens) {
    System.out.println(token.getOrig() + " -> " + token.getTokenString());
}
// Expected output preserves accents and does not stem
// (whitespace tokens, if emitted, are omitted here):
// café -> café
// résumé -> résumé
With Custom Normalizer and Transformer
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTokenizer;
import com.yahoo.language.simple.SimpleTransformer;

// Construct the tokenizer with explicit dependencies
SimpleNormalizer normalizer = new SimpleNormalizer();
SimpleTransformer transformer = new SimpleTransformer();
SimpleTokenizer tokenizer = new SimpleTokenizer(normalizer, transformer);

// Tokenize with the configured pipeline.
// inputText (String) and params (LinguisticsParameters) are assumed to be
// defined as in the earlier examples.
Iterable<Token> tokens = tokenizer.tokenize(inputText, params);