Implementation:Vespa engine Vespa SimpleTokenizer Tokenize

Knowledge Sources	Vespa
Domains	NLP, Text_Processing
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by Vespa's linguistics library, for breaking text into annotated tokens at character-type boundaries. It integrates normalization, transformation, stemming, and special-token recognition into a unified tokenization pipeline.

Description

The SimpleTokenizer class implements the Tokenizer interface and provides Vespa's default tokenization pipeline. It splits input text into tokens at character-type transitions (letter-to-digit, digit-to-symbol, etc.) and applies a chain of processing steps to each token.

The tokenizer maintains references to several collaborating components:

Normalizer: A Normalizer instance (typically SimpleNormalizer) for Unicode NFKC normalization.
Transformer: A Transformer instance (typically SimpleTransformer) for accent dropping and case folding.
KStemmer: A KStemmer instance for English stemming.
SpecialTokenRegistry: A registry of predefined tokens (URLs, product codes, etc.) that should be recognized as single units rather than being split.
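To illustrate what "recognized as single units" can mean, here is a minimal sketch of longest-match lookup at a given position. It is an assumption-laden illustration, not Vespa's SpecialTokenRegistry: the class name `SpecialTokenSketch`, the method `longestMatchAt`, and the plain `List<String>` registry are all hypothetical.

```java
import java.util.List;

// Illustrative sketch only; hypothetical names, not Vespa's SpecialTokenRegistry API.
final class SpecialTokenSketch {

    // Returns the longest special token that starts at `offset` in `input`,
    // or null when no special token starts there.
    static String longestMatchAt(String input, int offset, List<String> specialTokens) {
        String best = null;
        for (String candidate : specialTokens) {
            boolean matches = input.startsWith(candidate, offset);
            if (matches && (best == null || candidate.length() > best.length())) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> specials = List.of("c", "c++");
        // At offset 4 both "c" and "c++" match; the longer one wins.
        System.out.println(longestMatchAt("use c++ or c", 4, specials));
    }
}
```

A tokenizer built this way would check for a special token before falling back to character-type splitting, so that "c++" survives as one token instead of being split into a letter and two symbols.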

The tokenize method is a thin dispatch method that delegates to an internal tokenization engine with a token processor callback. The actual work happens in two phases:

Token splitting: The input text is scanned character by character. At each transition between character types (as determined by Unicode general category), a new token boundary is created.
Token processing: Each token is passed through the processToken method, which applies normalization, accent dropping, lowercasing, and stemming based on the LinguisticsParameters settings.
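The splitting phase can be sketched in isolation. The example below is a simplified, standalone illustration and not the SimpleTokenizer source: the `CharTypeSplitter` class, its `Kind` enum, and the `Piece` holder are hypothetical, and the character classes are coarser than the ones Vespa uses.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; hypothetical names, not Vespa's implementation.
final class CharTypeSplitter {

    // Coarse character classes, assumed for this example.
    enum Kind { ALPHABETIC, NUMERIC, SPACE, OTHER }

    // A run of same-kind characters plus its offset in the original input.
    static final class Piece {
        final String text;
        final Kind kind;
        final int offset;

        Piece(String text, Kind kind, int offset) {
            this.text = text;
            this.kind = kind;
            this.offset = offset;
        }
    }

    static Kind kindOf(int codePoint) {
        if (Character.isLetter(codePoint)) return Kind.ALPHABETIC;
        if (Character.isDigit(codePoint)) return Kind.NUMERIC;
        if (Character.isWhitespace(codePoint)) return Kind.SPACE;
        return Kind.OTHER;
    }

    // Scans the input once and closes a piece at every change of character kind.
    static List<Piece> split(String input) {
        List<Piece> pieces = new ArrayList<>();
        int start = 0;
        Kind current = null;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            Kind kind = kindOf(codePoint);
            if (current != null && kind != current) {
                pieces.add(new Piece(input.substring(start, i), current, start));
                start = i;
            }
            current = kind;
            i += Character.charCount(codePoint);
        }
        if (current != null) {
            pieces.add(new Piece(input.substring(start), current, start));
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (Piece p : split("abc123 x!")) {
            System.out.println(p.offset + " " + p.kind + " '" + p.text + "'");
        }
    }
}
```

For the input `abc123 x!` this sketch yields five pieces (`abc`, `123`, a space, `x`, `!`), each carrying the offset needed to map it back to the original text. Note that it groups any run of `OTHER` characters into one piece, which may differ from how the real tokenizer treats punctuation and symbols.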

The LinguisticsParameters object controls which processing steps are applied:

Whether stemming is enabled.
Whether accent removal is enabled.
The target language for language-specific processing.
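As a rough illustration of how such settings could gate the per-token steps, the sketch below applies NFKC normalization, optional accent removal, lowercasing, and optional stemming in that order. It uses only the JDK's `java.text.Normalizer`; the class `TokenProcessingSketch`, the `process` method, its boolean flags, and the pluggable stemmer are assumptions made for this example and do not reproduce Vespa's processToken or KStemmer.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustrative sketch only; hypothetical names, not Vespa's processToken.
final class TokenProcessingSketch {

    static String process(String token,
                          boolean removeAccents,
                          boolean stem,
                          UnaryOperator<String> stemmer) {
        // Step 1: Unicode NFKC normalization (e.g. folds ligatures and width variants).
        String result = Normalizer.normalize(token, Normalizer.Form.NFKC);

        // Step 2 (optional): decompose, then drop combining marks to strip accents.
        if (removeAccents) {
            result = Normalizer.normalize(result, Normalizer.Form.NFD)
                               .replaceAll("\\p{M}+", "");
        }

        // Step 3: locale-independent lowercasing.
        result = result.toLowerCase(Locale.ROOT);

        // Step 4 (optional): stemming, delegated to whatever stemmer the caller supplies.
        if (stem) {
            result = stemmer.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder stemmer for demonstration; a real stemmer is far more involved.
        UnaryOperator<String> toyStemmer =
                s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

        System.out.println(process("Caf\u00e9", true, false, toyStemmer));
        System.out.println(process("Cats", false, true, toyStemmer));
    }
}
```

Passing the stemmer as a function keeps the sketch independent of any particular stemming library; in Vespa that role is filled by the KStemmer collaborator described above.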

Usage

Use SimpleTokenizer.tokenize() for converting document or query text into terms for indexing or matching. This is the central component of Vespa's text processing pipeline and is invoked automatically by the indexing and search infrastructure.

Direct usage is appropriate when:

Building custom text processing workflows outside the standard Vespa indexing pipeline.
Testing tokenization behavior on specific inputs.
Integrating Vespa's tokenization into external tools or analysis scripts.

Code Reference

Source Location

Repository: Vespa
File: linguistics/src/main/java/com/yahoo/language/simple/SimpleTokenizer.java
Lines: 54-57

Signature

@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters)

Class Declaration

public class SimpleTokenizer implements Tokenizer

Package

package com.yahoo.language.simple;

Key Fields

private final Normalizer normalizer;
private final Transformer transformer;
private final KStemmer stemmer;
private final SpecialTokenRegistry specialTokenRegistry;

Method Body

@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters) {
    return tokenize(input, token -> processToken(token, parameters));
}

I/O Contract

Inputs

Name	Type	Required	Description
input	`String`	Yes	The text to tokenize. Can be any length; the tokenizer processes it character by character.
parameters	`LinguisticsParameters`	Yes	Controls tokenization behavior: the target language, the stemming mode (including whether stemming is applied at all), and whether accent removal is enabled.

LinguisticsParameters Fields

Field	Type	Description
language	`Language`	The language of the input text, used for language-specific stemming and transformation.
stemMode	`StemMode`	Controls stemming behavior: `NONE`, `DEFAULT`, `ALL`, `SHORTEST`, or `BEST`.
removeAccents	`boolean`	Whether to strip diacritical marks from tokens.

Outputs

Name	Type	Description
(return value)	`Iterable<Token>`	An iterable of tokens. Each `Token` carries: the original string, the processed (normalized/stemmed) string, the token type (ALPHABETIC, NUMERIC, SYMBOL, SPACE, etc.), and positional information (its offset in the original input, with the length given by the original string).

Usage Examples

Basic Usage

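import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleTokenizer;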

SimpleTokenizer tokenizer = new SimpleTokenizer();

LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.DEFAULT,
    true  // removeAccents
);

Iterable<Token> tokens = tokenizer.tokenize("The quick brown foxes jumped!", params);

for (Token token : tokens) {
    System.out.println(token.getOrig() + " -> " + token.getTokenString()
                       + " [" + token.getType() + "]");
}
// Illustrative output (whitespace tokens omitted; exact stems depend on the stemmer):
// The -> the [ALPHABETIC]
// quick -> quick [ALPHABETIC]
// brown -> brown [ALPHABETIC]
// foxes -> fox [ALPHABETIC]      (stemmed)
// jumped -> jump [ALPHABETIC]    (stemmed)
// ! -> ! [PUNCTUATION]

Without Stemming

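import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleTokenizer;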

SimpleTokenizer tokenizer = new SimpleTokenizer();

LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.NONE,
    false  // do not remove accents
);

Iterable<Token> tokens = tokenizer.tokenize("café résumé", params);

for (Token token : tokens) {
    System.out.println(token.getOrig() + " -> " + token.getTokenString()
                       + " [" + token.getType() + "]");
}
// Illustrative output (whitespace token omitted); accents are preserved and nothing is stemmed:
// café -> café [ALPHABETIC]
// résumé -> résumé [ALPHABETIC]

With Custom Normalizer and Transformer

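import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTokenizer;
import com.yahoo.language.simple.SimpleTransformer;

// Construct the tokenizer with explicit dependencies
SimpleNormalizer normalizer = new SimpleNormalizer();
SimpleTransformer transformer = new SimpleTransformer();

SimpleTokenizer tokenizer = new SimpleTokenizer(normalizer, transformer);

// Parameters and input, following the Basic Usage example
LinguisticsParameters params = new LinguisticsParameters(Language.ENGLISH, StemMode.DEFAULT, true);
String inputText = "The quick brown foxes jumped!";

// Tokenize with the configured pipeline
Iterable<Token> tokens = tokenizer.tokenize(inputText, params);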

Related Pages

Implements Principle

Principle:Vespa_engine_Vespa_Tokenization
