Implementation:Vespa engine Vespa SimpleTokenizer Tokenize

From Leeroopedia
Revision as of 17:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Vespa_engine_Vespa_SimpleTokenizer_Tokenize.md)

Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by Vespa's linguistics library, for breaking text into annotated tokens at character-type boundaries. It integrates normalization, transformation, stemming, and special-token recognition into a single tokenization pipeline.

Description

The SimpleTokenizer class implements the Tokenizer interface and provides Vespa's default tokenization pipeline. It splits input text into tokens at character-type transitions (letter-to-digit, digit-to-symbol, etc.) and applies a chain of processing steps to each token.

The tokenizer maintains references to several collaborating components:

  • Normalizer: A Normalizer instance (typically SimpleNormalizer) for Unicode NFKC normalization.
  • Transformer: A Transformer instance (typically SimpleTransformer) for accent dropping and case folding.
  • KStemmer: A KStemmer instance for English stemming.
  • SpecialTokenRegistry: A registry of predefined tokens (URLs, product codes, etc.) that should be recognized as single units rather than being split.
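
To make the split between the Normalizer and the Transformer concrete, here is a minimal JDK-only sketch of what NFKC normalization does and does not do. It uses java.text.Normalizer directly rather than Vespa's SimpleNormalizer, and the class name NfkcSketch is hypothetical. NFKC folds compatibility forms such as ligatures and fullwidth letters, but it keeps accents, which is why accent dropping is a separate step.

```java
import java.text.Normalizer;

/** JDK-only illustration of NFKC normalization; not Vespa's SimpleNormalizer. */
final class NfkcSketch {

    /** Applies Unicode NFKC (compatibility decomposition, then canonical composition). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(nfkc("\ufb01le"));     // ligature "ﬁ" is folded: prints "file"
        System.out.println(nfkc("\uff21\uff22")); // fullwidth "ＡＢ" is folded: prints "AB"
        // "e" followed by a combining acute accent is composed into one code point; the accent stays.
        System.out.println(nfkc("e\u0301").equals("\u00e9")); // prints "true"
    }
}
```

Removing the accent itself is left to the transformation step, so a pipeline that only normalizes will still distinguish accented from unaccented spellings.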

The tokenize method is a thin dispatch method that delegates to an internal tokenization engine with a token processor callback. The actual work happens in two phases:

  1. Token splitting: The input text is scanned character by character. At each transition between character types (as determined by Unicode general category), a new token boundary is created.
  2. Token processing: Each token is passed through the processToken method, which applies normalization, accent dropping, lowercasing, and stemming based on the LinguisticsParameters settings.
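
The splitting phase described above can be sketched in plain Java. This is a simplified illustration, not Vespa's implementation: the Kind enum, the Piece record, and the classify helper are hypothetical names, and the classification is much coarser than the Unicode general categories the real tokenizer consults, so the exact grouping may differ from what SimpleTokenizer produces.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of splitting text at character-type transitions. */
final class BoundarySplitSketch {

    /** Coarse character classes; an assumption made for this sketch only. */
    enum Kind { ALPHABETIC, NUMERIC, SPACE, OTHER }

    /** A run of same-kind characters plus its start offset in the input. */
    record Piece(String text, Kind kind, int offset) {}

    static Kind classify(int codePoint) {
        if (Character.isLetter(codePoint)) return Kind.ALPHABETIC;
        if (Character.isDigit(codePoint)) return Kind.NUMERIC;
        if (Character.isWhitespace(codePoint)) return Kind.SPACE;
        return Kind.OTHER;
    }

    static List<Piece> split(String input) {
        List<Piece> pieces = new ArrayList<>();
        int start = 0;
        Kind current = null;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            Kind kind = classify(cp);
            // A change of kind closes the running piece and opens a new one.
            if (current != null && kind != current) {
                pieces.add(new Piece(input.substring(start, i), current, start));
                start = i;
            }
            current = kind;
            i += Character.charCount(cp); // advance by code point, not by char
        }
        if (current != null) {
            pieces.add(new Piece(input.substring(start), current, start));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Yields five pieces: "abc", "123", " ", "x", "!"
        for (Piece p : split("abc123 x!")) {
            System.out.println(p);
        }
    }
}
```

Iterating by code point rather than by char keeps supplementary characters (outside the Basic Multilingual Plane) from being split in the middle of a surrogate pair.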

The LinguisticsParameters object controls which processing steps are applied:

  • Whether stemming is enabled.
  • Whether accent removal is enabled.
  • The target language for language-specific processing.
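
As a rough illustration of how such settings can gate the per-token steps, the sketch below applies normalization, optional accent removal, lowercasing, and optional stemming in the order listed earlier. Everything here is an assumption made for the example: Params is a hypothetical stand-in for LinguisticsParameters, the stemmer is passed in as a plain function, and the toy stemmer in main bears no resemblance to KStemmer.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Sketch of parameter-gated token processing; not Vespa's processToken. */
final class ProcessTokenSketch {

    /** Hypothetical stand-in for LinguisticsParameters; field names are assumptions. */
    record Params(boolean stem, boolean removeAccents) {}

    static String processToken(String token, Params params, UnaryOperator<String> stemmer) {
        String result = Normalizer.normalize(token, Normalizer.Form.NFKC);
        if (params.removeAccents()) {
            // Decompose, then delete combining marks (Unicode category M).
            result = Normalizer.normalize(result, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
        result = result.toLowerCase(Locale.ROOT);
        if (params.stem()) {
            result = stemmer.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy stemmer for illustration only: strips one trailing "s".
        UnaryOperator<String> toyStemmer =
                s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
        // Accents removed and stemmed: prints "cafe"
        System.out.println(processToken("Caf\u00e9s", new Params(true, true), toyStemmer));
        // Only normalized and lowercased: prints "cafés"
        System.out.println(processToken("Caf\u00e9s", new Params(false, false), toyStemmer));
    }
}
```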

Usage

Use SimpleTokenizer.tokenize() for converting document or query text into terms for indexing or matching. This is the central component of Vespa's text processing pipeline and is invoked automatically by the indexing and search infrastructure.

Direct usage is appropriate when:

  • Building custom text processing workflows outside the standard Vespa indexing pipeline.
  • Testing tokenization behavior on specific inputs.
  • Integrating Vespa's tokenization into external tools or analysis scripts.

Code Reference

Source Location

  • Repository: Vespa
  • File: linguistics/src/main/java/com/yahoo/language/simple/SimpleTokenizer.java
  • Lines: 54-57

Signature

@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters)

Class Declaration

public class SimpleTokenizer implements Tokenizer

Package

package com.yahoo.language.simple;

Key Fields

private final Normalizer normalizer;
private final Transformer transformer;
private final KStemmer stemmer;
private final SpecialTokenRegistry specialTokenRegistry;

Method Body

@Override
public Iterable<Token> tokenize(String input, LinguisticsParameters parameters) {
    return tokenize(input, token -> processToken(token, parameters));
}
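
The delegation pattern in this method body, where the public overload binds the settings into a per-token callback and hands it to a splitting engine, can be sketched without any Vespa types. The class DispatchSketch and its whitespace-only splitting are assumptions made to keep the example self-contained; only the shape of the two overloads mirrors the method above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Sketch of a thin entry point delegating to an engine with a token callback. */
final class DispatchSketch {

    /** Entry point: captures the settings in a lambda, then delegates. */
    static List<String> tokenize(String input, boolean lowercase) {
        return tokenize(input, t -> lowercase ? t.toLowerCase(Locale.ROOT) : t);
    }

    /** Engine: knows how to split, but not how each token is processed. */
    static List<String> tokenize(String input, UnaryOperator<String> processor) {
        List<String> out = new ArrayList<>();
        for (String piece : input.split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(processor.apply(piece));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello  World", true)); // prints "[hello, world]"
    }
}
```

Keeping the processing step behind a callback lets the same splitting loop serve callers that want different per-token treatment.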

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input | String | Yes | The text to tokenize. Can be any length; the tokenizer processes it character by character. |
| parameters | LinguisticsParameters | Yes | Controls tokenization behavior: the target language, the stemming mode (which also determines whether stemming is applied), and whether accent removal is enabled. |

LinguisticsParameters Fields

| Field | Type | Description |
|-------|------|-------------|
| language | Language | The language of the input text, used for language-specific stemming and transformation. |
| stemMode | StemMode | Controls stemming behavior: NONE, DEFAULT, ALL, SHORTEST, or BEST. |
| removeAccents | boolean | Whether to strip diacritical marks from tokens. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| (return value) | Iterable<Token> | A lazy iterable of tokens. Each Token carries: the original string, the processed (normalized/stemmed) string, the token type (ALPHABETIC, NUMERIC, SYMBOL, SPACE, etc.), and positional information (offset and length in the original input). |

Usage Examples

Basic Usage

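import com.yahoo.language.simple.SimpleTokenizer;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.Language;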

SimpleTokenizer tokenizer = new SimpleTokenizer();

LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.DEFAULT,
    true  // removeAccents
);

Iterable<Token> tokens = tokenizer.tokenize("The quick brown foxes jumped!", params);

for (Token token : tokens) {
    System.out.println(token.getOrig() + " -> " + token.getTokenString()
                       + " [" + token.getType() + "]");
}
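// Expected output (the SPACE tokens between words are omitted here for brevity;
// exact stems depend on the stemmer):
// The -> the [ALPHABETIC]
// quick -> quick [ALPHABETIC]
// brown -> brown [ALPHABETIC]
// foxes -> fox [ALPHABETIC]      (stemmed)
// jumped -> jump [ALPHABETIC]    (stemmed)
// ! -> ! [PUNCTUATION]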

Without Stemming

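import com.yahoo.language.simple.SimpleTokenizer;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.Language;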

SimpleTokenizer tokenizer = new SimpleTokenizer();

LinguisticsParameters params = new LinguisticsParameters(
    Language.ENGLISH,
    StemMode.NONE,
    false  // do not remove accents
);

Iterable<Token> tokens = tokenizer.tokenize("café résumé", params);

for (Token token : tokens) {
    System.out.println(token.getOrig() + " -> " + token.getTokenString());
}
// Expected output (the SPACE token is omitted); accents are preserved and nothing is stemmed:
// café -> café
// résumé -> résumé

With Custom Normalizer and Transformer

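import com.yahoo.language.Language;
import com.yahoo.language.process.LinguisticsParameters;
import com.yahoo.language.process.StemMode;
import com.yahoo.language.process.Token;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTokenizer;
import com.yahoo.language.simple.SimpleTransformer;

// Construct the tokenizer with explicit dependencies; substitute your own
// Normalizer and Transformer implementations here to customize the pipeline.
SimpleNormalizer normalizer = new SimpleNormalizer();
SimpleTransformer transformer = new SimpleTransformer();

SimpleTokenizer tokenizer = new SimpleTokenizer(normalizer, transformer);

// Example input and parameters (same constructor form as in Basic Usage)
String inputText = "The quick brown foxes jumped!";
LinguisticsParameters params = new LinguisticsParameters(Language.ENGLISH, StemMode.DEFAULT, true);

// Tokenize with the configured pipeline
Iterable<Token> tokens = tokenizer.tokenize(inputText, params);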
