Implementation:Vespa engine Vespa SimpleNormalizer Normalize
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool in Vespa's linguistics library that normalizes Unicode text to NFKC form. It delegates to the Java standard library's java.text.Normalizer so that equivalent character sequences end up with a single, consistent representation throughout text processing.
Description
The SimpleNormalizer class implements the Normalizer interface and provides a straightforward Unicode normalization implementation that converts input text to NFKC (Compatibility Decomposition followed by Canonical Composition) form.
This implementation is a thin wrapper around java.text.Normalizer.normalize() from the Java standard library. It performs no additional processing -- the entire normalization is handled by the JDK's Unicode normalization implementation, which conforms to Unicode Standard Annex #15.
The choice of NFKC as the normalization form means:
- Compatibility characters are decomposed: Fullwidth Latin letters (common in CJK text), ligatures, and other compatibility variants are replaced with their standard equivalents.
- Canonical equivalents are composed: After decomposition, characters that have precomposed forms are recomposed for compactness.
- The result is idempotent: Normalizing already-normalized text returns the same string.
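The three properties above can be checked directly against the JDK. The following is a minimal sketch that calls java.text.Normalizer itself (the call SimpleNormalizer delegates to) rather than the Vespa class, so that it runs without Vespa on the classpath. The class name NfkcProperties and the helper nfkc are hypothetical names used only for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class: exercises the same JDK call that
// SimpleNormalizer.normalize() delegates to.
final class NfkcProperties {

    // Illustrative helper; mirrors the body of SimpleNormalizer.normalize().
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Compatibility decomposition: fullwidth A (U+FF21) maps to ASCII "A".
        System.out.println(nfkc("\uFF21").equals("A"));        // prints true

        // Canonical composition: "e" + combining acute (U+0301) maps to U+00E9.
        System.out.println(nfkc("e\u0301").equals("\u00E9"));  // prints true

        // Idempotence: normalizing normalized text changes nothing.
        String once = nfkc("\uFB01 \uFF21 e\u0301");
        System.out.println(nfkc(once).equals(once));            // prints true
    }
}
```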
Usage
Use SimpleNormalizer.normalize() as an early step in the text processing pipeline, before tokenization, so that all text uses consistent Unicode representations. This matters most when processing text from multiple sources that may use different code point sequences for visually identical characters, such as a precomposed letter versus a base letter followed by a combining mark.
The normalizer should be applied at both index time and query time to ensure that indexed terms and query terms use the same Unicode forms.
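Why both sides need normalizing can be illustrated with a toy in-memory term set. This is a sketch under stated assumptions, not how Vespa stores or matches terms: NormalizedTermIndex and its methods are hypothetical, and the JDK normalizer stands in for SimpleNormalizer.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy index: a set of terms normalized at index time.
final class NormalizedTermIndex {

    private final Set<String> terms = new HashSet<>();

    private static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Index time: store the NFKC form of the term.
    void index(String term) {
        terms.add(nfkc(term));
    }

    // Query time with normalization: look up the NFKC form of the query term.
    boolean matches(String queryTerm) {
        return terms.contains(nfkc(queryTerm));
    }

    // Query time without normalization, shown only for contrast.
    boolean matchesRaw(String queryTerm) {
        return terms.contains(queryTerm);
    }

    public static void main(String[] args) {
        NormalizedTermIndex index = new NormalizedTermIndex();
        index.index("caf\u00E9");                // precomposed e-acute

        String decomposedQuery = "cafe\u0301";   // "e" + combining acute
        System.out.println(index.matches(decomposedQuery));     // prints true
        System.out.println(index.matchesRaw(decomposedQuery));  // prints false
    }
}
```

Skipping normalization on either side leaves the two spellings as different strings, so the lookup misses even though the text looks identical.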
Code Reference
Source Location
- Repository: Vespa
- File: linguistics/src/main/java/com/yahoo/language/simple/SimpleNormalizer.java
- Lines: 12-14
Signature
@Override
public String normalize(String input)
Class Declaration
public class SimpleNormalizer implements Normalizer
Package
package com.yahoo.language.simple;
Imports
import com.yahoo.language.process.Normalizer;
Method Body
@Override
public String normalize(String input) {
    return java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFKC);
}
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | String | Yes | The text to normalize. Can contain any Unicode characters. If the text is already in NFKC form, it is returned unchanged. |
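The input contract can be probed with the JDK alone. The sketch below assumes that SimpleNormalizer adds nothing on top of the delegated call, as the method body above suggests, so it exercises java.text.Normalizer directly. NfkcInputContract and its helpers are hypothetical names. Normalizer.isNormalized reports whether text is already in NFKC form, and the JDK documents a NullPointerException for a null source, which is one reading of why the input is marked as required.

```java
import java.text.Normalizer;

// Hypothetical demo class probing the input contract via the JDK normalizer.
final class NfkcInputContract {

    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // True when the text is already in NFKC form.
    static boolean isAlreadyNfkc(String input) {
        return Normalizer.isNormalized(input, Normalizer.Form.NFKC);
    }

    // True when the JDK call rejects a null input with NullPointerException.
    static boolean rejectsNull() {
        try {
            nfkc(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAlreadyNfkc("Hello"));          // prints true
        System.out.println(isAlreadyNfkc("\uFB01nding"));    // prints false (fi ligature)
        System.out.println(nfkc("Hello").equals("Hello"));   // prints true
        System.out.println(rejectsNull());                   // prints true
    }
}
```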
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | String | The input text converted to NFKC normalization form. Compatibility characters are replaced with their compatibility equivalents, and canonically decomposed sequences are recomposed. |
Usage Examples
Basic Usage
import com.yahoo.language.simple.SimpleNormalizer;
SimpleNormalizer normalizer = new SimpleNormalizer();
// Normalize fullwidth Latin characters (common in CJK text)
String fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F"; // "Hello"
String result = normalizer.normalize(fullwidth);
// result -> "Hello" (standard Latin characters)
Ligature Decomposition
import com.yahoo.language.simple.SimpleNormalizer;
SimpleNormalizer normalizer = new SimpleNormalizer();
// Decompose ligatures
String withLigature = "\uFB01nding"; // "finding" (fi ligature)
String result = normalizer.normalize(withLigature);
// result -> "finding" (separate f and i characters)
Precomposed Character Handling
import com.yahoo.language.simple.SimpleNormalizer;
SimpleNormalizer normalizer = new SimpleNormalizer();
// Both representations normalize to the same form
String precomposed = "\u00E9"; // e (precomposed)
String decomposed = "e\u0301"; // e + combining acute accent
String result1 = normalizer.normalize(precomposed);
String result2 = normalizer.normalize(decomposed);
// result1.equals(result2) -> true (both normalize to precomposed form)
Integration in Token Processing
import com.yahoo.language.Language;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTransformer;
import java.util.Locale;
SimpleNormalizer normalizer = new SimpleNormalizer();
SimpleTransformer transformer = new SimpleTransformer();
public String processToken(String token, Language language) {
    // Step 1: Normalize Unicode to NFKC
    String normalized = normalizer.normalize(token);
    // Step 2: Drop accents
    String noAccents = transformer.accentDrop(normalized, language);
    // Step 3: Lowercase with a fixed locale so the result does not depend on the JVM default
    return noAccents.toLowerCase(Locale.ROOT);
}
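For readers without Vespa on the classpath, the same three-step order can be sketched with the JDK alone. This is an approximation under stated assumptions: the accent-dropping step here (NFD decomposition followed by removal of combining marks) is a stand-in technique and is not claimed to match SimpleTransformer.accentDrop, and TokenPipelineSketch and process are hypothetical names.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical JDK-only sketch of the normalize / drop accents / lowercase order.
final class TokenPipelineSketch {

    static String process(String token) {
        // Step 1: NFKC, the same call SimpleNormalizer delegates to.
        String normalized = Normalizer.normalize(token, Normalizer.Form.NFKC);

        // Step 2 (stand-in, not Vespa's accentDrop): decompose to NFD,
        // then remove every combining mark.
        String noAccents = Normalizer.normalize(normalized, Normalizer.Form.NFD)
                                     .replaceAll("\\p{M}+", "");

        // Step 3: lowercase with a fixed locale.
        return noAccents.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Fullwidth C (U+FF23) + "af" + precomposed e-acute (U+00E9)
        System.out.println(process("\uFF23af\u00E9"));  // prints cafe
    }
}
```

Running NFKC first means the later steps only ever see standard-width, composed characters, which keeps the accent and case handling simple.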