Implementation:Vespa engine Vespa SimpleNormalizer Normalize

From Leeroopedia


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for normalizing Unicode text to NFKC form provided by Vespa's linguistics library. Delegates to the Java standard library's java.text.Normalizer to ensure consistent character representation across all text processing.

Description

The SimpleNormalizer class implements the Normalizer interface and provides a straightforward Unicode normalization implementation that converts input text to NFKC (Compatibility Decomposition followed by Canonical Composition) form.

This implementation is a thin wrapper around java.text.Normalizer.normalize() from the Java standard library. It performs no additional processing -- the entire normalization is handled by the JDK's Unicode normalization implementation, which conforms to Unicode Standard Annex #15.

The choice of NFKC as the normalization form means:

  • Compatibility characters are decomposed: Fullwidth Latin letters (common in CJK text), ligatures, and other compatibility variants are replaced with their standard equivalents.
  • Canonical equivalents are composed: After decomposition, characters that have precomposed forms are recomposed for compactness.
  • The result is idempotent: Normalizing already-normalized text returns the same string.
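The three properties above can be checked directly against the JDK, without a Vespa dependency. The following is a minimal sketch: the class name NfkcPropertiesDemo and the helper nfkc are made up for illustration, and the helper simply repeats the java.text.Normalizer call that SimpleNormalizer delegates to.

```java
import java.text.Normalizer;

final class NfkcPropertiesDemo {

    // Illustrative helper: the same JDK call that SimpleNormalizer.normalize() wraps.
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Compatibility decomposition: fullwidth H (U+FF28) becomes "H",
        // and the fi ligature (U+FB01) becomes the two letters "fi".
        System.out.println(nfkc("\uFF28\uFB01"));               // Hfi

        // Canonical composition: "e" + combining acute (U+0301) becomes U+00E9.
        System.out.println(nfkc("e\u0301").equals("\u00E9"));   // true

        // Idempotence: normalizing a second time changes nothing.
        String once = nfkc("\uFF28e\u0301\uFB01");
        System.out.println(once.equals(nfkc(once)));            // true
    }
}
```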

Usage

Use SimpleNormalizer.normalize() as an early step in the text processing pipeline, before tokenization, so that all text uses a consistent Unicode representation. This matters most when text comes from multiple sources that may use different code point sequences for visually identical characters, for example a precomposed "é" (U+00E9) versus "e" followed by a combining acute accent (U+0301).

The normalizer should be applied at both index time and query time to ensure that indexed terms and query terms use the same Unicode forms.
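To illustrate why both sides need the same treatment, here is a small sketch that uses a plain HashSet as a stand-in for an index. This is not how Vespa stores terms; the class SymmetricNormalizationDemo and its methods are hypothetical, and the JDK normalizer is called directly so the example runs without Vespa on the classpath.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

final class SymmetricNormalizationDemo {

    // Illustrative helper: the same JDK call that SimpleNormalizer.normalize() wraps.
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // Hypothetical toy index: a set of NFKC-normalized terms.
    static Set<String> buildIndex(String... terms) {
        Set<String> index = new HashSet<>();
        for (String term : terms) {
            index.add(nfkc(term));
        }
        return index;
    }

    // Query-time lookup that normalizes the query term the same way.
    static boolean matches(Set<String> index, String queryTerm) {
        return index.contains(nfkc(queryTerm));
    }

    public static void main(String[] args) {
        // Indexed with a precomposed é (U+00E9).
        Set<String> index = buildIndex("caf\u00E9");

        // Query typed with "e" + combining acute accent (U+0301).
        String query = "cafe\u0301";

        System.out.println(matches(index, query));   // true: both sides normalized
        System.out.println(index.contains(query));   // false: raw query is a different string
    }
}
```

Skipping normalization on either side reproduces the second case: the strings look identical on screen but do not compare equal.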

Code Reference

Source Location

  • Repository: Vespa
  • File: linguistics/src/main/java/com/yahoo/language/simple/SimpleNormalizer.java
  • Lines: 12-14

Signature

@Override
public String normalize(String input)

Class Declaration

public class SimpleNormalizer implements Normalizer

Package

package com.yahoo.language.simple;

Imports

import com.yahoo.language.process.Normalizer;

Method Body

@Override
public String normalize(String input) {
    return java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFKC);
}

I/O Contract

Inputs

Name | Type | Required | Description
input | String | Yes | The text to normalize. Can contain any Unicode characters. If the text is already in NFKC form, it is returned unchanged.

Outputs

Name | Type | Description
(return value) | String | The input text converted to NFKC normalization form. Compatibility characters are replaced with their compatibility decompositions, and canonically equivalent sequences are then recomposed into precomposed characters where such forms exist.
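The contract can be probed at the JDK level with a short sketch. Note the assumptions: the class NormalizeContractDemo and the null-tolerant helper nfkcOrNull are hypothetical and not part of Vespa. The JDK documents that java.text.Normalizer.normalize throws NullPointerException for a null source; since the method body shown above performs no null check, SimpleNormalizer presumably propagates that exception.

```java
import java.text.Normalizer;

final class NormalizeContractDemo {

    // Illustrative helper: the same JDK call that SimpleNormalizer.normalize() wraps.
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // Hypothetical null-tolerant variant for callers that may receive missing text.
    static String nfkcOrNull(String input) {
        return input == null ? null : nfkc(input);
    }

    public static void main(String[] args) {
        String clean = "Hello";

        // Already-normalized text is reported as normalized and comes back equal.
        System.out.println(Normalizer.isNormalized(clean, Normalizer.Form.NFKC)); // true
        System.out.println(nfkc(clean).equals(clean));                            // true

        // A null source is rejected by the JDK call.
        try {
            nfkc(null);
        } catch (NullPointerException e) {
            System.out.println("null input rejected");
        }

        // The hypothetical wrapper passes null through instead.
        System.out.println(nfkcOrNull(null) == null);                             // true
    }
}
```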

Usage Examples

Basic Usage

import com.yahoo.language.simple.SimpleNormalizer;

SimpleNormalizer normalizer = new SimpleNormalizer();

// Normalize fullwidth Latin characters (common in CJK text)
String fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F";  // "Hello" written with fullwidth letters
String result = normalizer.normalize(fullwidth);
// result -> "Hello" (standard Latin characters)

Ligature Decomposition

import com.yahoo.language.simple.SimpleNormalizer;

SimpleNormalizer normalizer = new SimpleNormalizer();

// Decompose ligatures
String withLigature = "\uFB01nding";  // "finding" (fi ligature)
String result = normalizer.normalize(withLigature);
// result -> "finding" (separate f and i characters)

Precomposed Character Handling

import com.yahoo.language.simple.SimpleNormalizer;

SimpleNormalizer normalizer = new SimpleNormalizer();

// Both representations normalize to the same form
String precomposed = "\u00E9";         // é (precomposed, U+00E9)
String decomposed = "e\u0301";        // e + combining acute accent

String result1 = normalizer.normalize(precomposed);
String result2 = normalizer.normalize(decomposed);
// result1.equals(result2) -> true (both normalize to precomposed form)

Integration in Token Processing

import com.yahoo.language.Language;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.simple.SimpleTransformer;
import java.util.Locale;

// TokenProcessor is an illustrative wrapper class, not part of Vespa.
public class TokenProcessor {

    private final SimpleNormalizer normalizer = new SimpleNormalizer();
    private final SimpleTransformer transformer = new SimpleTransformer();

    public String processToken(String token, Language language) {
        // Step 1: Normalize Unicode to NFKC
        String normalized = normalizer.normalize(token);

        // Step 2: Drop accents
        String noAccents = transformer.accentDrop(normalized, language);

        // Step 3: Lowercase with a fixed locale so the result does not
        // depend on the JVM's default locale
        return noAccents.toLowerCase(Locale.ROOT);
    }
}

Related Pages

Implements Principle
