Principle:Vespa engine Vespa Unicode Normalization

From Leeroopedia


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

Unicode normalization converts text into a canonical representation so that equivalent character sequences, which may differ in their underlying code points, compare as identical. This gives consistent behavior in tokenization, indexing, and matching operations.

Description

Unicode allows many characters to be represented in more than one way. For example, the ligature "ﬁ" can be encoded as a single code point (U+FB01 LATIN SMALL LIGATURE FI) or as two separate code points (U+0066 LATIN SMALL LETTER F followed by U+0069 LATIN SMALL LETTER I); these two spellings are compatibility equivalents. Similarly, accented characters like "é" can be encoded as a single precomposed code point (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or as a base character plus a combining mark (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT); these two spellings are canonical equivalents.
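
As a quick check of the two examples above, the following sketch uses Python's standard `unicodedata` module to show that the alternative spellings compare unequal until they are normalized. The variable names are illustrative only.

```python
import unicodedata

# "é" as one precomposed code point vs. base letter + combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The ligature U+FB01, to be compared against the two plain letters "f" and "i".
ligature = "\ufb01"

# A raw comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# Canonical equivalence: NFC is enough to unify the two spellings of "é".
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# Compatibility equivalence: NFC leaves the ligature alone, NFKC expands it.
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)

print(raw_equal, nfc_equal, ligature_nfc == ligature, ligature_nfkc == "fi")
# prints: False True True True
```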

Without normalization, these different encodings are treated as different strings, even though they represent the same text to a human reader. This causes serious problems in text processing:

  • Search misses: A query containing one encoding will fail to match a document containing the equivalent encoding.
  • Duplicate entries: Indexes may contain duplicate entries for what is logically the same term.
  • Inconsistent tokenization: Tokenizers may produce different results for equivalent inputs.

Unicode defines four normalization forms:

Form | Name | Description
NFC | Canonical Decomposition followed by Canonical Composition | Decomposes, then recomposes to precomposed characters where possible. Usually the most compact of the canonical forms.
NFD | Canonical Decomposition | Fully decomposes characters into base + combining marks.
NFKC | Compatibility Decomposition followed by Canonical Composition | Like NFC but also replaces compatibility characters (ligatures, width variants, etc.) with their compatibility equivalents.
NFKD | Compatibility Decomposition | Like NFD but also decomposes compatibility characters.
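
To make the differences between the four forms concrete, the sketch below normalizes one sample string with each form and lists the resulting code points. It relies only on Python's standard `unicodedata` module; the sample string and the `code_points` helper are illustrative choices.

```python
import unicodedata

# "ﬁancé": ligature U+FB01, then "anc", then precomposed é (U+00E9).
sample = "\ufb01anc\u00e9"

def code_points(text: str) -> list[str]:
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Apply each normalization form to the same input.
forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in forms.items():
    print(form, len(text), code_points(text))
```

For this input, NFC yields 5 code points (ligature kept, é precomposed), NFD and NFKC yield 6 each (NFD splits é, NFKC splits the ligature), and NFKD yields 7 (both are split).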

For search and information retrieval, NFKC is a common choice because it:

  • Replaces compatibility characters (e.g., fullwidth Latin letters common in CJK text) with standard forms.
  • Decomposes ligatures into their component characters.
  • Produces precomposed (compact) output.
  • Maximizes the chance that equivalent-looking text will match.
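
The points above can be illustrated with a short sketch. `search_key` is a hypothetical helper name, not part of any particular engine's API; it simply applies NFKC before comparison.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical helper: the form under which text is compared."""
    return unicodedata.normalize("NFKC", text)

# Fullwidth Latin letters sit 0xFEE0 above their ASCII counterparts.
fullwidth = "".join(chr(ord(ch) + 0xFEE0) for ch in "Vespa")

print(fullwidth == "Vespa")                      # False: raw comparison
print(search_key(fullwidth) == "Vespa")          # True: width variants folded
print(search_key("\ufb01le") == "file")          # True: ligature expanded
print(search_key("cafe\u0301") == "caf\u00e9")   # True: output is precomposed
```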

Usage

Unicode normalization should be applied:

  • Before tokenization: To ensure consistent token boundaries regardless of the input encoding.
  • Before indexing: To prevent duplicate index entries for equivalent text.
  • At both index time and query time: Both sides must use the same normalization form to ensure matching.
  • When processing text from diverse sources: Different systems, operating systems, and input methods may produce different Unicode encodings for the same visual text.
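
A minimal sketch of these rules, assuming a toy in-memory inverted index: `TinyIndex`, `analyze`, and the choice of NFKC are all hypothetical and stand in for whatever indexing pipeline is actually used. The key point is that one shared function normalizes before tokenizing, and that both the indexing path and the query path call it.

```python
import unicodedata
from collections import defaultdict

FORM = "NFKC"  # assumed choice; must be identical at index and query time

def analyze(text: str) -> list[str]:
    """Normalize first, then tokenize (whitespace split as a stand-in)."""
    return unicodedata.normalize(FORM, text).split()

class TinyIndex:
    """Hypothetical in-memory inverted index, for illustration only."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        # Index time: every token is stored in normalized form.
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set[int]:
        # Query time: the same analysis, so equivalent spellings meet.
        hits = [self.postings.get(token, set()) for token in analyze(query)]
        return set.intersection(*hits) if hits else set()

index = TinyIndex()
index.add(1, "cafe\u0301 menu")   # decomposed é
index.add(2, "caf\u00e9 hours")   # precomposed é

print(sorted(index.postings))     # a single entry for "café", not two
print(index.search("caf\u00e9"))  # both documents match
```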

Canonical normalization (NFC or NFD) is generally safe to apply unconditionally, because it only unifies canonically equivalent sequences. The compatibility forms (NFKC, NFKD) go further and discard distinctions that are occasionally meaningful (e.g., Roman numeral characters versus Latin letters, or superscript digits versus plain digits); where such distinctions matter, NFC may be preferred over NFKC.
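
The following sketch shows the kind of distinction that compatibility normalization removes. The sample strings are illustrative; only the standard `unicodedata` module is used.

```python
import unicodedata

samples = {
    "roman numeral four": "\u2163",  # Ⅳ, a single code point
    "superscript two": "x\u00b2",    # x²
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)    # keeps the special character
    nfkc = unicodedata.normalize("NFKC", text)  # replaces it with plain letters/digits
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

Under NFKC the Roman numeral becomes the two Latin letters "IV" and "x²" becomes "x2", so the original characters can no longer be recovered from the normalized text.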

Theoretical Basis

Unicode normalization is defined by a formal algorithm specified in Unicode Standard Annex #15. The process operates in two phases:

Phase 1: Decomposition

Each character is recursively replaced with its decomposition mapping (if one exists) until no further decompositions are possible.

function decompose(text, useCompatibility):
    result = ""
    for each character c in text:
        if useCompatibility and hasCompatibilityDecomposition(c):
            result += decompose(compatibilityDecomposition(c), useCompatibility)
        else if hasCanonicalDecomposition(c):
            result += decompose(canonicalDecomposition(c), useCompatibility)
        else:
            result += c
    return canonicalOrder(result)

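After decomposition, each run of consecutive combining marks is sorted by ascending Canonical Combining Class (CCC) value. The sort is stable, so marks with the same CCC keep their relative order, and starters (CCC 0) are never moved. This ensures that equivalent sequences of combining marks are ordered identically.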
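
The decomposition phase can be sketched in Python on top of the standard `unicodedata` module, which exposes each character's decomposition mapping (`unicodedata.decomposition`) and combining class (`unicodedata.combining`). This is a simplified illustration, not a replacement for `unicodedata.normalize`: the function names are my own, and Hangul syllables, which UAX #15 decomposes arithmetically rather than through mappings, are not handled.

```python
import unicodedata

def canonical_order(text: str) -> str:
    """Stably sort each run of combining marks by combining class."""
    chars = list(text)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1  # starters stay where they are
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        # sorted() is stable, so marks with equal CCC keep their order.
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j
    return "".join(chars)

def decompose(text: str, use_compatibility: bool = False) -> str:
    """Recursively expand decomposition mappings (Hangul not handled)."""
    out = []
    for ch in text:
        # Mapping looks like "0065 0301" or "<compat> 0066 0069"; "" if none.
        parts = unicodedata.decomposition(ch).split()
        is_compat = bool(parts) and parts[0].startswith("<")
        if not parts or (is_compat and not use_compatibility):
            out.append(ch)
            continue
        if is_compat:
            parts = parts[1:]  # drop the formatting tag such as "<compat>"
        expanded = "".join(chr(int(p, 16)) for p in parts)
        out.append(decompose(expanded, use_compatibility))
    return canonical_order("".join(out))

# U+1E69 needs two rounds of expansion: s + dot below (220) + dot above (230).
print([f"U+{ord(c):04X}" for c in decompose("\u1e69")])
# Marks typed as acute (230) then dot below (220) are swapped.
print([f"U+{ord(c):04X}" for c in canonical_order("e\u0301\u0323")])
```

For non-Hangul input the result should agree with `unicodedata.normalize("NFD", ...)` or, with `use_compatibility=True`, with `unicodedata.normalize("NFKD", ...)`.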

Phase 2: Composition (NFC/NFKC only)

The decomposed string is scanned for sequences that can be recombined into precomposed characters:

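function compose(decomposedText):
    # Simplified: compositions of two starters (e.g. Hangul) are not shown.
    result = ""
    i = 0
    while i < length(decomposedText):
        starter = decomposedText[i]
        pending = ""        # marks that could not be composed
        lastCCC = 0         # CCC of the most recent uncomposed mark
        j = i + 1
        while j < length(decomposedText) and isCombiningMark(decomposedText[j]):
            mark = decomposedText[j]
            # A mark is blocked if an uncomposed mark with an equal or
            # higher CCC sits between it and the starter.
            blocked = lastCCC >= ccc(mark)
            composite = lookupComposition(starter, mark)
            if composite exists and not blocked:
                starter = composite
            else:
                pending += mark
                lastCCC = ccc(mark)
            j = j + 1
        result += starter + pending
        i = j
    return result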
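
The effect of composition, including the blocking rule, can be observed with the standard library alone. The sketch below does not reimplement the algorithm; it only feeds three illustrative inputs through `unicodedata.normalize("NFC", ...)`.

```python
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

# Simple case: starter + one mark composes to a precomposed character.
simple = nfc("e\u0301")          # e + acute

# Marks typed in non-canonical order still compose: decomposition reorders
# circumflex (CCC 230) after dot below (CCC 220), then both are absorbed.
stacked = nfc("o\u0302\u0323")   # o + circumflex + dot below

# Blocking: overline and acute share CCC 230, so the overline sits between
# "a" and the acute accent and prevents them from composing.
blocked = nfc("a\u0305\u0301")   # a + overline + acute

print(len(simple), len(stacked), len(blocked))  # prints: 1 1 3
```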

Key theoretical properties:

  • Idempotency: Applying the same normalization form multiple times produces the same result as applying it once: normalize(normalize(x)) = normalize(x).
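  • Stability: For assigned characters, normalization mappings do not change between Unicode versions, so a string that is normalized today stays normalized under later versions. Strings containing unassigned code points carry no such guarantee.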
  • Concatenation caveat: normalize(a) + normalize(b) does not always equal normalize(a + b) because combining marks at the boundary may interact.
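
Two of these properties are easy to check directly. The sketch below uses Python's standard `unicodedata` module; the sample strings are illustrative.

```python
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

text = "\ufb01anc\u00e9 e\u0301"

# Idempotency: a second pass changes nothing, for every form.
idempotent = all(
    unicodedata.normalize(form, unicodedata.normalize(form, text))
    == unicodedata.normalize(form, text)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

# Concatenation caveat: the combining acute at the start of b composes
# with the "e" at the end of a only when the two are normalized together.
a, b = "e", "\u0301"
piecewise = nfc(a) + nfc(b)  # two code points: e, U+0301
joined = nfc(a + b)          # one code point: U+00E9

print(idempotent, piecewise == joined, len(piecewise), len(joined))
# prints: True False 2 1
```

In practice this means text assembled from separately normalized fragments should be normalized again after joining.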
