Principle:Vespa engine Vespa Unicode Normalization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Unicode normalization converts text into a canonical representation so that equivalent character sequences, which may differ in their underlying code points, compare as identical. This ensures consistent behavior in tokenization, indexing, and matching operations.
Description
Unicode allows many characters to be represented in more than one way. For example, the ligature "ﬁ" can be encoded as a single code point (U+FB01 LATIN SMALL LIGATURE FI) or as two separate code points (U+0066 LATIN SMALL LETTER F followed by U+0069 LATIN SMALL LETTER I). Similarly, an accented character like "é" can be encoded as a single precomposed code point (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or as a base character plus a combining mark (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT). The ligature case is a compatibility equivalence and the accent case is a canonical equivalence; the normalization forms differ in which of the two they resolve.
Without normalization, these different encodings are treated as different strings, even though they represent the same text to a human reader. This causes serious problems in text processing:
- Search misses: A query containing one encoding will fail to match a document containing the equivalent encoding.
- Duplicate entries: Indexes may contain duplicate entries for what is logically the same term.
- Inconsistent tokenization: Tokenizers may produce different results for equivalent inputs.
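As a minimal illustration of the search-miss problem, the sketch below uses Python's standard-library `unicodedata` module rather than Vespa's own linguistics components. The toy dictionary "index" and all variable names are assumptions made for this example only.

```python
import unicodedata

# Two encodings of the same visible word "café".
precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e (U+0065) followed by U+0301 COMBINING ACUTE ACCENT

# A toy "index" keyed by the raw, unnormalized term.
raw_index = {precomposed: ["doc1"]}
raw_hit = raw_index.get(decomposed)  # the decomposed query misses


def nfc(text: str) -> str:
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)


# Normalizing both the indexed term and the query removes the mismatch.
normalized_index = {nfc(precomposed): ["doc1"]}
normalized_hit = normalized_index.get(nfc(decomposed))

print(precomposed == decomposed)  # False
print(raw_hit)                    # None
print(normalized_hit)             # ['doc1']
```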
Unicode defines four normalization forms:
| Form | Name | Description |
|---|---|---|
| NFC | Canonical Decomposition followed by Canonical Composition | Decomposes, then recomposes to precomposed characters where possible. Typically the most compact form. |
| NFD | Canonical Decomposition | Fully decomposes characters into base + combining marks. |
| NFKC | Compatibility Decomposition followed by Canonical Composition | Like NFC but also replaces compatibility characters (ligatures, width variants, etc.) with their compatibility equivalents. |
| NFKD | Compatibility Decomposition | Like NFD but also decomposes compatibility characters. |
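To make the table concrete, the following sketch applies all four forms to one short string using Python's `unicodedata.normalize`. The sample string and the helper `code_points` are illustrative choices, not part of any Vespa API.

```python
import unicodedata

# U+FB01 (ligature fi) followed by U+00E9 (precomposed e with acute).
sample = "\ufb01\u00e9"


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    print(f"{form:5} {code_points(value)}")
# NFC   U+FB01 U+00E9
# NFD   U+FB01 U+0065 U+0301
# NFKC  U+0066 U+0069 U+00E9
# NFKD  U+0066 U+0069 U+0065 U+0301
```

Only the two compatibility forms split the ligature, and only the two composed forms keep the accent as a single code point.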
For search and information retrieval, NFKC is the most commonly used form because it:
- Replaces compatibility characters (e.g., fullwidth Latin letters common in CJK text) with standard forms.
- Decomposes ligatures into their component characters.
- Produces precomposed (compact) output.
- Maximizes the chance that equivalent-looking text will match.
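The effect of these points on matching can be sketched with fullwidth Latin letters and a ligature. The example strings are made up for illustration, and Python's `unicodedata` stands in for whatever normalizer a real deployment uses.

```python
import unicodedata

# "Vespa" typed with fullwidth letters (U+FF36 U+FF45 U+FF53 U+FF50 U+FF41).
fullwidth = "\uff36\uff45\uff53\uff50\uff41"
# "file" written with the single-code-point ligature U+FB01.
ligature = "\ufb01le"

nfc_fullwidth = unicodedata.normalize("NFC", fullwidth)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(nfc_fullwidth == "Vespa")   # False: NFC keeps the width variants
print(nfkc_fullwidth == "Vespa")  # True: NFKC maps them to ASCII letters
print(nfkc_ligature == "file")    # True: the ligature becomes "f" + "i"
```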
Usage
Unicode normalization should be applied:
- Before tokenization: To ensure consistent token boundaries regardless of the input encoding.
- Before indexing: To prevent duplicate index entries for equivalent text.
- At both index time and query time: Both sides must use the same normalization form to ensure matching.
- When processing text from diverse sources: Different systems, operating systems, and input methods may produce different Unicode encodings for the same visual text.
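Canonical normalization (NFC or NFD) is generally safe to apply unconditionally because it only unifies sequences that are defined as equivalent. Compatibility normalization (NFKC or NFKD) is lossy: in rare cases the distinction between compatibility equivalents is meaningful (e.g., distinguishing Roman numeral characters such as U+2163 ROMAN NUMERAL FOUR from the Latin letters "IV"), in which case NFC may be preferred over NFKC.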
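The guidance above can be sketched as a toy inverted index that applies one normalization form before tokenizing, at both index time and query time. `TinyIndex` and its methods are hypothetical names for this example, whitespace splitting stands in for a real tokenizer, and none of this reflects how Vespa itself is implemented.

```python
import unicodedata
from collections import defaultdict


class TinyIndex:
    """Toy inverted index that normalizes text before tokenizing."""

    def __init__(self, form: str = "NFKC"):
        self._form = form
        self._postings = defaultdict(set)

    def _tokens(self, text: str) -> list:
        # Normalize first, then tokenize, so token boundaries and token
        # contents do not depend on the input's encoding.
        return unicodedata.normalize(self._form, text).split()

    def add(self, doc_id: str, text: str) -> None:
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> set:
        tokens = self._tokens(query)
        if not tokens:
            return set()
        hits = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*hits)


nfkc_index = TinyIndex("NFKC")
nfkc_index.add("doc1", "Chapter \u2163")       # U+2163 ROMAN NUMERAL FOUR
nfkc_result = nfkc_index.search("Chapter IV")  # matches: the numeral was folded to "IV"

nfc_index = TinyIndex("NFC")
nfc_index.add("doc1", "Chapter \u2163")
nfc_result = nfc_index.search("Chapter IV")        # no match: NFC keeps U+2163 distinct
nfc_exact = nfc_index.search("Chapter \u2163")     # matches the original character

print(nfkc_result, nfc_result, nfc_exact)
```

Whether folding the Roman numeral into "IV" is desirable depends on the application, which is the trade-off between NFKC and NFC described above.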
Theoretical Basis
Unicode normalization is defined by a formal algorithm specified in Unicode Standard Annex #15. The process operates in two phases:
Phase 1: Decomposition
Each character is recursively replaced with its decomposition mapping (if one exists) until no further decompositions are possible.
    function decompose(text, useCompatibility):
        result = ""
        for each character c in text:
            if useCompatibility and hasCompatibilityDecomposition(c):
                result += decompose(compatibilityDecomposition(c), useCompatibility)
            else if hasCanonicalDecomposition(c):
                result += decompose(canonicalDecomposition(c), useCompatibility)
            else:
                result += c
        return canonicalOrder(result)
After decomposition, combining marks are reordered into canonical order based on their Canonical Combining Class (CCC) values. This ensures that equivalent sequences of combining marks are ordered identically.
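The decomposition phase, including the reordering step, can be approximated in Python on top of the per-character data exposed by `unicodedata`. This is a sketch under stated assumptions, not the reference algorithm: the function names `expand`, `canonical_order`, and `decompose` are invented here, and Hangul syllables (which Unicode decomposes algorithmically rather than through table entries) are not handled.

```python
import unicodedata


def expand(text: str, use_compatibility: bool) -> str:
    """Recursively replace each character with its decomposition mapping."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)  # e.g. "0065 0301" or "<compat> 0066 0069"
        if not mapping:
            out.append(ch)
            continue
        parts = mapping.split()
        if parts[0].startswith("<"):
            # A tag such as <compat> marks a compatibility mapping.
            if not use_compatibility:
                out.append(ch)
                continue
            parts = parts[1:]
        expanded = "".join(chr(int(p, 16)) for p in parts)
        out.append(expand(expanded, use_compatibility))
    return "".join(out)


def canonical_order(text: str) -> str:
    """Stably sort each run of combining marks by Canonical Combining Class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def decompose(text: str, use_compatibility: bool = False) -> str:
    return canonical_order(expand(text, use_compatibility))


# U+0301 (CCC 230) typed before U+0323 (CCC 220) is reordered.
reordered = decompose("a\u0301\u0323")
print(reordered == "a\u0323\u0301")  # True
```

For the non-Hangul inputs tried here, the result agrees with `unicodedata.normalize("NFD", ...)` and, with `use_compatibility=True`, with `"NFKD"`.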
Phase 2: Composition (NFC/NFKC only)
The decomposed string is scanned for sequences that can be recombined into precomposed characters:
    function compose(decomposedText):
        result = ""
        i = 0
        while i < length(decomposedText):
            starter = decomposedText[i]
            j = i + 1
            while j < length(decomposedText) and isCombiningMark(decomposedText[j]):
                composite = lookupComposition(starter, decomposedText[j])
                // A mark is blocked when a mark between it and the starter
                // has an equal or higher Canonical Combining Class.
                if composite exists and not blocked:
                    starter = composite
                    remove decomposedText[j]
                else:
                    j = j + 1
            result += starter
            i = i + 1
        return result
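A runnable approximation of the composition phase is sketched below in Python. It rests on several assumptions: the pair table is derived from `unicodedata` by keeping only two-code-point canonical decompositions whose character survives NFC unchanged (a shortcut for skipping composition exclusions), Hangul syllable composition is omitted, and the names `build_composition_table` and `compose` are invented for this example.

```python
import sys
import unicodedata


def build_composition_table() -> dict:
    """Map (first, second) pairs to their precomposed character."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no mapping, or a compatibility mapping
        parts = mapping.split()
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # excluded from composition
        table[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return table


TABLE = build_composition_table()


def compose(decomposed: str) -> str:
    """Recombine a canonically ordered, decomposed string."""
    if not decomposed:
        return ""
    out = [decomposed[0]]
    starter_pos = 0
    last_ccc = unicodedata.combining(decomposed[0])
    if last_ccc != 0:
        last_ccc = 256  # a leading non-starter cannot take part in composition
    for ch in decomposed[1:]:
        ccc = unicodedata.combining(ch)
        composite = TABLE.get((out[starter_pos], ch))
        # Not blocked: either adjacent to the starter, or every mark in
        # between has a strictly lower combining class.
        if composite is not None and (last_ccc == 0 or last_ccc < ccc):
            out[starter_pos] = composite
            continue
        if ccc == 0:
            starter_pos = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)


# The cedilla (CCC 202) does not block the acute (CCC 230) ...
unblocked = compose("a\u0327\u0301")
# ... but an overline with the same class (230) does block it.
blocked = compose("a\u0305\u0301")
print(unblocked == "\u00e1\u0327", blocked == "a\u0305\u0301")
```

Building the table scans every code point once, so a real implementation would precompute it; here it keeps the example self-contained.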
Key theoretical properties:
- Idempotency: Applying the same normalization form multiple times produces the same result as applying it once: normalize(normalize(x)) = normalize(x).
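- Stability: Under the Unicode normalization stability policy, a string consisting only of assigned characters keeps the same normalized form in all later versions of the standard (guaranteed from Unicode 4.1 onward). Strings containing code points that were unassigned at the time may normalize differently once those code points are assigned.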
- Concatenation caveat: normalize(a) + normalize(b) does not always equal normalize(a + b) because combining marks at the boundary may interact.
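The idempotency and concatenation properties can be checked directly with Python's `unicodedata`. The sample strings are arbitrary, and the final line only reports which Unicode version the local Python build bundles, which is the version the stability guarantee is measured against.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Idempotency: normalizing twice equals normalizing once.
x = "e\u0301\ufb01"
idempotent = nfc(nfc(x)) == nfc(x)

# Concatenation caveat: a combining mark at the start of b interacts with
# the last character of a only when the two are normalized together.
a, b = "e", "\u0301"
separate = nfc(a) + nfc(b)   # "e" followed by U+0301 (two code points)
together = nfc(a + b)        # U+00E9 (one code point)

print(idempotent)                 # True
print(separate == together)       # False
print(nfc(separate) == together)  # True: renormalize after concatenating
print(unicodedata.unidata_version)
```

In practice this means text assembled from separately normalized pieces should be normalized again before it is tokenized or indexed.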