Principle:Vespa engine Vespa Text Transformation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Text transformation removes diacritical marks (accents, tildes, umlauts, cedillas) from characters to enable accent-insensitive matching, using Unicode canonical decomposition followed by combining mark removal.
Description
Diacritical marks are glyphs added to base letters to modify their pronunciation or meaning. Examples include:
- Acute accent: é (as in "résumé")
- Umlaut: ü (as in "über")
- Tilde: ñ (as in "cañón")
- Cedilla: ç (as in "façade")
In information retrieval, users frequently search without diacritical marks, either because their keyboard does not support them, because they are unfamiliar with the correct accented form, or simply out of convenience. A search for "resume" should match documents containing "résumé", and vice versa.
Accent dropping (also called diacritical mark removal or accent folding) addresses this by stripping combining marks from text at both index time and query time, producing a normalized form where accented and unaccented variants map to the same representation.
The technique relies on a two-step process:
- NFD decomposition: The input string is decomposed using Unicode Normalization Form D (Canonical Decomposition). This separates precomposed characters into their base character plus combining marks. For example, "é" (U+00E9) becomes "e" (U+0065) followed by the combining acute accent (U+0301).
- Combining mark removal: All characters in the Unicode "Combining Diacritical Marks" block (U+0300 to U+036F) are removed from the decomposed string, leaving only the base characters.
This approach is language-independent and treats every mark in the Combining Diacritical Marks block uniformly. However, it is a lossy transformation: distinct characters that differ only in their diacritical marks become identical. In some languages this merges distinct letters or words with different meanings (e.g., in Turkish, "I" and "İ", I with a dot above, are separate letters). For most information retrieval applications, the benefit of increased recall outweighs this loss of precision.
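The lossy behavior described above can be checked directly. The following is a minimal Python sketch, not Vespa's implementation; `accent_drop` is a hypothetical helper name, and it filters the U+0300..U+036F range by code point rather than with a regex.

```python
import unicodedata


def accent_drop(text: str) -> str:
    # NFD-decompose, then discard code points in U+0300..U+036F.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")


# Intended merge: accented and unaccented spellings collapse together.
same_word = accent_drop("résumé") == accent_drop("resume")

# Unintended merge: Turkish dotted capital I (U+0130) decomposes to
# I + COMBINING DOT ABOVE (U+0307), so it collapses onto plain I.
turkish_merge = accent_drop("\u0130") == "I"

# Dotless i (U+0131) has no canonical decomposition, so it is left untouched.
dotless_kept = accent_drop("\u0131") == "\u0131"

print(same_word, turkish_merge, dotless_kept)
```

Only characters that have a canonical decomposition into a base letter plus a mark are affected; letters that are encoded as atomic code points pass through unchanged.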
Usage
Accent dropping should be applied:
- During tokenization: As part of the token processing pipeline, after normalization and before or alongside case folding.
- At both index time and query time: Both must apply the same transformation for matching to work.
- When building accent-insensitive search: This is the standard approach for European language search.
- In combination with case folding: Accent dropping and lowercasing together produce a maximally normalized token form.
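The requirement that index time and query time share one transformation can be illustrated with a toy inverted index. This is a sketch under stated assumptions: `normalize_token`, `build_index`, and `search` are hypothetical names, tokenization is plain whitespace splitting, and none of this reflects Vespa's actual pipeline.

```python
import unicodedata
from collections import defaultdict


def normalize_token(token: str) -> str:
    # Accent dropping (NFD + removal of U+0300..U+036F), then case folding.
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")
    return stripped.lower()


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    # Index time: every whitespace-separated token is stored in normalized form.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize_token(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    # Query time: the same normalization is applied before the lookup.
    return set(index.get(normalize_token(query), set()))


docs = {1: "Send your Résumé", 2: "resume the download"}
index = build_index(docs)
print(search(index, "resume"))   # matches both documents
print(search(index, "RÉSUMÉ"))   # matches both documents
```

If `search` skipped `normalize_token`, the accented query would look up a key that was never stored, which is why both sides must agree.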
Accent dropping may not be appropriate when:
- The application requires accent-sensitive matching (e.g., a dictionary application).
- The language treats accented characters as entirely separate letters (e.g., Swedish "å").
- The distinction between accented forms carries important semantic meaning.
Theoretical Basis
The accent dropping algorithm can be expressed precisely as a composition of NFD normalization and regex-based combining mark removal:
function accentDrop(input):
    // Step 1: Decompose to NFD form
    // This separates base characters from combining marks
    decomposed = unicodeNormalize(input, NFD)

    // Step 2: Remove all combining diacritical marks
    // The Unicode block \p{InCombiningDiacriticalMarks} covers U+0300..U+036F
    result = regexReplace(decomposed, pattern="\p{InCombiningDiacriticalMarks}+", replacement="")

    return result
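The pseudocode above can be made runnable in a few lines. The following Python sketch is one possible rendering, not the source's implementation: Python's built-in `re` module does not support the `\p{InCombiningDiacriticalMarks}` shorthand, so the block is written as an explicit character range, and the name `accent_drop` is an assumption.

```python
import re
import unicodedata

# Explicit range for the Combining Diacritical Marks block (U+0300..U+036F);
# Python's built-in `re` has no \p{InCombiningDiacriticalMarks} shorthand.
_COMBINING_MARKS = re.compile("[\u0300-\u036f]+")


def accent_drop(text: str) -> str:
    """Return `text` with combining diacritical marks removed."""
    # Step 1: canonical decomposition (NFD) splits base characters from marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: delete every run of combining marks in the block.
    return _COMBINING_MARKS.sub("", decomposed)


if __name__ == "__main__":
    print(accent_drop("Cliché resumé naïve"))  # Cliche resume naive
```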
Detailed Example
Consider the input string "Cliché resumé naïve":
| Step | Value | Explanation |
|---|---|---|
| Input | Cliché resumé naïve | Original text with accented characters |
| NFD Decomposition | Cliché resumé naïve | Accents separated as combining marks |
| Remove Combining Marks | Cliche resume naive | All combining diacritical marks stripped |
Character-Level Detail
For the character "é" (U+00E9):
| Step | Code Points | Description |
|---|---|---|
| Original | U+00E9 | LATIN SMALL LETTER E WITH ACUTE (precomposed) |
| NFD Decomposition | U+0065 U+0301 | LATIN SMALL LETTER E + COMBINING ACUTE ACCENT |
| Mark Removal | U+0065 | LATIN SMALL LETTER E (base character only) |
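The code points in the table can be reproduced with Python's standard `unicodedata` module. This is an illustrative sketch; `describe` is a hypothetical helper that formats each code point with its Unicode name.

```python
import unicodedata


def describe(text: str) -> list[str]:
    # Render each code point as "U+XXXX NAME" for inspection.
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


original = "\u00e9"                                  # precomposed e with acute
decomposed = unicodedata.normalize("NFD", original)  # base letter + mark
stripped = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")

for label, value in [("Original", original),
                     ("NFD", decomposed),
                     ("Mark removal", stripped)]:
    print(f"{label}: {describe(value)}")
```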
Key theoretical considerations:
- NFD vs. NFKD: Using NFD preserves compatibility characters while only decomposing canonical equivalences. Using NFKD would additionally decompose compatibility equivalences (ligatures, width variants), which may or may not be desired.
- Regex scope: The pattern \p{InCombiningDiacriticalMarks} covers only the basic Combining Diacritical Marks block. Some scripts use combining marks outside this block; a more comprehensive approach would use \p{M} (all Unicode marks), but this risks removing marks that are essential in certain scripts.
- Composability: Accent dropping is designed to compose with other transformations (case folding, NFKC normalization) in a text processing pipeline. The order of operations matters: accent dropping should generally occur before NFKC re-normalization.
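The first two considerations can be compared side by side. The sketch below is illustrative only: `strip_block` and `strip_all_marks` are hypothetical names, and the `\p{M}` behavior is approximated by checking the Unicode general category, since Python's built-in `re` lacks property classes.

```python
import unicodedata


def strip_block(text: str, form: str = "NFD") -> str:
    # Remove only U+0300..U+036F after normalizing to `form`.
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")


def strip_all_marks(text: str, form: str = "NFD") -> str:
    # Remove every code point whose general category is M* (Mn, Mc, Me),
    # which corresponds to \p{M} in regex engines that support it.
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


ligature = "\ufb01n"          # "fin" written with LATIN SMALL LIGATURE FI
print(strip_block(ligature, "NFD") == ligature)   # ligature survives NFD
print(strip_block(ligature, "NFKD") == "fin")     # NFKD expands it

hindi = "\u0915\u093f"        # KA + DEVANAGARI VOWEL SIGN I (a spacing mark)
print(strip_block(hindi) == hindi)                # outside the block: kept
print(strip_all_marks(hindi) == "\u0915")         # \p{M}-style: vowel removed
```

The Devanagari case shows the risk named above: the vowel sign is part of the syllable, so removing all marks changes the word rather than merely folding an accent.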