
Principle:Vespa engine Vespa Text Transformation

From Leeroopedia


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

Text transformation removes diacritical marks (accents, tildes, umlauts, cedillas) from characters to enable accent-insensitive matching, using Unicode canonical decomposition followed by combining mark removal.

Description

Diacritical marks are glyphs added to base letters to modify their pronunciation or meaning. Examples include:

  • Acute accent: é (as in "résumé")
  • Umlaut: ü (as in "über")
  • Tilde: ñ (as in "cañon")
  • Cedilla: ç (as in "façade")

In information retrieval, users frequently search without diacritical marks, either because their keyboard does not support them, because they are unfamiliar with the correct accented form, or simply out of convenience. A search for "resume" should match documents containing "résumé", and vice versa.

Accent dropping (also called diacritical mark removal or accent folding) addresses this by stripping combining marks from text at both index time and query time, producing a normalized form where accented and unaccented variants map to the same representation.

The technique relies on a two-step process:

  1. NFD decomposition: The input string is decomposed using Unicode Normalization Form D (Canonical Decomposition). This separates precomposed characters into their base character plus combining marks. For example, "é" (U+00E9) becomes "e" (U+0065) followed by the combining acute accent (U+0301).
  2. Combining mark removal: All characters in the Unicode "Combining Diacritical Marks" block (U+0300 to U+036F) are removed from the decomposed string, leaving only the base characters.

This approach is language-independent and treats every mark in that block uniformly. However, it is a lossy transformation: distinct characters that differ only in their diacritical marks become identical. In some languages this merges letters or words with different meanings (e.g., in Turkish, "I" and "İ", the capital I with a dot above, are distinct letters, yet "İ" decomposes to "I" plus a combining dot and is reduced to "I"). For most information retrieval applications, the gain in recall outweighs this loss of precision.
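The lossiness described above can be illustrated with a small Python sketch. The helper name `accent_drop` is an assumption made for this illustration, not a name taken from Vespa; it applies NFD and then drops code points in U+0300..U+036F.

```python
import unicodedata

def accent_drop(text: str) -> str:
    """Illustrative helper: NFD-decompose, then drop U+0300..U+036F."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))

# Two different German words collapse to the same normalized form.
print(accent_drop("sch\u00f6n") == accent_drop("schon"))  # True

# Turkish capital dotted I (U+0130) decomposes to "I" + U+0307 and loses its dot.
print(accent_drop("\u0130"))  # I
```

Whether such collisions matter depends on the language mix of the corpus, so this is a property to check rather than a defect in every deployment.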

Usage

Accent dropping should be applied:

  • During tokenization: As part of the token processing pipeline, after normalization and before or alongside case folding.
  • At both index time and query time: Both must apply the same transformation for matching to work.
  • When building accent-insensitive search: This is the standard approach for European language search.
  • In combination with case folding: Accent dropping and lowercasing together produce a maximally normalized token form.
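As a rough sketch of the points above, the following Python example applies one normalization function at both index time and query time and combines accent dropping with case folding. The toy inverted index and the names `normalize_token` and `search` are hypothetical and exist only for this illustration.

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Illustrative pipeline step: accent dropping followed by case folding."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))
    return stripped.casefold()

# Index time: every token is normalized before it is stored.
documents = {1: "Résumé tips", 2: "Resume the download", 3: "Café menu"}
index: dict[str, set[int]] = {}
for doc_id, text in documents.items():
    for token in text.split():
        index.setdefault(normalize_token(token), set()).add(doc_id)

# Query time: the query goes through the same function.
def search(query: str) -> list[int]:
    return sorted(index.get(normalize_token(query), set()))

print(search("resume"))  # [1, 2]
print(search("CAFÉ"))    # [3]
```

If only one side were normalized, the accented and unaccented forms would land under different index keys and the lookup would miss.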

Accent dropping may not be appropriate when:

  • The application requires accent-sensitive matching (e.g., a dictionary application).
  • The language treats accented characters as entirely separate letters (e.g., Swedish "å", a-ring).
  • The distinction between accented forms carries important semantic meaning.

Theoretical Basis

The accent dropping algorithm can be expressed precisely as a composition of NFD normalization and regex-based combining mark removal:

function accentDrop(input):
    // Step 1: Decompose to NFD form
    // This separates base characters from combining marks
    decomposed = unicodeNormalize(input, NFD)

    // Step 2: Remove all combining diacritical marks
    // The Unicode block \p{InCombiningDiacriticalMarks} covers U+0300..U+036F
    result = regexReplace(decomposed, pattern="\p{InCombiningDiacriticalMarks}+", replacement="")

    return result
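The pseudocode above could be rendered in Python roughly as follows. This is a sketch, not the Vespa implementation: Python's built-in `re` module has no `\p{...}` block syntax, so the block is written out as the explicit range U+0300..U+036F, and the function name `accent_drop` is chosen here for illustration.

```python
import re
import unicodedata

# Explicit range standing in for \p{InCombiningDiacriticalMarks} (U+0300..U+036F).
_COMBINING_MARKS = re.compile("[\u0300-\u036f]+")

def accent_drop(text: str) -> str:
    # Step 1: decompose to NFD so base characters and marks are separate code points.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: remove every run of combining diacritical marks.
    return _COMBINING_MARKS.sub("", decomposed)

print(accent_drop("Cliché resumé naïve"))  # Cliche resume naive
```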

Detailed Example

Consider the input string "Cliché resumé naïve":

Step | Value | Explanation
Input | Cliché resumé naïve | Original text with precomposed accented characters
NFD Decomposition | Cliché resumé naïve | Renders identically, but each accent is now a separate combining mark (U+0301 or U+0308)
Remove Combining Marks | Cliche resume naive | All combining diacritical marks stripped

Character-Level Detail

For the character "é" (U+00E9):

Step | Code Points | Description
Original | U+00E9 | LATIN SMALL LETTER E WITH ACUTE (precomposed)
NFD Decomposition | U+0065 U+0301 | LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
Mark Removal | U+0065 | LATIN SMALL LETTER E (base character only)
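The character-level trace can be reproduced with Python's standard `unicodedata` module. The helper `describe` is a name introduced only for this sketch.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """List each code point of `text` with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

original = "\u00e9"  # precomposed e with acute
decomposed = unicodedata.normalize("NFD", original)
stripped = "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))

print(describe(original))    # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(describe(decomposed))  # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
print(describe(stripped))    # ['U+0065 LATIN SMALL LETTER E']
```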

Key theoretical considerations:

  • NFD vs. NFKD: Using NFD preserves compatibility characters while only decomposing canonical equivalences. Using NFKD would additionally decompose compatibility equivalences (ligatures, width variants), which may or may not be desired.
  • Regex scope: The pattern \p{InCombiningDiacriticalMarks} covers the basic combining diacritical marks block. Some scripts use combining marks outside this block; a more comprehensive approach would use \p{M} (all Unicode marks) but this risks removing marks that are essential in certain scripts.
  • Composability: Accent dropping is designed to compose with other transformations (case folding, NFKC normalization) in a text processing pipeline. The order of operations matters: accent dropping should generally occur before NFKC re-normalization.
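The first two considerations can be compared side by side in a short Python sketch. The function names `strip_block` and `strip_all_marks` are assumptions made for this illustration; the second approximates `\p{M}` by dropping every code point whose general category starts with "M".

```python
import unicodedata

def strip_block(text: str, form: str = "NFD") -> str:
    """Drop only code points in the Combining Diacritical Marks block."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))

def strip_all_marks(text: str, form: str = "NFD") -> str:
    """Drop every code point in a mark category (Mn, Mc, Me), like \\p{M}."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

# NFD leaves the "fi" ligature (U+FB01) alone; NFKD expands it to two letters.
print(strip_block("\ufb01") == "\ufb01")   # True
print(strip_block("\ufb01", form="NFKD"))  # fi

# U+20D7 is a combining mark outside U+0300..U+036F: only the broad filter removes it.
print(strip_block("x\u20d7") == "x\u20d7")  # True
print(strip_all_marks("x\u20d7"))           # x

# Devanagari KA + VOWEL SIGN I: the broad filter deletes the vowel, changing the syllable.
print(strip_all_marks("\u0915\u093f") == "\u0915")  # True
```

The last case shows why the narrower block is the more conservative choice for a mixed-script corpus.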
