
Implementation:Vespa engine Vespa SimpleTransformer AccentDrop

From Leeroopedia


Knowledge Sources
Domains: NLP, Text_Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by Vespa's linguistics library, for removing diacritical marks from text. It applies Unicode NFD decomposition and then removes combining marks with a regex, producing accent-free text for accent-insensitive matching.

Description

The SimpleTransformer class implements the Transformer interface and provides accent dropping (diacritical mark removal) as its primary transformation capability.

The accentDrop method implements the standard two-step accent removal algorithm:

  1. NFD decomposition: The input string is normalized to Unicode Normalization Form D (Canonical Decomposition) using java.text.Normalizer.normalize(input, Normalizer.Form.NFD). This separates precomposed characters into base characters followed by combining marks. For example, "é" (U+00E9) is decomposed into "e" (U+0065) + combining acute accent (U+0301).
  2. Combining mark removal: A precompiled regex pattern \p{InCombiningDiacriticalMarks}+ is applied to remove all characters in the Combining Diacritical Marks Unicode block (U+0300 through U+036F).
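
The two steps above can be reproduced with the JDK alone. The following is a minimal sketch, not the Vespa source: the class name AccentDropSketch and the helper names decompose and stripMarks are made up for illustration, and the language parameter is omitted.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentDropSketch {
    // Compiled once, mirroring the static constant described in this section.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Step 1: canonical decomposition (NFD).
    public static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    // Step 2: delete every code point in the block U+0300..U+036F.
    public static String stripMarks(String decomposed) {
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static String accentDrop(String input) {
        return stripMarks(decompose(input));
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00e9";             // "café", 4 chars
        String decomposed = decompose(precomposed);   // "cafe" followed by U+0301
        System.out.println(decomposed.length());      // prints 5
        System.out.println(accentDrop(precomposed));  // prints cafe
    }
}
```

Splitting the two steps into separate helpers is only for readability here; the method shown under Method Body chains them in a single expression.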

The regex pattern is compiled once as a static constant, ensuring that it is not recompiled on every method invocation. This is important for performance since accentDrop is called for every token during tokenization.

The language parameter is accepted but not currently used: the same accent-dropping logic is applied regardless of language. The parameter exists for future extensibility (e.g., language-specific rules for the Turkish dotted and dotless i).
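
To make the Turkish case concrete, the sketch below reproduces the language-independent algorithm with the JDK only (the class name TurkishISketch is hypothetical, and this is not Vespa code) and applies it to the two Turkish-specific letters. It only illustrates what the generic rule does today; it makes no claim about what a future language-specific rule would do.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class TurkishISketch {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Same two-step algorithm as described above, with no language parameter.
    public static String accentDrop(String input) {
        return MARKS.matcher(Normalizer.normalize(input, Normalizer.Form.NFD)).replaceAll("");
    }

    public static void main(String[] args) {
        String dottedCapitalI = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        String dotlessSmallI = "\u0131";  // LATIN SMALL LETTER DOTLESS I

        // U+0130 decomposes to "I" + U+0307 (combining dot above), so the dot is dropped.
        System.out.println(accentDrop(dottedCapitalI));                      // prints I

        // U+0131 has no canonical decomposition and passes through unchanged.
        System.out.println(accentDrop(dotlessSmallI).equals(dotlessSmallI)); // prints true
    }
}
```

The generic rule therefore maps the dotted capital to a plain "I" while leaving the dotless small letter alone, which is the kind of asymmetry a language-aware implementation could treat differently.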

Usage

Use SimpleTransformer.accentDrop() during token processing to normalize accented characters to their base forms. This is typically invoked as part of the SimpleTokenizer pipeline, but it can also be called directly when accent-insensitive string comparison or matching is needed.
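
For the direct-call case, accent-insensitive comparison amounts to folding both strings and comparing the results. The sketch below is an assumption-laden illustration: AccentInsensitiveMatch, fold, and matches are hypothetical names, the accent dropping is re-implemented with the JDK so the example runs without Vespa on the classpath, and the lowercasing step is an addition that accentDrop itself does not perform.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

public class AccentInsensitiveMatch {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Drop accents, then lowercase with a fixed locale so the result
    // does not depend on the JVM's default locale.
    public static String fold(String s) {
        String stripped = MARKS.matcher(Normalizer.normalize(s, Normalizer.Form.NFD)).replaceAll("");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // Two strings match if their folded forms are equal.
    public static boolean matches(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        // "Crème Brûlée" against an unaccented, lowercase query.
        System.out.println(matches("Cr\u00e8me Br\u00fbl\u00e9e", "creme brulee")); // prints true
        // Different base letters still differ after folding.
        System.out.println(matches("caf\u00e9", "case"));                           // prints false
    }
}
```

With Vespa available, the body of fold could call transformer.accentDrop(s, language) instead of the inline regex.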

Code Reference

Source Location

  • Repository: Vespa
  • File: linguistics/src/main/java/com/yahoo/language/simple/SimpleTransformer.java
  • Lines: 21-23

Signature

@Override
public String accentDrop(String input, Language language)

Class Declaration

public class SimpleTransformer implements Transformer

Package

package com.yahoo.language.simple;

Key Constant

private static final Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

Method Body

@Override
public String accentDrop(String input, Language language) {
    return pattern.matcher(Normalizer.normalize(input, Normalizer.Form.NFD)).replaceAll("");
}
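
A few boundary cases follow from how this one-liner is built. The sketch below re-implements the same expression with the JDK only (AccentDropLimits is a hypothetical class name, not part of Vespa) and probes three of them: letters with no canonical decomposition, combining marks outside the U+0300..U+036F block, and the fact that the result is left in decomposed form.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentDropLimits {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String accentDrop(String input) {
        return MARKS.matcher(Normalizer.normalize(input, Normalizer.Form.NFD)).replaceAll("");
    }

    public static void main(String[] args) {
        // "ø" (U+00F8) has no canonical decomposition, so there is no mark to strip.
        System.out.println(accentDrop("\u00f8").equals("\u00f8")); // prints true

        // U+20D7 is a combining mark, but it lives in the block
        // "Combining Diacritical Marks for Symbols" (U+20D0..U+20FF),
        // which the regex does not cover.
        System.out.println(accentDrop("x\u20d7").length());        // prints 2

        // The result stays in NFD: the precomposed Hangul syllable U+D55C
        // comes back as three conjoining jamo.
        System.out.println(accentDrop("\ud55c").length());         // prints 3
    }
}
```

Callers that need a composed result could apply Normalizer.Form.NFC to the output afterwards; whether that is appropriate depends on the surrounding pipeline.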

I/O Contract

Inputs

Name | Type | Required | Description
input | String | Yes | The text from which to remove diacritical marks. Can contain any Unicode characters. Characters without diacritical marks pass through unchanged.
language | Language | Yes | The language of the input text. Currently not used by the implementation (the same logic applies to all languages), but accepted for interface compliance and future extensibility.

Outputs

Name | Type | Description
(return value) | String | The input text with all combining diacritical marks removed. Base characters are preserved. For example, "café" becomes "cafe" and "über" becomes "uber".

Usage Examples

Basic Usage

import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.Language;

SimpleTransformer transformer = new SimpleTransformer();

// Remove accents from French text
String result = transformer.accentDrop("crème brûlée", Language.FRENCH);
// result -> "creme brulee"

// Remove accents from German text
String german = transformer.accentDrop("München Übersicht", Language.GERMAN);
// german -> "Munchen Ubersicht"

Multiple Diacritical Marks

import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.Language;

SimpleTransformer transformer = new SimpleTransformer();

// Characters with multiple combining marks: "ệ" decomposes to "e" + dot below + circumflex
String vietnamese = transformer.accentDrop("Việt Nam", Language.UNKNOWN);
// vietnamese -> "Viet Nam" (all combining marks are removed regardless of count)

// Spanish text
String spanish = transformer.accentDrop("El niño está aquí mañana", Language.SPANISH);
// spanish -> "El nino esta aqui manana"

Integration with Tokenizer

import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.simple.SimpleNormalizer;
import com.yahoo.language.Language;

SimpleNormalizer normalizer = new SimpleNormalizer();
SimpleTransformer transformer = new SimpleTransformer();

public String normalizeToken(String token, Language language) {
    // Step 1: NFKC normalize (handle compatibility characters)
    String normalized = normalizer.normalize(token);

    // Step 2: Drop accents (NFD decompose + remove combining marks)
    String accentFree = transformer.accentDrop(normalized, language);

    // Step 3: Lowercase (Locale.ROOT keeps the result independent of the JVM's default locale)
    return accentFree.toLowerCase(java.util.Locale.ROOT);
}

// Example: normalizeToken("Résumé", Language.ENGLISH)
// -> NFKC: "Résumé" (no change)
// -> accentDrop: "Resume"
// -> toLowerCase: "resume"

Idempotency Demonstration

import com.yahoo.language.simple.SimpleTransformer;
import com.yahoo.language.Language;

SimpleTransformer transformer = new SimpleTransformer();

String input = "café";
String once = transformer.accentDrop(input, Language.ENGLISH);
String twice = transformer.accentDrop(once, Language.ENGLISH);
// once.equals(twice) -> true (applying accent drop twice gives the same result)
// once -> "cafe"
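
The same property can be checked over several inputs at once. The sketch below uses a JDK-only re-implementation of the algorithm (AccentDropIdempotency and isIdempotentOn are hypothetical names, not Vespa API) and also confirms that precomposed and already-decomposed spellings of the same word produce the same output.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentDropIdempotency {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String accentDrop(String input) {
        return MARKS.matcher(Normalizer.normalize(input, Normalizer.Form.NFD)).replaceAll("");
    }

    // True when a second application leaves the first result unchanged.
    public static boolean isIdempotentOn(String input) {
        String once = accentDrop(input);
        return once.equals(accentDrop(once));
    }

    public static void main(String[] args) {
        String[] samples = {
            "caf\u00e9",        // precomposed "café"
            "cafe\u0301",       // decomposed "café"
            "Vi\u1ec7t Nam",    // two marks on one base letter
            "plain ascii",
            ""
        };
        for (String s : samples) {
            System.out.println(isIdempotentOn(s)); // prints true for every sample
        }
        // Both spellings of "café" collapse to the same string.
        System.out.println(accentDrop("caf\u00e9").equals(accentDrop("cafe\u0301"))); // prints true
    }
}
```

Idempotency holds because the first pass leaves a string that is already in NFD and contains no characters from the targeted block, so the second pass has nothing to change.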

Related Pages

Implements Principle
