Implementation:Openai Whisper BasicTextNormalizer

From Leeroopedia
Revision as of 13:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Openai_Whisper_BasicTextNormalizer.md)
Knowledge Sources
Domains NLP, Text_Normalization, Evaluation
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool for language-agnostic text normalization provided by the Whisper normalizers module, handling Unicode symbol removal, diacritic stripping, and basic text cleanup.

Description

The basic.py module provides multilingual text normalization utilities:

  • ADDITIONAL_DIACRITICS: A mapping of special non-ASCII characters that are not decomposed by Unicode NFKD normalization (e.g., "œ" → "oe", "ß" → "ss", "ð" → "d", "þ" → "th").
  • remove_symbols_and_diacritics(s, keep=""): Applies Unicode NFKD normalization, then handles each character in turn: characters listed in keep pass through unchanged, characters found in ADDITIONAL_DIACRITICS are replaced by their mapped value, combining marks (Unicode category Mn) are dropped, and any remaining mark, symbol, or punctuation character (categories beginning with M, S, or P) becomes a space.
  • remove_symbols(s): Applies Unicode NFKC normalization and replaces mark, symbol, and punctuation characters with spaces while preserving diacritics.
  • BasicTextNormalizer: The main class. It lowercases text, removes content enclosed in brackets or parentheses, applies remove_symbols (or remove_symbols_and_diacritics when remove_diacritics=True), optionally splits the text into individual Unicode graphemes, and collapses runs of whitespace into a single space.
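
The per-character logic described for remove_symbols_and_diacritics can be approximated with the standard library alone. The following is a minimal sketch, not the module's source: the names MANUAL_MAP and sketch_remove_symbols_and_diacritics are hypothetical, and MANUAL_MAP is only a small subset of ADDITIONAL_DIACRITICS chosen for illustration.

```python
import unicodedata

# Hypothetical subset of ADDITIONAL_DIACRITICS, for illustration only.
MANUAL_MAP = {"œ": "oe", "ß": "ss", "ð": "d", "þ": "th", "ø": "o", "ł": "l"}


def sketch_remove_symbols_and_diacritics(s: str, keep: str = "") -> str:
    """Approximate the described behaviour using only unicodedata."""
    out = []
    for c in unicodedata.normalize("NFKD", s):
        category = unicodedata.category(c)
        if c in keep:
            out.append(c)               # explicitly preserved character
        elif c in MANUAL_MAP:
            out.append(MANUAL_MAP[c])   # letters that NFKD leaves intact
        elif category == "Mn":
            continue                    # drop combining diacritical marks
        elif category[0] in "MSP":
            out.append(" ")             # other marks, symbols, punctuation
        else:
            out.append(c)               # letters, digits, whitespace
    return "".join(out)


result = sketch_remove_symbols_and_diacritics("Straße, café!")
# result == "Strasse  cafe " (the comma and "!" each become a space)
```

Each replaced symbol yields its own space, so runs of spaces are expected in the output of this function; nothing at this level collapses them.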

Usage

Import BasicTextNormalizer when evaluating Whisper transcription output for non-English languages where English-specific normalization (contractions, number spelling, British/American variants) is not applicable. Use remove_symbols_and_diacritics directly when building custom normalization pipelines.

Code Reference

Source Location

Signature

ADDITIONAL_DIACRITICS = {
    "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE", "ß": "ss", "ẞ": "SS",
    "đ": "d", "Đ": "D", "ð": "d", "Ð": "D",
    "þ": "th", "Þ": "th", "ł": "l", "Ł": "L",
}

def remove_symbols_and_diacritics(s: str, keep: str = "") -> str:
    """Replace markers, symbols, punctuations with space; drop diacritics."""
    ...

def remove_symbols(s: str) -> str:
    """Replace markers, symbols, punctuations with space; keep diacritics."""
    ...

class BasicTextNormalizer:
    def __init__(
        self,
        remove_diacritics: bool = False,
        split_letters: bool = False,
    ) -> None: ...
    def __call__(self, s: str) -> str: ...
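
Because the bodies above are elided, the order of operations inside __call__ is easier to see in a standalone approximation. The sketch below follows the steps listed in the Description and uses only the standard library. The function names and regular expressions are assumptions made for illustration, the whitespace-collapsing step is inferred from the single-spaced outputs in the usage examples, and grapheme splitting is left out because it needs more than the standard library.

```python
import re
import unicodedata


def sketch_remove_symbols(s: str) -> str:
    # NFKC keeps accented letters composed, so diacritics survive.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


def sketch_basic_normalize(s: str) -> str:
    """Hypothetical stand-in for BasicTextNormalizer() with default options."""
    s = s.lower()
    s = re.sub(r"\[[^\]]*\]", "", s)   # drop [bracketed] content (assumed pattern)
    s = re.sub(r"\([^)]*\)", "", s)    # drop (parenthesized) content (assumed pattern)
    s = sketch_remove_symbols(s)       # symbols and punctuation become spaces
    return re.sub(r"\s+", " ", s)      # collapse runs of whitespace


normalized = sketch_basic_normalize("Héllo, Wörld! [applause]")
# normalized == "héllo wörld " (note the trailing space)
```

The sketch does not strip leading or trailing whitespace, which matches the trailing space visible in the example outputs on this page.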

Import

from whisper.normalizers import BasicTextNormalizer
from whisper.normalizers.basic import remove_symbols_and_diacritics, remove_symbols

I/O Contract

Inputs

Name | Type | Required | Description
s | str | Yes | Raw text to normalize
remove_diacritics | bool | No | If True, strip diacritical marks (default: False)
split_letters | bool | No | If True, split text into individual Unicode graphemes (default: False)
keep | str | No | Characters to preserve during symbol removal (for remove_symbols_and_diacritics only)

Outputs

Name | Type | Description
BasicTextNormalizer.__call__ returns | str | Normalized lowercase text with symbols removed and optional diacritic stripping
remove_symbols_and_diacritics returns | str | Text with symbols replaced by spaces and diacritics removed
remove_symbols returns | str | Text with symbols replaced by spaces, diacritics preserved

Usage Examples

Basic Multilingual Normalization

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# Symbols removed, case lowered, diacritics kept
text = normalizer("Héllo, Wörld! [applause]")
# Output: "héllo wörld "  (the trailing space is real; the result is not stripped)

With Diacritic Removal

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer(remove_diacritics=True)

# Diacritics stripped so accented and unaccented spellings compare equal
text = normalizer("Héllo, Wörld! [applause]")
# Output: "hello world "
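
Diacritic removal depends on NFKD splitting an accented letter into a base letter plus a combining mark, which is why letters without such a decomposition need the manual ADDITIONAL_DIACRITICS mapping. The short check below illustrates both cases with the standard unicodedata module; the variable names are arbitrary.

```python
import unicodedata

# "é" (U+00E9) splits into a base letter plus a combining acute accent.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
parts = [(ch, unicodedata.category(ch)) for ch in decomposed]
# parts == [("e", "Ll"), ("\u0301", "Mn")]

# "ø" (U+00F8) has no decomposition, so NFKD returns it unchanged.
unchanged = unicodedata.normalize("NFKD", "\u00f8")
# unchanged == "\u00f8"
```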

Direct Symbol Removal

from whisper.normalizers.basic import remove_symbols_and_diacritics

# Keep "$" and "." while replacing other symbols and punctuation with spaces
clean = remove_symbols_and_diacritics("Price: $20.50 (café)", keep="$.")
# Output: "Price  $20.50  cafe "

Letter Splitting for CER

from whisper.normalizers import BasicTextNormalizer

# Split into graphemes for Character Error Rate computation
normalizer = BasicTextNormalizer(split_letters=True)
text = normalizer("Hello")
# Output: "h e l l o"
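
Once text is split into space-separated graphemes, a character error rate can be computed by treating each grapheme as a token. The sketch below assumes CER is defined as the Levenshtein distance between reference and hypothesis tokens divided by the number of reference tokens. The helper names are hypothetical and are not part of Whisper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref_split: str, hyp_split: str) -> float:
    """CER over space-separated graphemes (assumed definition)."""
    ref, hyp = ref_split.split(), hyp_split.split()
    if not ref:
        raise ValueError("reference must contain at least one grapheme")
    return edit_distance(ref, hyp) / len(ref)


cer = character_error_rate("h e l l o", "h a l l o")
# One substitution over five reference graphemes: cer == 0.2
```

Splitting on whitespace discards word boundaries, so this variant scores characters only. Whether spaces should count as characters is a choice the evaluation setup has to make.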
