Implementation:Openai Whisper BasicTextNormalizer

Knowledge Sources	Openai_Whisper
Domains	NLP, Text_Normalization, Evaluation
Last Updated	2026-02-13 22:00 GMT

Overview

Concrete tool for language-agnostic text normalization provided by the Whisper normalizers module. It handles Unicode symbol and punctuation removal, optional diacritic stripping, and basic text cleanup (lowercasing, bracketed-content removal, whitespace collapsing).

Description

The basic.py module provides multilingual text normalization utilities:

ADDITIONAL_DIACRITICS: A mapping of special non-ASCII characters that are not decomposed by Unicode NFKD normalization (e.g., "œ" → "oe", "ß" → "ss", "ð" → "d", "þ" → "th").

remove_symbols_and_diacritics(s, keep=""): Applies Unicode NFKD normalization, then processes the result character by character: nonspacing combining marks (Unicode category Mn) are dropped, other marks, symbols, and punctuation (categories M, S, P) are replaced with spaces, and special letters are mapped via ADDITIONAL_DIACRITICS. Characters listed in the optional keep parameter are preserved unchanged.

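remove_symbols(s): Applies Unicode NFKC normalization and replaces marks, symbols, and punctuation (categories M, S, P) with spaces. Diacritics are preserved because NFKC composes a base letter and its combining marks into a single letter code point wherever a precomposed form exists.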

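BasicTextNormalizer: The main class. It lowercases text, removes content enclosed in square brackets, angle brackets, or parentheses, applies symbol removal (with or without diacritic stripping), optionally splits text into individual Unicode graphemes separated by spaces, and collapses runs of whitespace into a single space.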
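
The steps listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the module's own code: the names remove_symbols_sketch and basic_normalize_sketch are hypothetical, the regular expressions are an assumed reading of "bracketed and parenthesized content", and grapheme splitting is left out because it needs more than the standard re module.

```python
import re
import unicodedata


def remove_symbols_sketch(s: str) -> str:
    """Replace marks, symbols, and punctuation with spaces; keep diacritics."""
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


def basic_normalize_sketch(s: str) -> str:
    """Approximate the default (no diacritic removal, no splitting) behavior."""
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # assumed pattern: drop <...> and [...]
    s = re.sub(r"\(([^)]+?)\)", "", s)       # assumed pattern: drop (...)
    s = remove_symbols_sketch(s)
    return re.sub(r"\s+", " ", s)            # collapse whitespace runs


print(repr(basic_normalize_sketch("Héllo, Wörld! [applause] (laughs) <noise>")))
# 'héllo wörld '
```

Note that the result keeps a single trailing space: whitespace is collapsed, not trimmed, so callers that compare strings exactly may want to call strip() themselves.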

Usage

Import BasicTextNormalizer when evaluating Whisper transcription output for non-English languages where English-specific normalization (contractions, number spelling, British/American variants) is not applicable. Use remove_symbols_and_diacritics directly when building custom normalization pipelines.

Code Reference

Source Location

Repository: Openai_Whisper
File: whisper/normalizers/basic.py
Lines: 1-80

Signature

ADDITIONAL_DIACRITICS = {
    "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE", "ß": "ss", "ẞ": "SS",
    "đ": "d", "Đ": "D", "ð": "d", "Ð": "D",
    "þ": "th", "Þ": "th", "ł": "l", "Ł": "L",
}

def remove_symbols_and_diacritics(s: str, keep: str = "") -> str:
    """Replace markers, symbols, punctuations with space; drop diacritics."""
    ...

def remove_symbols(s: str) -> str:
    """Replace markers, symbols, punctuations with space; keep diacritics."""
    ...

class BasicTextNormalizer:
    def __init__(
        self,
        remove_diacritics: bool = False,
        split_letters: bool = False,
    ) -> None: ...
    def __call__(self, s: str) -> str: ...

Import

from whisper.normalizers import BasicTextNormalizer
from whisper.normalizers.basic import remove_symbols_and_diacritics, remove_symbols

I/O Contract

Inputs

Name	Type	Required	Description
s	str	Yes	Raw text to normalize
remove_diacritics	bool	No	If True, strip diacritical marks (default: False)
split_letters	bool	No	If True, split text into individual Unicode graphemes separated by spaces (default: False)
keep	str	No	Characters to preserve during symbol removal (for remove_symbols_and_diacritics only)

Outputs

Name	Type	Description
BasicTextNormalizer.__call__ returns	str	Normalized lowercase text with symbols removed, whitespace collapsed, and diacritics stripped if requested
remove_symbols_and_diacritics returns	str	Text with symbols replaced by spaces and diacritics removed
remove_symbols returns	str	Text with symbols replaced by spaces, diacritics preserved

Usage Examples

Basic Multilingual Normalization

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# Symbols removed, case lowered, diacritics kept
text = normalizer("Héllo, Wörld! [applause]")
# Output: "héllo wörld "

With Diacritic Removal

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer(remove_diacritics=True)

# Diacritics stripped for ASCII-only comparison
text = normalizer("Héllo, Wörld! [applause]")
# Output: "hello world "
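
To see why diacritic removal relies on both NFKD and a manual mapping, it can help to inspect what NFKD does to individual letters. The helper below, describe, is hypothetical and uses only Python's unicodedata module. It shows that "é" decomposes into a base letter plus a nonspacing mark (category Mn), while "ø" has no decomposition and therefore needs an entry in ADDITIONAL_DIACRITICS.

```python
import unicodedata


def describe(ch: str) -> list:
    """Return (code point, category) pairs for the NFKD form of ch."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c))
        for c in unicodedata.normalize("NFKD", ch)
    ]


print(describe("\u00e9"))  # é -> [('U+0065', 'Ll'), ('U+0301', 'Mn')]
print(describe("\u00f8"))  # ø -> [('U+00F8', 'Ll')]
```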

Direct Symbol Removal

from whisper.normalizers.basic import remove_symbols_and_diacritics

# Keep "$" and "." while replacing other symbols and punctuation with spaces
clean = remove_symbols_and_diacritics("Price: $20.50 (café)", keep="$.")
# Output: "Price  $20.50  cafe "
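
How keep interacts with the other rules can be illustrated with a small re-implementation. This is a sketch, not the library function: strip_sketch and EXTRA are hypothetical names, EXTRA holds only a subset of the mapping, and the order of the checks (keep first, then the manual mapping, then mark removal) is an assumption. One consequence worth noting is that keep is compared against NFKD-decomposed characters, so passing a precomposed letter such as "é" would not preserve it under this sketch.

```python
import unicodedata

# Illustrative subset of the manual mapping (assumption, not the full table).
EXTRA = {"ø": "o", "ß": "ss", "þ": "th"}


def strip_sketch(s: str, keep: str = "") -> str:
    out = []
    for c in unicodedata.normalize("NFKD", s):
        if c in keep:                                # 1. explicitly preserved
            out.append(c)
        elif c in EXTRA:                             # 2. manual transliteration
            out.append(EXTRA[c])
        elif unicodedata.category(c) == "Mn":        # 3. drop nonspacing marks
            continue
        elif unicodedata.category(c)[0] in "MSP":    # 4. other marks, symbols, punctuation
            out.append(" ")
        else:                                        # 5. everything else passes through
            out.append(c)
    return "".join(out)


print(repr(strip_sketch("Straße")))                      # 'Strasse'
print(repr(strip_sketch("søt", keep="ø")))               # 'søt'
print(repr(strip_sketch("caf\u00e9", keep="\u00e9")))    # 'cafe'
```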

Letter Splitting for CER

from whisper.normalizers import BasicTextNormalizer

# Split into graphemes for Character Error Rate computation
normalizer = BasicTextNormalizer(split_letters=True)
text = normalizer("Hello")
# Output: "h e l l o"
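
Once text is split into space-separated graphemes, a token-level edit distance yields a character error rate. The sketch below assumes its inputs already look like split_letters=True output; edit_distance and error_rate are hypothetical helpers written for illustration, not part of Whisper.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: str, hyp: str) -> float:
    """Edit distance over whitespace-separated tokens, divided by reference length."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Strings as they would look after split_letters=True normalization.
print(error_rate("h e l l o", "h a l l o"))  # 0.2
```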

Related Pages

Principle:Openai_Whisper_Basic_Text_Normalization
