Implementation:Openai Whisper BasicTextNormalizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Normalization, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A language-agnostic text normalizer from the Whisper normalizers module. It removes Unicode symbols and punctuation, optionally strips diacritics, and performs basic text cleanup.
Description
The basic.py module provides multilingual text normalization utilities:
- ADDITIONAL_DIACRITICS: A mapping of special non-ASCII characters that are not decomposed by Unicode NFKD normalization (e.g., "œ" → "oe", "ß" → "ss", "ð" → "d", "þ" → "th").
- remove_symbols_and_diacritics(s, keep=""): Applies Unicode NFKD normalization, then processes the result character by character: characters listed in keep pass through unchanged, special characters are mapped via ADDITIONAL_DIACRITICS, combining diacritical marks (Unicode category Mn) are dropped, and any remaining marks, symbols, and punctuation (categories M, S, P) are replaced with spaces.
- remove_symbols(s): Applies Unicode NFKC normalization and replaces marks, symbols, and punctuation with spaces while preserving diacritics.
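- BasicTextNormalizer: The main class. It lowercases text, removes bracketed and parenthesized content, applies remove_symbols (or remove_symbols_and_diacritics when remove_diacritics=True), optionally splits the text into individual Unicode graphemes, and collapses runs of whitespace into a single space.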
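The per-character rules for diacritic removal can be sketched with the standard library alone. This is an illustrative stand-in, not the library code: the names `EXTRA` and `strip_symbols_and_diacritics` are hypothetical, and `EXTRA` holds only a small subset of the real mapping.

```python
import unicodedata

# Hypothetical subset of the mapping, for illustration only.
EXTRA = {"œ": "oe", "ß": "ss", "ð": "d", "þ": "th"}


def strip_symbols_and_diacritics(s: str, keep: str = "") -> str:
    """Sketch of the rules described above (not the library implementation)."""
    out = []
    for c in unicodedata.normalize("NFKD", s):
        if c in keep:
            out.append(c)          # explicitly preserved character
        elif c in EXTRA:
            out.append(EXTRA[c])   # letter that NFKD does not decompose
        elif unicodedata.category(c) == "Mn":
            continue               # combining mark left behind by NFKD: drop it
        elif unicodedata.category(c)[0] in "MSP":
            out.append(" ")        # other marks, symbols, punctuation: space
        else:
            out.append(c)          # letters, digits, whitespace: unchanged
    return "".join(out)


print(repr(strip_symbols_and_diacritics("Straße, café!")))  # 'Strasse  cafe '
```

Note that each removed symbol becomes its own space, so the comma and the space that follows it yield two consecutive spaces in the result.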
Usage
Import BasicTextNormalizer when evaluating Whisper transcription output for non-English languages where English-specific normalization (contractions, number spelling, British/American variants) is not applicable. Use remove_symbols_and_diacritics directly when building custom normalization pipelines.
Code Reference
Source Location
- Repository: Openai_Whisper
- File: whisper/normalizers/basic.py
- Lines: 1-80
Signature
ADDITIONAL_DIACRITICS = {
    "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE", "ß": "ss", "ẞ": "SS",
    "đ": "d", "Đ": "D", "ð": "d", "Ð": "D",
    "þ": "th", "Þ": "th", "ł": "l", "Ł": "L",
}

def remove_symbols_and_diacritics(s: str, keep: str = "") -> str:
    """Replace markers, symbols, punctuations with space; drop diacritics."""
    ...

def remove_symbols(s: str) -> str:
    """Replace markers, symbols, punctuations with space; keep diacritics."""
    ...

class BasicTextNormalizer:
    def __init__(
        self,
        remove_diacritics: bool = False,
        split_letters: bool = False,
    ) -> None: ...

    def __call__(self, s: str) -> str: ...
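Since the signature elides the method bodies, the following is a minimal sketch of what the default `__call__` pipeline plausibly does, based on the steps listed in the Description. The function name `basic_normalize` is hypothetical, the regular expressions are assumptions rather than the library's own patterns, and grapheme splitting is omitted to keep the sketch standard-library only.

```python
import re
import unicodedata


def basic_normalize(s: str) -> str:
    """Stand-in for the described steps; not the library implementation."""
    s = s.lower()
    s = re.sub(r"\[[^\]]*\]", "", s)  # drop [bracketed] content (assumed pattern)
    s = re.sub(r"\([^)]*\)", "", s)   # drop (parenthesized) content (assumed pattern)
    # Replace marks, symbols, and punctuation with spaces.
    # NFKC keeps precomposed letters such as "é", so diacritics survive.
    s = "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
    # Collapse runs of whitespace; a trailing space is not stripped.
    return re.sub(r"\s+", " ", s)


print(repr(basic_normalize("Héllo, Wörld! [applause]")))  # 'héllo wörld '
```

The trailing space in the result comes from the space that preceded the removed bracketed segment; callers that need exact string equality may want to call `.strip()` on the output.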
Import
from whisper.normalizers import BasicTextNormalizer
from whisper.normalizers.basic import remove_symbols_and_diacritics, remove_symbols
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | str | Yes | Raw text to normalize |
| remove_diacritics | bool | No | If True, strip diacritical marks (default: False) |
| split_letters | bool | No | If True, split text into individual Unicode graphemes (default: False) |
| keep | str | No | Characters to preserve during symbol removal (for remove_symbols_and_diacritics only) |
Outputs
| Name | Type | Description |
|---|---|---|
| BasicTextNormalizer.__call__ returns | str | Normalized lowercase text with symbols removed and optional diacritic stripping |
| remove_symbols_and_diacritics returns | str | Text with symbols replaced by spaces and diacritics removed |
| remove_symbols returns | str | Text with symbols replaced by spaces, diacritics preserved |
Usage Examples
Basic Multilingual Normalization
from whisper.normalizers import BasicTextNormalizer
normalizer = BasicTextNormalizer()
# Symbols removed, case lowered, diacritics kept
text = normalizer("Héllo, Wörld! [applause]")
# Output: "héllo wörld "
With Diacritic Removal
from whisper.normalizers import BasicTextNormalizer
normalizer = BasicTextNormalizer(remove_diacritics=True)
# Diacritics stripped for ASCII-only comparison
text = normalizer("Héllo, Wörld! [applause]")
# Output: "hello world "
Direct Symbol Removal
from whisper.normalizers.basic import remove_symbols_and_diacritics
# Keep the dollar sign and the decimal point while removing other symbols
clean = remove_symbols_and_diacritics("Price: $20.50 (café)", keep="$.")
# Output: "Price  $20.50  cafe " (case is kept and repeated spaces are not collapsed)
Letter Splitting for CER
from whisper.normalizers import BasicTextNormalizer
# Split into graphemes for Character Error Rate computation
normalizer = BasicTextNormalizer(split_letters=True)
text = normalizer("Hello")
# Output: "h e l l o"
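To show how the space-separated output feeds a Character Error Rate, here is a minimal, self-contained sketch. The helpers `edit_distance` and `cer` are hypothetical names and not part of Whisper; the sketch assumes both strings have already been normalized with split_letters=True, so that whitespace-separated tokens correspond to graphemes.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (rolling row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref_split: str, hyp_split: str) -> float:
    """Character error rate over space-separated grapheme strings."""
    ref = ref_split.split()
    hyp = hyp_split.split()
    if not ref:
        raise ValueError("reference must contain at least one grapheme")
    return edit_distance(ref, hyp) / len(ref)


print(cer("h e l l o", "h a l l o"))  # 0.2
```

Because the tokens are obtained with `str.split()`, spaces between words do not count as characters in this sketch; an evaluation that scores word boundaries would need a different tokenization.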