Implementation:Openai Whisper EnglishTextNormalizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Normalization, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Concrete tool for English-specific text normalization provided by the Whisper normalizers module, combining contraction expansion, number standardization, spelling normalization, and symbol cleanup.
Description
The english.py module provides three classes for English text normalization:
- EnglishNumberNormalizer: Converts spelled-out numbers to Arabic numerals. Handles ordinals ("twenty first" → "21st"), currency amounts, which keep or gain a leading currency symbol ("$20 million" → "$20000000"), decimals, compound numbers, and special cases like "double oh seven" → "007".
- EnglishSpellingNormalizer: Loads British-to-American spelling mappings from english.json and applies word-level replacement (e.g., "colour" → "color").
- EnglishTextNormalizer: The main user-facing class. It orchestrates the full normalization pipeline, applied in this order: lowercasing, removal of bracketed and parenthesized text, filler word removal (such as hmm, uh, um), contraction expansion (won't → will not, can't → can not), title normalization (mr → mister), comma removal from numbers, symbol cleanup, number standardization, and spelling normalization.
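The stage ordering described above can be illustrated with a simplified, standard-library-only sketch. Everything in it is an assumption made for illustration: the name toy_normalize, the tiny lookup tables, and the crude symbol cleanup are hypothetical stand-ins, not Whisper's actual code. Number standardization is left out entirely.

```python
import re

# Hypothetical, hand-picked subsets. The real module uses much larger tables
# and loads its spelling map from english.json.
FILLERS = r"\b(hmm|uh|um)\b"
CONTRACTIONS = {r"\bwon't\b": "will not", r"\bcan't\b": "can not"}
TITLES = {r"\bmr\b": "mister ", r"\bdr\b": "doctor "}
SPELLINGS = {"colour": "color", "colours": "colors", "analyse": "analyze"}


def toy_normalize(s: str) -> str:
    """Simplified illustration of the stage ordering; not Whisper's code."""
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop [bracketed] / <tagged> spans
    s = re.sub(r"\(([^)]+?)\)", "", s)       # drop (parenthesized) spans
    s = re.sub(FILLERS, "", s)               # drop filler words
    for pattern, repl in {**CONTRACTIONS, **TITLES}.items():
        s = re.sub(pattern, repl, s)         # expand contractions and titles
    s = re.sub(r"(\d),(\d)", r"\1\2", s)     # 1,000 -> 1000
    s = re.sub(r"[^\w\s]", " ", s)           # crude cleanup (would also break decimals)
    # Word-level spelling replacement; split/join also collapses whitespace.
    return " ".join(SPELLINGS.get(w, w) for w in s.split())


print(toy_normalize("Um, I won't use the colour (laughs) [noise], Mr. Smith"))
# i will not use the color mister smith
```

Contractions are expanded before symbol cleanup on purpose: once apostrophes are stripped, a pattern such as won't can no longer match.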
Usage
Import EnglishTextNormalizer when evaluating Whisper transcription output against English reference transcripts. Apply it to both the hypothesis and the reference text so that superficial differences (contractions, spelling variants, number formats) do not inflate the Word Error Rate (WER).
Code Reference
Source Location
- Repository: Openai_Whisper
- File: whisper/normalizers/english.py
- Lines: 1-550
Signature
class EnglishNumberNormalizer:
    """Convert any spelled-out numbers into arabic numbers."""
    def __init__(self): ...
    def process_words(self, words: List[str]) -> Iterator[str]: ...
    def preprocess(self, s: str) -> str: ...
    def postprocess(self, s: str) -> str: ...
    def __call__(self, s: str) -> str: ...

class EnglishSpellingNormalizer:
    """Applies British-American spelling mappings."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...

class EnglishTextNormalizer:
    """Full English text normalization pipeline."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...
Import
from whisper.normalizers import EnglishTextNormalizer
from whisper.normalizers.english import EnglishNumberNormalizer, EnglishSpellingNormalizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | str | Yes | Raw English text to normalize (for all three classes' __call__ methods) |
Outputs
| Name | Type | Description |
|---|---|---|
| EnglishNumberNormalizer.__call__ returns | str | Text with spelled-out numbers converted to Arabic numerals |
| EnglishSpellingNormalizer.__call__ returns | str | Text with British spellings replaced by American equivalents |
| EnglishTextNormalizer.__call__ returns | str | Fully normalized English text (lowercased, contractions expanded, numbers standardized, spellings unified) |
Usage Examples
Basic Text Normalization
from whisper.normalizers import EnglishTextNormalizer
normalizer = EnglishTextNormalizer()
# Contractions, spelling, numbers all handled
text = normalizer("I won't analyse twenty-three colours, Mr. Smith")
# Output: "i will not analyze 23 colors mister smith"
Number Normalization Only
from whisper.normalizers.english import EnglishNumberNormalizer
num_normalizer = EnglishNumberNormalizer()
# Spelled-out numbers to digits
print(num_normalizer("twenty one thousand five hundred"))
# Output: "21500"
# Currency handling
print(num_normalizer("twenty dollars and fifty cents"))
# Output: "$20.50"
# Ordinals
print(num_normalizer("the thirty second"))
# Output: "the 32nd"
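
To give a feel for how spelled-out cardinals can be folded into digits, here is a minimal sketch that needs no Whisper install. It is not the algorithm used by EnglishNumberNormalizer: the function name toy_number_words and its tables are hypothetical, and it assumes each run of number words forms one well-formed cardinal. It does not handle ordinals, currency, decimals, or digit-by-digit sequences such as "nine one one".

```python
ONES = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def toy_number_words(text: str) -> str:
    """Replace simple runs of cardinal number words with digits (toy version)."""
    out = []
    total, current, active = 0, 0, False

    def flush() -> None:
        # Emit the number collected so far, if any, and reset the state.
        nonlocal total, current, active
        if active:
            out.append(str(total + current))
        total, current, active = 0, 0, False

    for word in text.split():
        if word in ONES:
            current += ONES[word]
            active = True
        elif word in TENS:
            current += TENS[word]
            active = True
        elif word == "hundred" and active:
            current *= 100
        elif word in SCALES and active:
            total += current * SCALES[word]
            current = 0
        else:
            flush()           # a non-number word ends the current run
            out.append(word)
    flush()
    return " ".join(out)


print(toy_number_words("twenty one thousand five hundred"))  # 21500
print(toy_number_words("i have forty two apples"))           # i have 42 apples
```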
WER Evaluation
from whisper.normalizers import EnglishTextNormalizer
normalizer = EnglishTextNormalizer()
reference = "The colour was analysed by Dr. Smith, who said it won't work"
hypothesis = "the color was analyzed by doctor smith who said it will not work"
# After normalization, both strings are identical, so they contribute no word errors
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)
# ref_norm: "the color was analyzed by doctor smith who said it will not work"
# hyp_norm: "the color was analyzed by doctor smith who said it will not work"
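
To show why this matters for scoring, the sketch below computes a word-level error rate with a plain edit-distance recurrence. The function word_error_rate is a hypothetical helper written for this page, not part of Whisper. The strings are hard-coded to mimic text before and after normalization, so the snippet runs without Whisper installed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Without normalization, spelling variants count as substitutions.
print(word_error_rate("the colour was analysed", "the color was analyzed"))  # 0.5
# With both sides normalized to the same spelling, the error disappears.
print(word_error_rate("the color was analyzed", "the color was analyzed"))   # 0.0
```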