Implementation:Openai Whisper EnglishTextNormalizer

Knowledge Sources	Openai_Whisper
Domains	NLP, Text_Normalization, Evaluation
Last Updated	2026-02-13 22:00 GMT

Overview

EnglishTextNormalizer is the English-specific text normalizer in the Whisper normalizers module. It combines contraction expansion, number standardization, spelling normalization, and symbol cleanup.

Description

The english.py module provides three classes for English text normalization:

EnglishNumberNormalizer: Converts spelled-out numbers to Arabic numerals. Handles ordinals ("twenty first" → "21st"), currency symbols ("$20 million" → "20000000 dollars"), decimals, compound numbers, and special cases like "double oh seven" → "007".

EnglishSpellingNormalizer: Loads British-to-American spelling mappings from english.json and applies word-level replacement (e.g., "colour" → "color").

EnglishTextNormalizer: The main user-facing class that orchestrates the full normalization pipeline: lowercasing, bracket/parenthesis removal, filler word removal (hmm, uh, um), contraction expansion (won't → will not, can't → can not), title normalization (mr → mister), comma removal from numbers, symbol cleanup, number standardization, and spelling normalization.
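
The ordering of the pipeline steps can be illustrated with a small, self-contained sketch. This is not the library code: the function name `sketch_normalize`, the three tables, and the crude symbol-cleanup regex are assumptions made for illustration, and number standardization is left out entirely. The real module ships far larger contraction and spelling tables.

```python
import re

# Hypothetical, tiny stand-ins for the real tables (assumptions, not library data).
FILLERS = r"\b(hmm|uh|um)\b"
CONTRACTIONS = {
    # Specific patterns come first so they win over the generic "n't" rule.
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"\bmr\b": "mister ",
    r"n't\b": " not",
}
SPELLINGS = {"colour": "color", "colours": "colors", "analyse": "analyze"}


def sketch_normalize(s: str) -> str:
    """Simplified illustration of the step order; skips number standardization."""
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop <...> and [...] spans
    s = re.sub(r"\(([^)]+?)\)", "", s)        # drop (...) spans
    s = re.sub(FILLERS, "", s)                # drop filler words
    for pattern, replacement in CONTRACTIONS.items():
        s = re.sub(pattern, replacement, s)   # expand contractions and titles
    s = re.sub(r"(\d),(\d)", r"\1\2", s)      # remove commas between digits
    s = re.sub(r"[^\w\s']", " ", s)           # crude symbol cleanup (assumption)
    # Word-level spelling replacement; split/join also collapses whitespace.
    return " ".join(SPELLINGS.get(word, word) for word in s.split())


print(sketch_normalize("Um, I won't analyse 1,000 colours [laughs], Mr. Smith"))
# i will not analyze 1000 colors mister smith
```

Order matters here: lowercasing first lets every later pattern be written in lowercase only, and specific contractions such as "won't" must be expanded before the generic "n't" rule would turn them into "wo not".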

Usage

Import EnglishTextNormalizer when evaluating Whisper transcription output against English reference transcripts. It standardizes both hypothesis and reference text so that superficial differences (contractions, spelling variants, number formats) do not inflate Word Error Rate.

Code Reference

Source Location

Repository: Openai_Whisper
File: whisper/normalizers/english.py
Lines: 1-550

Signature

from typing import Iterator, List

class EnglishNumberNormalizer:
    """Convert any spelled-out numbers into arabic numbers."""
    def __init__(self): ...
    def process_words(self, words: List[str]) -> Iterator[str]: ...
    def preprocess(self, s: str) -> str: ...
    def postprocess(self, s: str) -> str: ...
    def __call__(self, s: str) -> str: ...

class EnglishSpellingNormalizer:
    """Applies British-American spelling mappings."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...

class EnglishTextNormalizer:
    """Full English text normalization pipeline."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...

Import

from whisper.normalizers import EnglishTextNormalizer
from whisper.normalizers.english import EnglishNumberNormalizer, EnglishSpellingNormalizer

I/O Contract

Inputs

Name	Type	Required	Description
s	str	Yes	Raw English text to normalize (for all three classes' __call__ methods)

Outputs

Name	Type	Description
EnglishNumberNormalizer.__call__ returns	str	Text with spelled-out numbers converted to Arabic numerals
EnglishSpellingNormalizer.__call__ returns	str	Text with British spellings replaced by American equivalents
EnglishTextNormalizer.__call__ returns	str	Fully normalized English text (lowercased, contractions expanded, numbers standardized, spellings unified)

Usage Examples

Basic Text Normalization

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Contractions, British spellings, and spelled-out numbers are handled in one call
text = normalizer("I won't analyse twenty-three colours, Mr. Smith")
print(text)
# Output: "i will not analyze 23 colors mister smith"

Number Normalization Only

from whisper.normalizers.english import EnglishNumberNormalizer

num_normalizer = EnglishNumberNormalizer()

# Spelled-out numbers to digits
print(num_normalizer("twenty one thousand five hundred"))
# Output: "21500"

# Currency handling
print(num_normalizer("twenty dollars and fifty cents"))
# Output: "$20.50"

# Ordinals
print(num_normalizer("the thirty second"))
# Output: "the 32nd"
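
The currency example suggests that a dollar amount and a following cents amount end up merged into a single decimal value. A minimal sketch of that idea follows, assuming the intermediate text already carries symbol prefixes such as "$20 and ¢50". The function name `combine_currency_cents` is hypothetical and the pattern is an approximation for illustration, not a copy of the library's post-processing.

```python
import re


def combine_currency_cents(s: str) -> str:
    """Merge '<symbol><int> [and ]¢<cents>' into '<symbol><int>.<cc>' (illustrative)."""

    def _join(m):
        currency, integer, cents = m.group(1), m.group(2), int(m.group(3))
        # Zero-pad so that 7 cents becomes ".07" rather than ".7".
        return f"{currency}{integer}.{cents:02d}"

    return re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", _join, s)


print(combine_currency_cents("$20 and ¢50"))  # $20.50
print(combine_currency_cents("$2 and ¢7"))    # $2.07
```

Text without a cents component passes through unchanged, so the helper is safe to run on every string.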

WER Evaluation

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "The colour was analysed by Dr. Smith, who said it won't work"
hypothesis = "the color was analyzed by doctor smith who said it will not work"

# After normalization, both strings should match, so this pair contributes no word errors
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)
# ref_norm: "the color was analyzed by doctor smith who said it will not work"
# hyp_norm: "the color was analyzed by doctor smith who said it will not work"
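
To turn normalized strings into a score, a word-level edit distance can be divided by the reference length. The sketch below is a plain-Python illustration with a hypothetical function name, `word_error_rate`; it is not part of Whisper, and a dedicated package such as jiwer is a common choice in practice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Without normalization, spelling variants count as substitutions.
print(word_error_rate("the colour was analysed", "the color was analyzed"))  # 0.5
print(word_error_rate("the color was analyzed", "the color was analyzed"))   # 0.0
```

Passing `ref_norm` and `hyp_norm` from the example above to this function should give 0.0 whenever the two normalized strings are identical.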
