Implementation:Openai Whisper EnglishTextNormalizer

From Leeroopedia
Knowledge Sources
Domains NLP, Text_Normalization, Evaluation
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool for English-specific text normalization provided by the Whisper normalizers module, combining contraction expansion, number standardization, spelling normalization, and symbol cleanup.

Description

The english.py module provides three classes for English text normalization:

  • EnglishNumberNormalizer: Converts spelled-out numbers to Arabic numerals. Handles ordinals ("twenty first" → "21st"), currency amounts (the class docstring gives "$20 million" → "20000000 dollars"), decimals, compound numbers, and special cases such as "double oh seven" → "007".
  • EnglishSpellingNormalizer: Loads British-to-American spelling mappings from english.json and applies word-level replacement (e.g., "colour" → "color").
  • EnglishTextNormalizer: The main user-facing class that orchestrates the full normalization pipeline: lowercasing, bracket/parenthesis removal, filler word removal (hmm, uh, um), contraction expansion (won't → will not, can't → can not), title normalization (mr → mister), comma removal from numbers, symbol cleanup, number standardization, and spelling normalization.
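
The ordering of the pipeline steps matters: specific contractions such as "won't" have to be rewritten before the generic "n't" rule, and titles before stray periods are stripped. The following is a minimal sketch of that ordering using only the standard library. It is not the module's code: `sketch_normalize`, `FILLERS`, and `REPLACERS` are hypothetical names, the rule tables are cut down to a few entries, the symbol cleanup is a crude regex, the final `strip()` is a convenience of the sketch, and number and spelling normalization are left out.

```python
import re

# Hypothetical, heavily reduced rule tables; the real module defines many more.
FILLERS = r"\b(hmm|uh|um)\b"
REPLACERS = {
    r"\bwon't\b": "will not",   # specific contractions come first
    r"\bcan't\b": "can not",
    r"\bmr\b": "mister ",       # title expansion
    r"n't\b": " not",           # generic contraction rule comes last
}


def sketch_normalize(s: str) -> str:
    """Apply a reduced version of the pipeline steps in the order described."""
    s = s.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)    # drop text in <...> or [...]
    s = re.sub(r"\(([^)]+?)\)", "", s)          # drop text in (...)
    s = re.sub(FILLERS, "", s)                  # drop filler words
    for pattern, replacement in REPLACERS.items():
        s = re.sub(pattern, replacement, s)     # contractions and titles
    s = re.sub(r"(\d),(\d)", r"\1\2", s)        # 1,000 -> 1000
    s = re.sub(r"[^\w\s]", " ", s)              # crude symbol cleanup (assumption)
    return re.sub(r"\s+", " ", s).strip()       # collapse whitespace


print(sketch_normalize("Um, Mr. Smith won't pay 1,000 [laughs] (sigh)"))
# prints: mister smith will not pay 1000
```

Because Python dictionaries preserve insertion order, the rules in the sketch are applied exactly in the order they are written.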

Usage

Import EnglishTextNormalizer when evaluating Whisper transcription output against English reference transcripts. It standardizes both hypothesis and reference text so that superficial differences (contractions, spelling variants, number formats) do not inflate Word Error Rate.

Code Reference

Source Location

Signature

from typing import Iterator, List

class EnglishNumberNormalizer:
    """Convert any spelled-out numbers into arabic numbers."""
    def __init__(self): ...
    def process_words(self, words: List[str]) -> Iterator[str]: ...
    def preprocess(self, s: str) -> str: ...
    def postprocess(self, s: str) -> str: ...
    def __call__(self, s: str) -> str: ...

class EnglishSpellingNormalizer:
    """Applies British-American spelling mappings."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...

class EnglishTextNormalizer:
    """Full English text normalization pipeline."""
    def __init__(self): ...
    def __call__(self, s: str) -> str: ...

Import

from whisper.normalizers import EnglishTextNormalizer
from whisper.normalizers.english import EnglishNumberNormalizer, EnglishSpellingNormalizer

I/O Contract

Inputs

Name | Type | Required | Description
s | str | Yes | Raw English text to normalize (the argument to each of the three classes' __call__ methods)

Outputs

Name | Type | Description
EnglishNumberNormalizer.__call__ returns | str | Text with spelled-out numbers converted to Arabic numerals
EnglishSpellingNormalizer.__call__ returns | str | Text with British spellings replaced by American equivalents
EnglishTextNormalizer.__call__ returns | str | Fully normalized English text (lowercased, contractions expanded, numbers standardized, spellings unified)
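
The word-level replacement behind the spelling contract can be illustrated with a plain dictionary lookup. This is a sketch under stated assumptions: `BRITISH_TO_AMERICAN` is a hypothetical three-entry stand-in for the mappings loaded from english.json, and `sketch_spelling` is an illustrative name, not part of the library.

```python
# Hypothetical three-entry mapping standing in for the contents of english.json.
BRITISH_TO_AMERICAN = {"colour": "color", "analyse": "analyze", "centre": "center"}


def sketch_spelling(s: str) -> str:
    """Replace each whitespace-separated word that appears in the mapping."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in s.split())


print(sketch_spelling("the colour of the centre"))  # prints: the color of the center
```

In this sketch the lookup is an exact match on each token, so "Colour" or "colour," would pass through unchanged. That is one reason a pipeline would lowercase and strip punctuation before the spelling step.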

Usage Examples

Basic Text Normalization

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Contractions, British spellings, spelled-out numbers, and titles are all handled
text = normalizer("I won't analyse twenty-three colours, Mr. Smith")
# text: "i will not analyze 23 colors mister smith"

Number Normalization Only

from whisper.normalizers.english import EnglishNumberNormalizer

num_normalizer = EnglishNumberNormalizer()

# Spelled-out numbers to digits
print(num_normalizer("twenty one thousand five hundred"))
# Output: "21500"

# Currency handling
print(num_normalizer("twenty dollars and fifty cents"))
# Output: "$20.50"

# Ordinals
print(num_normalizer("the thirty second"))
# Output: "the 32nd"
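
To give a feel for what converting spelled-out cardinals involves, here is a small accumulate-and-scale converter. It is a common textbook approach, not the logic of `process_words`, and it covers only plain cardinals: no ordinals, currency, decimals, or "double oh seven" style sequences. The names `ONES`, `TENS`, `SCALES`, and `words_to_int` are hypothetical.

```python
ONES = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(phrase: str) -> int:
    """Convert a simple spelled-out cardinal such as 'twenty one thousand' to an int."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word in ONES:
            current += ONES[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100                    # scale the group built so far
        elif word in SCALES:
            total += current * SCALES[word]   # close the group at this scale
            current = 0
        elif word != "and":                   # "and" is skipped; anything else is rejected
            raise ValueError(f"unsupported word: {word!r}")
    return total + current


print(words_to_int("twenty one thousand five hundred"))  # prints: 21500
```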

WER Evaluation

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "The colour was analysed by Dr. Smith, who said it won't work"
hypothesis = "the color was analyzed by doctor smith who said it will not work"

# After normalization, both strings are identical, so they contribute no word errors
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)
# ref_norm: "the color was analyzed by doctor smith who said it will not work"
# hyp_norm: "the color was analyzed by doctor smith who said it will not work"
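
To make the effect on the metric concrete, the sketch below computes Word Error Rate as word-level Levenshtein distance divided by the number of reference words. `word_error_rate` is a hypothetical helper written for illustration, not a Whisper function, and the normalized strings are copied from the comments in the example above rather than produced by calling the library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)             # guard against an empty reference


raw_ref = "The colour was analysed by Dr. Smith, who said it won't work"
raw_hyp = "the color was analyzed by doctor smith who said it will not work"
norm = "the color was analyzed by doctor smith who said it will not work"

print(word_error_rate(raw_ref, raw_hyp))  # non-zero: casing, spelling, punctuation differ
print(word_error_rate(norm, norm))        # prints: 0.0
```

The raw comparison is penalized for casing, punctuation, spelling variants, and the contraction, even though the transcription is correct. None of those penalties remain after normalization.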
