Principle:Openai Whisper English Text Normalization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Normalization, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
English text normalization is a rule-based technique that canonicalizes English transcripts by expanding contractions, standardizing number formats, and unifying British and American spelling variants, so that Word Error Rate (WER) is computed fairly.
Description
English Text Normalization addresses the problem that raw speech transcripts contain many surface-level variations that are semantically equivalent but textually different. Without normalization, Word Error Rate (WER) metrics would penalize a model for producing "color" when the reference says "colour", or "21" when the reference says "twenty one".
The normalization pipeline applies a series of rule-based transformations:
- Lowercasing: Converts all text to lowercase for case-insensitive comparison.
- Bracket/Parenthesis Removal: Strips non-verbal annotations like stage directions.
- Filler Word Removal: Removes speech disfluencies (hmm, uh, um, mm).
- Contraction Expansion: Expands contractions to their full forms (won't → will not, can't → can not).
- Title Normalization: Expands abbreviations (mr → mister, dr → doctor).
- Number Standardization: Converts spelled-out numbers to Arabic numerals using a finite-state parser that handles ordinals, currency, decimals, and compound numbers.
- Spelling Normalization: Maps British English spellings to American English equivalents using a lookup dictionary.
- Symbol Cleanup: Removes currency symbols and percent signs that are not attached to a number, and collapses extra whitespace.
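The first few steps of the list above can be sketched with plain regular expressions. This is a minimal illustration, not the Whisper implementation: the function name `normalize_sketch` and the tiny lookup tables are assumptions made for the example, and the number and spelling steps are left out.

```python
import re

# Tiny illustrative tables (hypothetical); a real normalizer uses far larger ones.
FILLERS = r"\b(hmm|mm|uh|um)\b"
CONTRACTIONS = {
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"n't\b": " not",  # generic fallback, applied after the special cases
}
TITLES = {r"\bmr\b": "mister", r"\bdr\b": "doctor"}


def normalize_sketch(text: str) -> str:
    """Apply lowercasing, annotation/filler removal, and simple expansions."""
    s = text.lower()
    s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # drop [bracketed] and <angle> annotations
    s = re.sub(r"\(([^)]+?)\)", "", s)       # drop (parenthesized) annotations
    s = re.sub(FILLERS, "", s)               # drop filler words
    for pattern, replacement in CONTRACTIONS.items():
        s = re.sub(pattern, replacement, s)
    for pattern, replacement in TITLES.items():
        s = re.sub(pattern, replacement, s)
    s = re.sub(r"[^\w\s]", " ", s)           # replace remaining punctuation with spaces
    return re.sub(r"\s+", " ", s).strip()    # collapse whitespace


print(normalize_sketch("Mr. Smith won't, um, go [laughs]"))
```

Contractions are expanded before punctuation is stripped, because the apostrophe is what identifies them; reversing the order would leave tokens such as "won t" behind.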
Usage
Use this principle when computing WER or other text-similarity metrics for English speech recognition evaluation. Apply the same normalizer to both the reference transcript and the model hypothesis to ensure that only genuine recognition errors are counted.
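As a rough illustration of why both sides must pass through the same normalizer, the sketch below computes WER from a word-level edit distance. The helper names (`edit_distance`, `wer`, `toy_normalize`) are hypothetical, and `toy_normalize` is only a stand-in for a full English normalizer.

```python
from typing import Callable, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str, normalize: Callable[[str], str]) -> float:
    """WER after applying the same normalizer to reference and hypothesis."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def toy_normalize(text: str) -> str:
    # Hypothetical stand-in: lowercase plus a single spelling mapping.
    return text.lower().replace("colour", "color")


print(wer("The colour red", "the color red", str))            # no normalization
print(wer("The colour red", "the color red", toy_normalize))  # shared normalizer
```

Without normalization, two of the three words count as errors even though the transcripts say the same thing; with the shared normalizer the score drops to zero. If the `openai-whisper` package is installed, an instance of its `EnglishTextNormalizer` (in `whisper.normalizers`) should be usable as the `normalize` argument, since instances are callable on a string.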
Theoretical Basis
The normalization follows a deterministic rule-based approach rather than a learned model. The key insight is that ASR evaluation should measure semantic accuracy, not stylistic choices.
Number Parsing Algorithm:
The number normalizer uses a streaming finite-state parser that processes words left-to-right, accumulating numeric values:
    # Abstract algorithm (not actual implementation)
    for word in words:
        if word is a digit name (one, two, ...):
            accumulate into current value
        elif word is a multiplier (hundred, thousand, ...):
            multiply accumulated value by multiplier
        elif word is a prefix (minus, dollar, ...):
            set prefix for next number
        elif word is a suffix (percent, ...):
            append suffix symbol to current number
        else:
            yield accumulated number, start new accumulation
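A runnable, heavily simplified version of the abstract algorithm might look like the sketch below. It is an assumption-laden illustration rather than the Whisper code: the function name `normalize_numbers` and all tables are hypothetical, only "minus" and "percent" are handled as prefix and suffix, and ordinals, currency, and decimals are not covered.

```python
UNITS = {name: i for i, name in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {name: 10 * i for i, name in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
PREFIXES = {"minus": "-"}     # hypothetical subset
SUFFIXES = {"percent": "%"}   # hypothetical subset


def normalize_numbers(text: str) -> str:
    """Replace spelled-out cardinals in lowercase text with digits."""
    out = []
    total, current = 0, 0     # finished groups / group under construction
    active = False            # True while inside a number
    last = ""                 # kind of the previous number word
    prefix_word = None        # pending prefix such as "minus"

    def flush() -> None:
        nonlocal total, current, active, last, prefix_word
        if active:
            sign = PREFIXES[prefix_word] if prefix_word else ""
            out.append(f"{sign}{total + current}")
        elif prefix_word:
            out.append(prefix_word)  # prefix was not followed by a number
        total, current, active, last, prefix_word = 0, 0, False, "", None

    for word in text.split():
        if word in UNITS or word in TENS:
            value = UNITS[word] if word in UNITS else TENS[word]
            kind = "tens" if word in TENS else "unit"
            # "twenty one" or "hundred five" continue a number; "one two" does not.
            fits = last == "mult" or (last == "tens" and kind == "unit" and 0 < value < 10)
            if active and not fits:
                flush()
            current += value
            active, last = True, kind
        elif word in MULTIPLIERS and active:
            if MULTIPLIERS[word] == 100:
                current *= 100
            else:
                total += current * MULTIPLIERS[word]
                current = 0
            last = "mult"
        elif word in PREFIXES:
            flush()
            prefix_word = word
        elif word in SUFFIXES and active:
            flush()
            out[-1] += SUFFIXES[word]
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)


print(normalize_numbers("i paid two thousand five hundred"))
print(normalize_numbers("minus five percent growth"))
```

The parser keeps only a handful of state variables and never looks back at earlier words, which is what makes the approach streaming: each word either extends the number being built or forces it to be emitted.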
Spelling Normalization:
The spelling step uses a simple dictionary lookup with approximately 1741 British-to-American mappings derived from systematic spelling differences (-ise/-ize, -our/-or, -re/-er, etc.).
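The lookup itself can be sketched in a few lines. The four entries below are a hypothetical excerpt chosen to show the patterns named above, not the actual mapping table, and `normalize_spelling` is an illustrative name.

```python
# Hypothetical excerpt; the full table described above has far more entries.
BRITISH_TO_AMERICAN = {
    "colour": "color",       # -our / -or
    "organise": "organize",  # -ise / -ize
    "centre": "center",      # -re / -er
    "travelled": "traveled",
}


def normalize_spelling(text: str) -> str:
    """Replace each whitespace-separated word that has a mapping; keep the rest."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in text.split())


print(normalize_spelling("we organise the colour centre"))
```

A whole-word dictionary avoids the false positives that suffix rules would produce: a blanket "-ise" to "-ize" rewrite, for example, would wrongly change "exercise" or "promise".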