Principle:Speechbrain Speechbrain WER CER Evaluation For Whisper
| Field | Value |
|---|---|
| Concept | Evaluating Whisper fine-tuning results using word and character error rates with Whisper-specific text normalization |
| Domains | Evaluation_Metrics, ASR |
| Related Implementation | Implementation:Speechbrain_Speechbrain_ErrorRateStats_For_Whisper |
Overview
Word Error Rate (WER) and Character Error Rate (CER) are the standard metrics for evaluating automatic speech recognition systems. For Whisper fine-tuning, evaluation requires careful integration with Whisper's built-in text normalizer to ensure fair and reproducible comparisons with published results.
Error Rate Computation
Both WER and CER are computed as the edit distance (Levenshtein distance) between the hypothesis (model output) and the reference (ground truth), normalized by the reference length:
Error Rate = (Substitutions + Insertions + Deletions) / Number of Reference Tokens
- WER: Tokens are words (space-delimited). Measures how well the model captures word-level content.
- CER: Tokens are individual characters. Provides a finer-grained measure, especially useful for languages without clear word boundaries or with agglutinative morphology.
The three error types provide diagnostic information:
- Substitutions: A word in the reference is replaced by a different word in the hypothesis. Often indicates acoustic confusions or language model errors.
- Insertions: An extra word appears in the hypothesis. Can indicate hallucination or repeated content.
- Deletions: A word in the reference is missing from the hypothesis. Can indicate truncation or missed content.
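The computation above can be sketched in plain Python. This is an illustrative implementation, not SpeechBrain's own: it aligns reference and hypothesis with Levenshtein dynamic programming, then backtracks through the table to classify each edit as a substitution, insertion, or deletion.

```python
# Illustrative WER sketch (not SpeechBrain's implementation): align a
# reference and a hypothesis via Levenshtein DP, then backtrack to
# count substitutions, insertions, and deletions.
def wer_counts(ref, hyp):
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/substitution
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    # Backtrack to classify each edit operation.
    i, j, subs, ins, dels = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    rate = (subs + ins + dels) / max(n, 1)
    return rate, subs, ins, dels

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
rate, S, I, D = wer_counts(ref, hyp)  # one substitution ("sat"->"sit"), one deletion ("the")
```

For CER the same function applies with character lists instead of word lists.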
Whisper-Specific Text Normalization
Whisper's published results use a specific text normalization pipeline (described in Appendix C of the Whisper paper). This normalization is critical for fair comparison:
- Case folding: All text is lowercased.
- Punctuation removal: Punctuation marks are stripped.
- Number normalization: Numeric expressions are converted to a canonical form.
- Language-specific processing: Rules for handling language-specific text conventions (e.g., contractions in English, accented characters in French).
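To make the effect concrete, here is a deliberately simplified normalizer covering only the first two steps (case folding and punctuation removal) plus whitespace collapsing. It is NOT the real Whisper normalizer, which additionally expands numbers, standardizes contractions, and applies language-specific rules; in practice you should always call `tokenizer.normalize()` rather than roll your own.

```python
import re
import string

# Toy normalizer illustrating case folding + punctuation stripping.
# NOT the real Whisper normalizer (no number/contraction handling).
def simple_normalize(text: str) -> str:
    text = text.lower()  # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

simple_normalize("Hello, World!")  # -> "hello world"
```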
In SpeechBrain, this normalization is applied via tokenizer.normalize(text) when the normalized_transcripts hyperparameter is enabled. Both the predicted words and the target words must undergo the same normalization before error computation:
```python
# Normalize predictions
predicted_words = [
    tokenizer.normalize(text).split(" ")
    for text in predicted_words
]
# Normalize targets
target_words = [
    tokenizer.normalize(text).split(" ")
    for text in target_words
]
```
Evaluation Pipeline
The evaluation pipeline in the Whisper fine-tuning recipe follows these steps:
- Decode hypotheses: Token IDs from beam search or greedy search are decoded to text using tokenizer.decode(t, skip_special_tokens=True), which removes all special tokens (language, task, timestamps) and converts BPE tokens back to text.
- Decode references: Target token IDs are unpadded and decoded using tokenizer.batch_decode(target_words, skip_special_tokens=True).
- Normalize: Both hypotheses and references are normalized using tokenizer.normalize() if normalized_transcripts is enabled.
- Split to words: Normalized text strings are split on spaces to produce word lists.
- Compute metrics: Word lists are passed to ErrorRateStats for WER and CER computation.
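The steps above can be sketched end to end. The `FakeTokenizer` below is a toy stand-in for the real Whisper tokenizer (its `decode` and `normalize` are deliberately trivial), so the focus is the data flow of steps 1-4, not the actual API:

```python
# Toy end-to-end sketch of the decode -> normalize -> split flow.
# FakeTokenizer stands in for the real Whisper tokenizer; only the
# shape of the pipeline mirrors the recipe.
class FakeTokenizer:
    vocab = {0: "<|sot|>", 1: "Hello", 2: "world", 3: "<|eot|>"}
    special = {0, 3}

    def decode(self, ids, skip_special_tokens=True):
        toks = [self.vocab[i] for i in ids
                if not (skip_special_tokens and i in self.special)]
        return " ".join(toks)

    def normalize(self, text):
        return text.lower().strip()  # toy normalization

tokenizer = FakeTokenizer()

hyp_ids = [[0, 1, 2, 3]]  # batch of hypothesis token IDs
ref_ids = [[0, 1, 3]]     # batch of reference token IDs

# Steps 1-2: decode hypotheses and references to text
hyps = [tokenizer.decode(t, skip_special_tokens=True) for t in hyp_ids]
refs = [tokenizer.decode(t, skip_special_tokens=True) for t in ref_ids]

# Steps 3-4: normalize, then split on spaces into word lists
predicted_words = [tokenizer.normalize(t).split(" ") for t in hyps]
target_words = [tokenizer.normalize(t).split(" ") for t in refs]
# Step 5 would pass predicted_words and target_words to ErrorRateStats.
```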
Checkpointing by WER
The Whisper fine-tuning recipe uses WER as the primary metric for model selection:
```python
checkpointer.save_and_keep_only(
    meta={"WER": stage_stats["WER"]},
    min_keys=["WER"],
)
```
The min_keys=["WER"] argument ensures that only the checkpoint with the lowest validation WER is retained. During test evaluation, this best checkpoint is loaded via min_key="WER".
WER vs. CER
Both metrics are tracked during evaluation:
- WER (via ErrorRateStats()): Primary metric. Used for checkpointing and model selection.
- CER (via ErrorRateStats(split_tokens=True)): Secondary metric. The split_tokens=True parameter causes each word to be split into individual characters before computing the edit distance.
CER is particularly informative for:
- Languages with complex morphology where a single character substitution changes the word.
- Evaluating whether errors are minor (single character) or catastrophic (entire word wrong).
- Languages without space-delimited words (e.g., Chinese, Japanese) where WER may not be meaningful.
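A small self-contained comparison shows why the two metrics diverge. Here a single-character confusion ("weather" vs. "whether") counts as one full word error under WER but only two character edits under CER. The character tokenization below (which keeps spaces) is a simplification; SpeechBrain's `split_tokens=True` handles character splitting internally.

```python
# Illustrative WER-vs-CER comparison: one misspelled word is a whole
# word error for WER but only a couple of character edits for CER.
def edit_distance(a, b):
    # Space-efficient Levenshtein over any pair of sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

ref, hyp = "the weather is nice", "the whether is nice"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # 1/4 = 0.25
cer = edit_distance(list(ref), list(hyp)) / len(list(ref))        # 2/19, about 0.105
```

The WER of 0.25 suggests a quarter of the content is wrong; the much lower CER reveals the error is a minor spelling-level confusion.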