Principle:Speechbrain Speechbrain ASR Evaluation With WER
| Field | Value |
|---|---|
| Principle Name | ASR_Evaluation_With_WER |
| Description | Evaluating ASR systems using Word Error Rate and Character Error Rate metrics |
| Domains | ASR, Evaluation_Metrics |
| Knowledge Sources | Standard ASR evaluation literature |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Brain_Evaluate_With_ErrorRateStats |
Overview
Word Error Rate (WER) and Character Error Rate (CER) are the standard metrics for evaluating Automatic Speech Recognition systems. They measure the minimum edit distance between the system's hypothesis transcription and the correct reference transcription. In SpeechBrain, evaluation is tightly integrated into the training loop through the Brain.evaluate() method, which loads the best checkpoint, runs inference on the test set, and computes these metrics using the ErrorRateStats class.
Mathematical Foundation
Word Error Rate (WER)
WER computes the minimum number of word-level edits needed to transform the hypothesis into the reference, normalized by the number of words in the reference:
WER = (S + D + I) / N * 100%
Where:
- S = number of substitutions (a word in the reference is replaced by a different word)
- D = number of deletions (a word in the reference is missing from the hypothesis)
- I = number of insertions (an extra word appears in the hypothesis that is not in the reference)
- N = total number of words in the reference transcription
The minimum edit distance is computed using the Levenshtein distance algorithm (dynamic programming), which finds the optimal alignment between the hypothesis and reference word sequences.
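The formula and the dynamic-programming step above can be sketched in plain Python. This is a minimal illustration, not SpeechBrain's implementation; the names `edit_distance` and `wer` are chosen here for the example, and whitespace tokenization is an assumption.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions between two token lists."""
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER in percent for one reference/hypothesis pair given as plain strings."""
    ref_words = reference.split()  # assumption: words are whitespace-separated
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


score = wer("THE CAT SAT ON THE MAT", "THE CAT SET ON A MAT")
print(f"WER: {score:.2f}%")
```

Because the denominator is the reference length, a hypothesis with many insertions can push WER above 100%.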
Character Error Rate (CER)
CER applies the same formula but at the character level instead of the word level. This is particularly useful for:
- Languages without clear word boundaries (e.g., Chinese, Japanese)
- Character-level ASR models
- Understanding whether errors are minor misspellings or completely wrong words
In SpeechBrain, CER is computed by the same ErrorRateStats class with split_tokens=True, which splits word-level predictions into individual characters before computing the edit distance.
Relationship Between WER and CER
CER is typically lower than WER because any character error inside a word makes the whole word count as wrong, while the same error is only one edit among many reference characters. For example, transcribing "cat" as "cats" is one word substitution, so the whole word is wrong (100% WER on that word), but only one character insertion against three reference characters (about 33% CER).
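The difference can be checked numerically with a small sketch that scores the same sentence pair at both granularities. The helper names are illustrative, and treating spaces as ordinary characters is an assumption made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length, in percent."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = "the cat sat"
hypothesis = "the cats sat"

# Word level: "cat" -> "cats" is one substitution out of three words.
word_error_rate = error_rate(reference.split(), hypothesis.split())
# Character level: the same mistake is one inserted "s" out of eleven characters.
char_error_rate = error_rate(list(reference), list(hypothesis))

print(f"WER: {word_error_rate:.2f}%  CER: {char_error_rate:.2f}%")
```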
Evaluation Workflow in SpeechBrain
The evaluation process follows these steps:
1. Checkpoint Loading
When Brain.evaluate() is called with min_key="WER", it passes that key on to on_evaluate_start(min_key="WER"), which instructs the checkpointer to load the checkpoint with the lowest recorded WER. This ensures evaluation uses the best model seen during training, not necessarily the last one.
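The selection rule itself is simple and can be sketched without SpeechBrain. The record layout, the function name `find_best_checkpoint`, and the WER values below are all made up for illustration; they are not the checkpointer's actual data structures.

```python
# Hypothetical checkpoint records: a name plus the metadata stored with it.
saved = [
    {"name": "epoch_1", "meta": {"WER": 21.7}},
    {"name": "epoch_2", "meta": {"WER": 14.2}},
    {"name": "epoch_3", "meta": {"WER": 15.9}},
]


def find_best_checkpoint(checkpoints, min_key=None, max_key=None):
    """Return the record with the lowest min_key (or highest max_key) value."""
    if (min_key is None) == (max_key is None):
        raise ValueError("pass exactly one of min_key or max_key")
    key, pick = (min_key, min) if min_key is not None else (max_key, max)
    # Ignore checkpoints that were saved without the requested metric.
    candidates = [c for c in checkpoints if key in c["meta"]]
    if not candidates:
        return None
    return pick(candidates, key=lambda c: c["meta"][key])


best = find_best_checkpoint(saved, min_key="WER")
print(best["name"])
```

In this made-up run the last epoch is not the best one, which is the situation the min_key mechanism is meant to handle.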
2. Stage Initialization
on_stage_start(TEST, epoch=None) initializes fresh ErrorRateStats instances for both WER and CER tracking. This clears any accumulated statistics from previous evaluations.
3. Batch-Level Evaluation
For each test batch:
- compute_forward(batch, TEST) runs the model and applies beam search decoding to produce word-level hypotheses
- compute_objectives(predictions, batch, TEST) computes the CTC loss and calls ErrorRateStats.append() to accumulate per-utterance error statistics
4. Metric Summarization
on_stage_end(TEST, avg_test_loss, None) calls ErrorRateStats.summarize() to aggregate all per-utterance scores into corpus-level statistics including overall WER, total insertions, deletions, and substitutions.
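The accumulate-then-summarize pattern of steps 2 to 4 can be sketched with a small stand-in class. `SimpleErrorStats` and `count_edits` are hypothetical names for this example; the class only mimics the append/summarize flow described above and is not SpeechBrain's ErrorRateStats.

```python
def count_edits(ref, hyp):
    """Return (substitutions, deletions, insertions) of one minimum-cost alignment."""
    # Each cell is (total_edits, subs, dels, ins) for a reference prefix vs. a hypothesis prefix.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [(i, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            t, s, d, n = prev[j - 1]
            diag = (t, s, d, n) if r == h else (t + 1, s + 1, d, n)
            t, s, d, n = prev[j]
            up = (t + 1, s, d + 1, n)      # reference word deleted
            t, s, d, n = curr[j - 1]
            left = (t + 1, s, d, n + 1)    # hypothesis word inserted
            curr.append(min(diag, up, left, key=lambda cell: cell[0]))
        prev = curr
    return prev[-1][1:]


class SimpleErrorStats:
    """Hypothetical accumulator: a new instance starts with empty statistics."""

    def __init__(self):
        self.scores = []  # one entry per utterance

    def append(self, ids, hyps, refs):
        """Score one batch and store the per-utterance counts."""
        for utt_id, hyp, ref in zip(ids, hyps, refs):
            subs, dels, ins = count_edits(ref, hyp)
            self.scores.append({"key": utt_id, "S": subs, "D": dels, "I": ins, "N": len(ref)})

    def summarize(self):
        """Aggregate per-utterance counts into corpus-level totals."""
        totals = {k: sum(score[k] for score in self.scores) for k in ("S", "D", "I", "N")}
        totals["WER"] = 100.0 * (totals["S"] + totals["D"] + totals["I"]) / totals["N"]
        return totals


stats = SimpleErrorStats()
stats.append(["utt1"], ["THE CAT SET ON A MAT".split()], ["THE CAT SAT ON THE MAT".split()])
stats.append(["utt2"], [["HELLO"]], [["HELLO", "WORLD"]])
summary = stats.summarize()
print(summary)
```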
5. Detailed Output
ErrorRateStats.write_stats() produces a detailed report including:
- Corpus-level WER summary
- Per-utterance alignments showing exactly where each error occurred
- Breakdown of error types (substitutions, deletions, insertions)
Per-Utterance vs. Corpus-Level Metrics
SpeechBrain computes WER at two granularities:
- Per-utterance WER -- the edit distance for each individual utterance, useful for identifying problematic examples
- Corpus-level WER -- the total edits across all utterances divided by the total reference words, providing the overall system performance metric
The corpus-level metric is used for:
- Checkpoint selection during training (via min_keys=["WER"])
- Final system evaluation on the test set
- Comparison with published results
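The two granularities give different numbers when utterance lengths vary, because the corpus-level figure weights each utterance by its reference length. A small sketch with made-up counts (the utterance names and values are illustrative only):

```python
# (edit_count, reference_word_count) per utterance; values invented for illustration.
utterance_counts = {
    "utt_short": (1, 2),    # 1 error in a 2-word reference
    "utt_long": (1, 18),    # 1 error in an 18-word reference
}

# Per-utterance WER: each utterance is normalized by its own length.
per_utterance_wer = {
    utt: 100.0 * edits / n_words for utt, (edits, n_words) in utterance_counts.items()
}

# Corpus-level WER: sum the edits and the reference words first, then divide once.
total_edits = sum(edits for edits, _ in utterance_counts.values())
total_words = sum(n_words for _, n_words in utterance_counts.values())
corpus_wer = 100.0 * total_edits / total_words

# Averaging the per-utterance values is a different quantity.
mean_of_per_utterance = sum(per_utterance_wer.values()) / len(per_utterance_wer)

print(f"corpus-level: {corpus_wer:.2f}%  mean of per-utterance: {mean_of_per_utterance:.2f}%")
```

Here the short utterance dominates the simple average, while the corpus-level figure reflects two errors in twenty words.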
Decoding Strategies and WER
The choice of decoding strategy affects WER:
- Greedy decoding (used during validation) -- selects the most probable token at each time step; fast but suboptimal
- Beam search (used during testing) -- explores multiple hypotheses simultaneously and can incorporate language model scores; slower but typically yields lower WER
- Language model fusion -- optionally incorporates an n-gram language model during beam search to further reduce WER
In the CTC ASR recipe, greedy decoding is used during validation (for speed) and beam search with configurable beam size is used during final testing (for accuracy).
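Greedy decoding for a CTC model can be sketched in a few lines: take the best token per frame, merge consecutive repeats, then drop blanks. The function name `greedy_ctc_sketch`, the blank index of 0, and the toy scores are assumptions for this example; it is not the recipe's decoder.

```python
def greedy_ctc_sketch(frame_scores, blank_index=0):
    """Best token per frame, with repeated tokens merged and blanks removed."""
    # Most probable token index at each time step.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = None
    for token in best_path:
        # Keep a token only when it starts a new run and is not the blank symbol.
        if token != previous and token != blank_index:
            decoded.append(token)
        previous = token
    return decoded


# Toy posteriors over 3 classes (0 = blank) for 7 frames; values are made up.
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.30, 0.10, 0.60],
    [0.80, 0.10, 0.10],
]
tokens = greedy_ctc_sketch(frames)
print(tokens)
```

The blank between the two runs of token 1 is what lets the same token appear twice in the output. Beam search would instead keep several candidate paths per frame, which is why it costs more but can find a better-scoring transcription.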
Alignment Visualization
The write_stats() method outputs detailed alignment information for each utterance. A simplified illustration of the kind of information it contains:
utterance_id: utt001
REF: THE CAT SAT ON THE MAT
HYP: THE CAT SET ON A MAT
EVAL: = = S = S =
Errors: 2 substitutions, 0 deletions, 0 insertions
WER: 33.33%
This view supports error analysis by showing exactly which words were confused; confusions that recur across utterances point to systematic failure patterns.
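An alignment like the one above can be recovered by backtracking through the edit-distance table. The following is a minimal sketch under that assumption; the function name `align` and the `<eps>` placeholder for an empty slot are choices made for this example, and the printed layout is not write_stats() output.

```python
def align(ref, hyp):
    """Return a list of (op, ref_word, hyp_word) for one minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every reference word
    for j in range(cols):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner, preferring match/substitution.
    ops = []
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            ops.append(("S" if cost else "=", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("D", ref[i - 1], "<eps>"))
            i -= 1
        else:
            ops.append(("I", "<eps>", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


alignment = align("THE CAT SAT ON THE MAT".split(), "THE CAT SET ON A MAT".split())
for label, row in (("REF", 1), ("HYP", 2), ("EVAL", 0)):
    print(f"{label:>4}: " + " ".join(f"{step[row]:<5}" for step in alignment))
```

When several alignments have the same minimum cost, this sketch returns only one of them; the error total is the same either way, but the split between error types can differ.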
Related Concepts
- Implementation:Speechbrain_Speechbrain_Brain_Evaluate_With_ErrorRateStats -- the concrete implementation of evaluation with ErrorRateStats
- WER is the primary metric used for checkpoint selection during CTC training (via min_keys=["WER"])
- CER provides complementary information about the nature and severity of recognition errors