Principle:Speechbrain Speechbrain ASR Evaluation With WER

Field	Value
Principle Name	ASR_Evaluation_With_WER
Description	Evaluating ASR systems using Word Error Rate and Character Error Rate metrics
Domains	ASR, Evaluation_Metrics
Knowledge Sources	Standard ASR evaluation literature
Related Implementation	Implementation:Speechbrain_Speechbrain_Brain_Evaluate_With_ErrorRateStats

Overview

Word Error Rate (WER) and Character Error Rate (CER) are the standard metrics for evaluating Automatic Speech Recognition (ASR) systems. Both are based on the minimum edit distance between the system's hypothesis transcription and the reference transcription, normalized by the length of the reference. In SpeechBrain, evaluation runs through the Brain.evaluate() method, which loads a checkpoint (typically the one with the best validation WER), runs inference on the test set, and drives the recipe hooks that compute these metrics with the ErrorRateStats class.

Mathematical Foundation

Word Error Rate (WER)

WER is the minimum number of word-level edits needed to turn the reference into the hypothesis, normalized by the number of words in the reference:

WER = (S + D + I) / N * 100%

Where:

S = number of substitutions (a word in the reference is replaced by a different word)
D = number of deletions (a word in the reference is missing from the hypothesis)
I = number of insertions (an extra word appears in the hypothesis that is not in the reference)
N = total number of words in the reference transcription

The minimum edit distance is the Levenshtein distance, computed by dynamic programming, which finds an optimal alignment between the hypothesis and reference word sequences. Because insertions count as errors while N only counts reference words, WER can exceed 100%.
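The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not SpeechBrain's implementation: the function names wer_counts and wer are hypothetical, the reference is assumed to be non-empty, and ties between equally cheap alignments are broken arbitrarily.

```python
def wer_counts(ref_words, hyp_words):
    """Return (S, D, I) for one minimum-cost alignment of hyp against ref."""
    n, m = len(ref_words), len(hyp_words)
    # cost[i][j] = (total_edits, S, D, I) for ref_words[:i] vs hyp_words[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                best = (t, s, d, ins)          # match, no cost
            else:
                best = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            best = min(best, (t + 1, s, d + 1, ins))  # deletion
            t, s, d, ins = cost[i][j - 1]
            best = min(best, (t + 1, s, d, ins + 1))  # insertion
            cost[i][j] = best
    _, s, d, ins = cost[n][m]
    return s, d, ins


def wer(ref, hyp):
    """WER in percent; assumes a non-empty, whitespace-tokenized reference."""
    ref_words, hyp_words = ref.split(), hyp.split()
    s, d, ins = wer_counts(ref_words, hyp_words)
    return 100.0 * (s + d + ins) / len(ref_words)


score = wer("the cat sat on the mat", "the cat set on a mat")
```

Here two of the six reference words are substituted, so score is 2 / 6 * 100, roughly 33.33.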

Character Error Rate (CER)

CER applies the same formula but at the character level instead of the word level. This is particularly useful for:

Languages without clear word boundaries (e.g., Chinese, Japanese)
Character-level ASR models
Understanding whether errors are minor misspellings or completely wrong words

In SpeechBrain, CER is computed by the same ErrorRateStats class with split_tokens=True, which splits the word-level predictions and references into individual characters before computing the edit distance.

Relationship Between WER and CER

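CER is typically lower than WER. A single wrong character makes the whole word count as an error in WER, while CER counts only the characters that actually differ and divides by the much larger number of reference characters. For example, transcribing "cat" as "cats" is one word substitution out of one reference word (WER = 100%) but one character insertion against three reference characters (CER ≈ 33%).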
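The same comparison can be run on a short sentence. The sketch below is plain Python rather than SpeechBrain code; edit_distance and error_rate are hypothetical helper names, and counting the space between words as a character is an assumption of this sketch (conventions differ between toolkits).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "the cat sat"
hyp = "the cats sat"

# Word level: tokens are whitespace-separated words.
wer = error_rate(ref.split(), hyp.split())
# Character level: tokens are single characters (spaces included here).
cer = error_rate(list(ref), list(hyp))
```

One of three words is wrong, so wer is about 33.3, while one inserted character against eleven reference characters gives a cer of about 9.1.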

Evaluation Workflow in SpeechBrain

The evaluation process follows these steps:

1. Checkpoint Loading

Brain.evaluate() forwards its min_key argument to on_evaluate_start(). With min_key="WER", the checkpointer loads the saved checkpoint with the lowest recorded WER, so evaluation uses the best model seen during training, not necessarily the last one.

2. Stage Initialization

on_stage_start(TEST, epoch=None) initializes fresh ErrorRateStats instances for both WER and CER tracking. This clears any accumulated statistics from previous evaluations.

3. Batch-Level Evaluation

For each test batch:

compute_forward(batch, TEST) runs the model and applies beam search decoding to produce word-level hypotheses
compute_objectives(predictions, batch, TEST) computes the CTC loss and calls ErrorRateStats.append() to accumulate per-utterance error statistics

4. Metric Summarization

on_stage_end(TEST, avg_test_loss, None) calls ErrorRateStats.summarize() to aggregate all per-utterance scores into corpus-level statistics including overall WER, total insertions, deletions, and substitutions.

5. Detailed Output

ErrorRateStats.write_stats() produces a detailed report including:

Corpus-level WER summary
Per-utterance alignments showing exactly where each error occurred
Breakdown of error types (substitutions, deletions, insertions)
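The order of the five steps can be made concrete with a toy mock. ToyBrain and ToyErrorStats below are hypothetical classes written only for this sketch: they mimic the sequence of hook calls described above and are not SpeechBrain's Brain or ErrorRateStats. Each "batch" is assumed to be a pre-scored (edit_count, reference_length) pair so that no model or decoder is needed.

```python
from enum import Enum, auto


class Stage(Enum):
    TEST = auto()


class ToyErrorStats:
    """Hypothetical stand-in for a metric accumulator."""

    def __init__(self):
        self.edits = 0
        self.ref_words = 0

    def append(self, edits, ref_len):
        self.edits += edits
        self.ref_words += ref_len

    def summarize(self):
        return 100.0 * self.edits / self.ref_words


class ToyBrain:
    """Hypothetical class that only reproduces the order of the hooks."""

    def __init__(self):
        self.calls = []

    def on_evaluate_start(self, min_key=None):
        # A real recipe would ask the checkpointer for the best checkpoint here.
        self.calls.append("on_evaluate_start")

    def on_stage_start(self, stage, epoch=None):
        self.calls.append("on_stage_start")
        self.wer_metric = ToyErrorStats()  # fresh statistics for this stage

    def evaluate_batch(self, batch, stage):
        self.calls.append("evaluate_batch")
        edits, ref_len = batch  # a real recipe would decode and score here
        self.wer_metric.append(edits, ref_len)

    def on_stage_end(self, stage, stage_loss, epoch=None):
        self.calls.append("on_stage_end")
        self.test_wer = self.wer_metric.summarize()

    def evaluate(self, test_set, min_key=None):
        self.on_evaluate_start(min_key=min_key)
        self.on_stage_start(Stage.TEST, epoch=None)
        for batch in test_set:
            self.evaluate_batch(batch, Stage.TEST)
        self.on_stage_end(Stage.TEST, stage_loss=None, epoch=None)
        return self.test_wer


brain = ToyBrain()
result = brain.evaluate([(2, 6), (0, 4)], min_key="WER")
```

With two errors over ten reference words, result is 20.0, and brain.calls records the hooks in the order listed in the steps above.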

Per-Utterance vs. Corpus-Level Metrics

SpeechBrain computes WER at two granularities:

Per-utterance WER -- the edit distance for each individual utterance, useful for identifying problematic examples
Corpus-level WER -- the total edits across all utterances divided by the total reference words, providing the overall system performance metric

The corpus-level metric is used for:

Checkpoint selection during training (via min_keys=["WER"])
Final system evaluation on the test set
Comparison with published results
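The two granularities generally give different numbers, because the corpus-level figure weights each utterance by its reference length. A small sketch with made-up counts (the utterance IDs and values are illustrative only, not taken from any dataset):

```python
# (edit_count, reference_length) per utterance; values are invented for illustration.
utterances = {
    "utt001": (2, 6),
    "utt002": (0, 4),
    "utt003": (3, 2),  # more edits than reference words, so WER exceeds 100%
}

# Per-utterance WER: each utterance is normalized by its own length.
per_utt_wer = {utt: 100.0 * e / n for utt, (e, n) in utterances.items()}

# Corpus-level WER: pool edits and reference words before dividing.
total_edits = sum(e for e, _ in utterances.values())
total_words = sum(n for _, n in utterances.values())
corpus_wer = 100.0 * total_edits / total_words

# Averaging per-utterance scores is a different quantity.
mean_utt_wer = sum(per_utt_wer.values()) / len(per_utt_wer)
```

Here corpus_wer is 5 / 12, about 41.7, while the unweighted mean of the per-utterance scores is about 61.1, since the short utterance with many errors dominates the average.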

Decoding Strategies and WER

The choice of decoding strategy affects WER:

Greedy decoding (used during validation) -- selects the most probable token at each time step; fast but suboptimal
Beam search (used during testing) -- explores multiple hypotheses simultaneously and can incorporate language model scores; slower but typically yields lower WER
Language model fusion -- optionally incorporates an n-gram language model during beam search to further reduce WER

In the CTC ASR recipe, greedy decoding is used during validation (for speed) and beam search with configurable beam size is used during final testing (for accuracy).
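For reference, greedy CTC decoding can be sketched as follows: take the most probable token in each frame, merge consecutive repeats, then drop blanks. This is a plain-Python illustration, not SpeechBrain's decoder; the function name ctc_greedy_decode, the blank index 0, and the three-symbol vocabulary are assumptions made for the example.

```python
def ctc_greedy_decode(frame_scores, blank_id=0):
    """Collapse the per-frame argmax path: merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the token changes and is not the blank symbol.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Toy per-frame scores over a hypothetical vocabulary.
vocab = {0: "<blank>", 1: "a", 2: "b"}
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # a, a new token because a blank came in between
    [0.10, 0.20, 0.70],  # b
]
tokens = ctc_greedy_decode(frames)
text = "".join(vocab[t] for t in tokens)
```

The decoded text is "aab". Beam search differs in that it keeps several candidate prefixes per frame instead of committing to the single best token, which is what allows language model scores to influence the result.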

Alignment Visualization

The write_stats() method outputs detailed alignment information. The example below is an illustrative layout; the exact format of the report may differ:

utterance_id: utt001
REF:  THE CAT SAT ON THE MAT
HYP:  THE CAT SET ON A   MAT
EVAL: =   =   S   =  S   =
Errors: 2 substitutions, 0 deletions, 0 insertions
WER: 33.33%

This visualization is useful for error analysis, for example for spotting words that are repeatedly substituted or dropped across utterances.
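An alignment of this kind can be recovered by backtracing through the edit-distance table. The sketch below is a plain-Python illustration, not the code behind write_stats(); align and render are hypothetical names, and when several alignments have the same cost the backtrace simply prefers the diagonal step.

```python
def align(ref, hyp):
    """Return (op, ref_word, hyp_word) triples for one minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
            )
    # Walk back from the bottom-right corner, recording one operation per step.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and (
            dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
        )
        if diag:
            op = "=" if ref[i - 1] == hyp[j - 1] else "S"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("D", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("I", "", hyp[j - 1]))
            j -= 1
    return ops[::-1]


def render(ops):
    """Format the alignment as three column-aligned rows."""
    widths = [max(len(r), len(h), 1) for _, r, h in ops]

    def row(cells):
        return " ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    return "\n".join([
        "REF:  " + row([r for _, r, _ in ops]),
        "HYP:  " + row([h for _, _, h in ops]),
        "EVAL: " + row([o for o, _, _ in ops]),
    ])


ops = align("THE CAT SAT ON THE MAT".split(), "THE CAT SET ON A MAT".split())
report = render(ops)
```

For this pair the operations are =, =, S, =, S, =, matching the two substitutions in the example above.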

Related Concepts

Implementation:Speechbrain_Speechbrain_Brain_Evaluate_With_ErrorRateStats -- the concrete implementation of evaluation with ErrorRateStats
WER is the primary metric used for checkpoint selection during CTC training (via min_keys=["WER"])
CER provides complementary information about the nature and severity of recognition errors
