Implementation:Speechbrain Speechbrain ErrorRateStats For Whisper
| Field | Value |
|---|---|
| API | ErrorRateStats(merge_tokens=False, split_tokens=False, space_token="_", keep_values=True, extract_concepts_values=False, tag_in="", tag_out="", equality_comparator=_str_equals) with append(ids, predict, target, predict_len=None, target_len=None, ind2lab=None) and summarize(field=None) |
| Source | speechbrain/utils/metric_stats.py:L206-378 |
| Import | from speechbrain.utils.metric_stats import ErrorRateStats |
| Type | API Doc (same class as CTC evaluation, different context: Whisper-specific normalization applied before metric computation) |
| Outputs | summarize() returns dict with keys including WER, error_rate (alias of WER), insertions, deletions, substitutions, num_ref_tokens |
| Related Principle | Principle:Speechbrain_Speechbrain_WER_CER_Evaluation_For_Whisper |
Purpose
Computes Word Error Rate (WER) and Character Error Rate (CER) for evaluating Whisper ASR fine-tuning results. This is the same ErrorRateStats class used for CTC-based ASR evaluation, but in the Whisper context, hypotheses are first decoded via tokenizer.decode(t, skip_special_tokens=True) and normalized using tokenizer.normalize(text) before being passed to the metric computation.
Constructor
WER Computer
from speechbrain.utils.metric_stats import ErrorRateStats
# WER: default configuration (word-level comparison)
wer_metric = ErrorRateStats()
CER Computer
# CER: split_tokens=True splits words into characters
cer_metric = ErrorRateStats(split_tokens=True)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| merge_tokens | bool | False | If True, merges successive tokens into words (e.g., character-to-word) |
| split_tokens | bool | False | If True, splits tokens into characters (set True for CER) |
| space_token | str | "_" | Token used as word boundary for merge/split operations |
| keep_values | bool | True | Whether to keep concept values (for concept error rate) |
| extract_concepts_values | bool | False | Whether to extract concepts and values from predict/target |
| tag_in | str | "" | Start tag for concept extraction |
| tag_out | str | "" | End tag for concept extraction |
| equality_comparator | Callable | _str_equals | Function to compare two tokens for equality |
YAML Configuration
# WER computer (word-level)
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
# CER computer (character-level)
cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
split_tokens: True
Usage in Whisper Fine-Tuning
The key difference from CTC evaluation is how hypotheses and targets are prepared before passing them to ErrorRateStats:
# In ASR.compute_objectives():
# Step 1: Decode hypothesis token IDs to text strings
predicted_words = [
self.tokenizer.decode(t, skip_special_tokens=True).strip()
for t in hyps
]
# Step 2: Decode target token IDs to text strings
# (undo_padding is imported from speechbrain.utils.data_utils)
target_words = undo_padding(tokens, tokens_lens)
target_words = self.tokenizer.batch_decode(
target_words, skip_special_tokens=True
)
# Step 3: Apply Whisper-specific normalization
if hasattr(self.hparams, "normalized_transcripts"):
predicted_words = [
self.tokenizer.normalize(text).split(" ")
for text in predicted_words
]
target_words = [
self.tokenizer.normalize(text).split(" ")
for text in target_words
]
else:
predicted_words = [text.split(" ") for text in predicted_words]
target_words = [text.split(" ") for text in target_words]
# Step 4: Append to metrics (word lists, not tensors)
self.wer_metric.append(ids, predicted_words, target_words)
self.cer_metric.append(ids, predicted_words, target_words)
append() Method
def append(
self,
ids, # list of utterance ID strings
predict, # list of predicted word lists: [["hello", "world"], ...]
target, # list of target word lists: [["hello", "world"], ...]
predict_len=None, # optional relative lengths for unpadding
target_len=None, # optional relative lengths for unpadding
ind2lab=None, # optional index-to-label mapping function
):
When called from the Whisper recipe, predict and target are already lists of word lists (after tokenizer decoding and normalization), so predict_len, target_len, and ind2lab are not needed.
Internally, append computes edit distance alignments using wer_details_for_batch and stores per-utterance scores.
summarize() Method
stats = wer_metric.summarize()
# Returns:
# {
# "WER": 15.3, # Word Error Rate as percentage
# "error_rate": 15.3, # Same as WER (generic alias)
# "insertions": 42, # Total insertion errors
# "deletions": 31, # Total deletion errors
# "substitutions": 67, # Total substitution errors
# "num_ref_tokens": ..., # Total reference tokens
# ...
# }
# Access specific field:
wer_value = wer_metric.summarize("error_rate")
# Returns: 15.3
Lifecycle in Training
# on_stage_start: Initialize fresh metric computers
def on_stage_start(self, stage, epoch):
if stage != sb.Stage.TRAIN:
self.cer_metric = self.hparams.cer_computer()
self.wer_metric = self.hparams.error_rate_computer()
# on_stage_end: Summarize and log
def on_stage_end(self, stage, stage_loss, epoch):
stage_stats = {"loss": stage_loss}
if stage != sb.Stage.TRAIN:
stage_stats["CER"] = self.cer_metric.summarize("error_rate")
stage_stats["WER"] = self.wer_metric.summarize("error_rate")
if stage == sb.Stage.VALID:
# Save checkpoint keyed on WER
self.checkpointer.save_and_keep_only(
meta={"WER": stage_stats["WER"]},
min_keys=["WER"],
)
elif stage == sb.Stage.TEST:
# Write detailed WER statistics to file
with open(self.hparams.test_wer_file, "w") as w:
self.wer_metric.write_stats(w)
write_stats() Method
Writes detailed alignment information to a file, including per-utterance alignments showing insertions, deletions, and substitutions:
with open("wer_test.txt", "w", encoding="utf-8") as f:
wer_metric.write_stats(f)
This calls print_wer_summary and print_alignments from speechbrain.dataio.wer to produce a human-readable report.