Implementation:Speechbrain Speechbrain ErrorRateStats For Whisper
| Field | Value |
|---|---|
| API | ErrorRateStats(merge_tokens=False, split_tokens=False, space_token="_", keep_values=True, extract_concepts_values=False, tag_in="", tag_out="", equality_comparator=_str_equals) with append(ids, predict, target, predict_len=None, target_len=None, ind2lab=None) and summarize(field=None) |
| Source | speechbrain/utils/metric_stats.py:L206-378 |
| Import | from speechbrain.utils.metric_stats import ErrorRateStats |
| Type | API Doc (same class as CTC evaluation, different context: Whisper-specific normalization applied before metric computation) |
| Outputs | summarize() returns dict with keys including WER, error_rate (alias of WER), insertions, deletions, substitutions, num_ref_tokens |
| Related Principle | Principle:Speechbrain_Speechbrain_WER_CER_Evaluation_For_Whisper |
Purpose
Computes Word Error Rate (WER) and Character Error Rate (CER) for evaluating Whisper ASR fine-tuning results. This is the same ErrorRateStats class used for CTC-based ASR evaluation, but in the Whisper context, hypotheses are first decoded via tokenizer.decode(t, skip_special_tokens=True) and normalized using tokenizer.normalize(text) before being passed to the metric computation.
Constructor
WER Computer
from speechbrain.utils.metric_stats import ErrorRateStats
# WER: default configuration (word-level comparison)
wer_metric = ErrorRateStats()
CER Computer
# CER: split_tokens=True splits words into characters
cer_metric = ErrorRateStats(split_tokens=True)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| merge_tokens | bool | False | If True, merges successive tokens into words (e.g., character-to-word) |
| split_tokens | bool | False | If True, splits tokens into characters (set True for CER) |
| space_token | str | "_" | Token used as word boundary for merge/split operations |
| keep_values | bool | True | Whether to keep concept values (for concept error rate) |
| extract_concepts_values | bool | False | Whether to extract concepts and values from predict/target |
| tag_in | str | "" | Start tag for concept extraction |
| tag_out | str | "" | End tag for concept extraction |
| equality_comparator | Callable | _str_equals | Function to compare two tokens for equality |
YAML Configuration
# WER computer (word-level)
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
# CER computer (character-level)
cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
split_tokens: True
Usage in Whisper Fine-Tuning
The key difference from CTC evaluation is how hypotheses and targets are prepared before passing them to ErrorRateStats:
# In ASR.compute_objectives():
# Step 1: Decode hypothesis token IDs to text strings
predicted_words = [
self.tokenizer.decode(t, skip_special_tokens=True).strip()
for t in hyps
]
# Step 2: Decode target token IDs to text strings
# (undo_padding is imported from speechbrain.utils.data_utils)
target_words = undo_padding(tokens, tokens_lens)
target_words = self.tokenizer.batch_decode(
target_words, skip_special_tokens=True
)
# Step 3: Apply Whisper-specific normalization
if hasattr(self.hparams, "normalized_transcripts"):
predicted_words = [
self.tokenizer.normalize(text).split(" ")
for text in predicted_words
]
target_words = [
self.tokenizer.normalize(text).split(" ")
for text in target_words
]
else:
predicted_words = [text.split(" ") for text in predicted_words]
target_words = [text.split(" ") for text in target_words]
# Step 4: Append to metrics (word lists, not tensors)
self.wer_metric.append(ids, predicted_words, target_words)
self.cer_metric.append(ids, predicted_words, target_words)
append() Method
def append(
self,
ids, # list of utterance ID strings
predict, # list of predicted word lists: [["hello", "world"], ...]
target, # list of target word lists: [["hello", "world"], ...]
predict_len=None, # optional relative lengths for unpadding
target_len=None, # optional relative lengths for unpadding
ind2lab=None, # optional index-to-label mapping function
):
When called from the Whisper recipe, predict and target are already lists of word lists (after tokenizer decoding and normalization), so predict_len, target_len, and ind2lab are not needed.
Internally, append computes edit distance alignments using wer_details_for_batch and stores per-utterance scores.
summarize() Method
stats = wer_metric.summarize()
# Returns:
# {
# "WER": 15.3, # Word Error Rate as percentage
# "error_rate": 15.3, # Same as WER (generic alias)
# "insertions": 42, # Total insertion errors
# "deletions": 31, # Total deletion errors
# "substitutions": 67, # Total substitution errors
# "num_ref_tokens": ..., # Total reference tokens
# ...
# }
# Access specific field:
wer_value = wer_metric.summarize("error_rate")
# Returns: 15.3
Lifecycle in Training
# on_stage_start: Initialize fresh metric computers
def on_stage_start(self, stage, epoch):
if stage != sb.Stage.TRAIN:
self.cer_metric = self.hparams.cer_computer()
self.wer_metric = self.hparams.error_rate_computer()
# on_stage_end: Summarize and log
def on_stage_end(self, stage, stage_loss, epoch):
stage_stats = {"loss": stage_loss}
if stage != sb.Stage.TRAIN:
stage_stats["CER"] = self.cer_metric.summarize("error_rate")
stage_stats["WER"] = self.wer_metric.summarize("error_rate")
if stage == sb.Stage.VALID:
# Save checkpoint keyed on WER
self.checkpointer.save_and_keep_only(
meta={"WER": stage_stats["WER"]},
min_keys=["WER"],
)
elif stage == sb.Stage.TEST:
# Write detailed WER statistics to file
with open(self.hparams.test_wer_file, "w") as w:
self.wer_metric.write_stats(w)
write_stats() Method
Writes detailed alignment information to a file, including per-utterance alignments showing insertions, deletions, and substitutions:
with open("wer_test.txt", "w", encoding="utf-8") as f:
wer_metric.write_stats(f)
This calls print_wer_summary and print_alignments from speechbrain.dataio.wer to produce a human-readable report.