
Principle:Speechbrain Speechbrain Beam Search Decoding

From Leeroopedia


Concept: Using beam search to decode optimal token sequences from autoregressive models
Domains: Decoding, ASR, Sequence_Modeling
Related Implementation: Implementation:Speechbrain_Speechbrain_S2SWhisperBeamSearcher

Overview

Beam search is a heuristic search algorithm that explores the space of possible output sequences by maintaining a fixed number of the most promising partial hypotheses at each decoding step. For Whisper ASR, SpeechBrain provides specialized beam search and greedy search implementations that handle Whisper's unique token structure, including language tokens, task tokens, and timestamp tokens.

Algorithm

At each decoding step t:

  1. Expand: For each of the B (beam_size) active hypotheses, compute the probability distribution over the entire vocabulary by running a forward pass through the decoder.
  2. Score: Extend each hypothesis with each possible next token, computing the cumulative log-probability of the resulting sequences.
  3. Prune: From all B * V candidate extensions (where V is the vocabulary size), keep only the top B sequences by cumulative score.
  4. Terminate: Hypotheses that emit the EOS token are moved to the completed set.
  5. Repeat: Continue until all beams have terminated or the maximum length is reached.

Greedy search is the special case where B = 1, always selecting the single highest-probability token at each step.
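The expand/score/prune/terminate loop above can be sketched in plain Python over a toy model. This is an illustrative sketch, not SpeechBrain's implementation; `step_fn` stands in for a decoder forward pass and returns next-token log-probabilities for a prefix.

```python
import math

def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Minimal beam search over a toy autoregressive model.

    step_fn(prefix) returns a dict {token: log_prob} for the next token.
    """
    beams = [([bos], 0.0)]          # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:    # 1. Expand each active hypothesis
            for tok, lp in step_fn(seq).items():
                candidates.append((seq + [tok], score + lp))  # 2. Score
        # 3. Prune: keep only the top beam_size candidates overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:      # 4. Terminate finished hypotheses
                completed.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:               # 5. Stop when all beams are done
            break
    completed.extend(beams)         # keep unfinished beams at max_len
    return max(completed, key=lambda c: c[1])

# Toy model: prefers token 1, then emits EOS (id 9) after 3 tokens
def toy_step(prefix):
    if len(prefix) >= 3:
        return {9: math.log(0.9), 1: math.log(0.1)}
    return {1: math.log(0.6), 2: math.log(0.4)}

best_seq, best_score = beam_search(toy_step, bos=0, eos=9,
                                   beam_size=2, max_len=10)
print(best_seq)  # [0, 1, 1, 9]
```

Setting `beam_size=1` here recovers greedy search: only the single best extension survives each step.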

Whisper-Specific Decoding

Whisper's searchers implement several model-specific features:

Initial Tokens

Decoding begins with a sequence of special tokens that configure the model's behavior:

[<|startoftranscript|>] [<|language|>] [<|task|>] [<|notimestamps|>]

These tokens are set as the initial decoder memory before the search begins. The last initial token serves as the first input to the decoder.
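A minimal sketch of how that prompt is assembled. The numeric IDs below are illustrative of the multilingual Whisper vocabulary; in practice they come from Whisper's tokenizer, not hard-coded constants.

```python
# Illustrative special-token IDs (assumed values, not authoritative);
# real IDs are looked up from Whisper's tokenizer.
SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS = 50258, 50259, 50359, 50363

initial_tokens = [SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS]

# All but the last token seed the decoder memory; the last initial
# token is fed as the first decoder input when the search begins.
decoder_memory = initial_tokens[:-1]
first_input = initial_tokens[-1]
print(decoder_memory, first_input)
```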

KV Caching

Key-value (KV) caching stores the computed attention keys and values from previous decoding steps. This avoids redundant computation:

  • At step t, only the new token needs to be processed through the decoder.
  • The cached keys and values from steps 0 to t-1 are reused.
  • For beam search, the KV cache must be reordered when beams are pruned (via _reorder_cache).
  • This provides significant speedups, especially for long sequences.
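The reordering step can be sketched with plain lists standing in for tensors. This mirrors what a `_reorder_cache`-style helper does: after pruning, each surviving beam must point at its parent's cached keys and values (with real tensors this is an `index_select` along the beam dimension).

```python
def reorder_cache(cache, beam_idx):
    """Reorder per-layer cached keys/values so slot i of the new cache
    comes from surviving parent beam beam_idx[i].

    cache is a list of (keys, values) per decoder layer, where keys
    and values are lists indexed by beam. A sketch, not SpeechBrain's
    tensor implementation.
    """
    return [
        ([keys[i] for i in beam_idx], [values[i] for i in beam_idx])
        for keys, values in cache
    ]

# Two beams, one layer; after pruning, both slots descend from beam 1
cache = [(["k0", "k1"], ["v0", "v1"])]
new_cache = reorder_cache(cache, beam_idx=[1, 1])
print(new_cache)  # [(['k1', 'k1'], ['v1', 'v1'])]
```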

Token Suppression

Several categories of tokens are suppressed (assigned negative infinity log-probability) during decoding:

  • Blank suppression: At the first decoding step, blank tokens and EOS are suppressed to prevent empty outputs.
  • Non-speech tokens: Symbols like musical notes, speaker tags, and formatting characters are suppressed to keep outputs clean.
  • Task/language tokens: The transcribe, translate, BOS, and related special tokens are suppressed during normal decoding to prevent the model from generating control tokens as output.
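Suppression itself is a one-line masking operation: the listed token IDs get a log-probability of negative infinity so they can never win the argmax or survive beam pruning. A sketch with a toy five-token vocabulary (SpeechBrain applies the same idea to torch tensors):

```python
import math

def suppress_tokens(log_probs, suppress_ids):
    """Assign -inf log-probability to suppressed token IDs so they
    can never be selected during decoding."""
    out = list(log_probs)
    for tok in suppress_ids:
        out[tok] = -math.inf
    return out

# Suppress EOS (id 4) at the first step to prevent an empty output
scores = suppress_tokens([-1.0, -0.5, -2.0, -3.0, -0.1],
                         suppress_ids=[4])
best = scores.index(max(scores))
print(best)  # 1  (EOS would have won without suppression)
```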

Temperature Scaling

Temperature controls the sharpness of the probability distribution:

  • temperature = 0.0 (greedy): Takes the argmax directly, producing deterministic outputs.
  • temperature = 1.0: Uses the raw model probabilities, allowing more diversity.
  • temperature < 1.0: Sharpens the distribution, making the model more confident.
  • temperature > 1.0: Flattens the distribution, increasing randomness.

For beam search, the log probabilities are divided by the temperature before scoring.
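The effect of temperature can be seen in a small standalone softmax, with logits divided by the temperature before normalizing (the zero-temperature case is handled separately as a plain argmax):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize.
    temperature < 1 sharpens the distribution; > 1 flattens it.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more peaked
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
print(round(sharp[0], 3), round(flat[0], 3))
```

The top token's probability mass grows as temperature drops, which is why low temperatures make decoding more deterministic.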

Greedy vs. Beam Search in Practice

The Whisper fine-tuning recipe uses different searchers at different stages:

  • Validation: S2SWhisperGreedySearcher is used for speed. Greedy search is fast and provides a reasonable approximation of model quality during training.
  • Testing: S2SWhisperBeamSearcher with beam_size=8 is used for final evaluation. Beam search explores more hypotheses and typically yields 1-3% lower WER compared to greedy search.

Maximum Length Control

The searchers enforce a maximum output length based on the decoder's maximum attention window (max_attn_tokens). The actual maximum number of generated tokens is max_attn_tokens - sample_begin, where sample_begin is the number of initial special tokens. This prevents the decoder from generating infinitely long sequences.
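As a concrete arithmetic example, assuming a Whisper-like context window of 448 decoder positions and the four initial special tokens shown earlier (both values illustrative):

```python
# Hypothetical Whisper-like values: the decoder attends over at most
# max_attn_tokens positions, and sample_begin initial special tokens
# are consumed before generation starts.
max_attn_tokens = 448
sample_begin = 4

max_new_tokens = max_attn_tokens - sample_begin
print(max_new_tokens)  # 444
```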
