Principle:Openai Whisper Single Segment Decoding

Overview

Single Segment Decoding is the low-level operation where a mel spectrogram of exactly 30 seconds is processed through the encoder-decoder transformer to produce text tokens autoregressively. This is the fundamental unit of speech recognition in Whisper: one fixed-length audio segment in, one sequence of text tokens out.

The 30-second constraint arises from the encoder's fixed-length sinusoidal positional embeddings. The encoder expects exactly 3000 mel frames (30 seconds at 100 frames per second), which its strided convolution reduces to 1500 positions, one per positional embedding.
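The frame arithmetic behind the 30-second limit can be checked with a short sketch. The constants follow the front end described in the Whisper paper (16 kHz audio, 10 ms hop, stride-2 second convolution); the helper fit_to_segment is a hypothetical name used only for illustration and operates on a plain list rather than a real spectrogram.

```python
# Constants matching the audio front end described in the Whisper paper.
SAMPLE_RATE = 16_000   # audio samples per second
HOP_LENGTH = 160       # samples between successive mel frames (10 ms)
CHUNK_SECONDS = 30     # fixed segment length
CONV_STRIDE = 2        # stride of the encoder's second convolution

n_samples = SAMPLE_RATE * CHUNK_SECONDS        # samples per segment
frames_per_second = SAMPLE_RATE // HOP_LENGTH  # mel frames per second
n_frames = n_samples // HOP_LENGTH             # mel frames per segment
n_audio_positions = n_frames // CONV_STRIDE    # encoder positions


def fit_to_segment(frames, length=n_frames, pad_value=0.0):
    """Hypothetical helper: force a list of frames to exactly `length` entries."""
    if len(frames) >= length:
        return frames[:length]                               # trim long input
    return frames + [pad_value] * (length - len(frames))     # pad short input


print(n_samples, frames_per_second, n_frames, n_audio_positions)
```

Shorter inputs are padded and longer ones trimmed, so the encoder always sees the same number of frames.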

The Full Decode Pipeline

Single segment decoding proceeds through a well-defined series of stages:

Stage 1: Audio Encoding

The 80-channel log-mel spectrogram of shape (80, 3000) is passed through the Whisper encoder, which consists of two convolutional input layers followed by a stack of transformer blocks. (The large-v3 checkpoint uses 128 mel channels instead of 80.) The encoder produces a sequence of audio feature vectors that capture acoustic and linguistic information. The decoder consumes these features via cross-attention in every decoder layer.

Stage 2: Initial Token Setup

The decoder requires an initial sequence of special tokens to condition its output. This sequence includes:

Start-of-transcript token (sot)
Language token — encodes the spoken language (e.g., <|en|>)
Task token — either <|transcribe|> or <|translate|>
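Optional <|notimestamps|> token, added when timestamp prediction is disabled
Optional prompt tokens from previous segments or user input, placed before the start-of-transcript token and introduced by <|startofprev|>
Optional prefix tokens, placed after the special tokens to force the beginning of the output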
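The ordering of these tokens can be sketched as follows. This is a minimal illustration that uses token strings instead of integer IDs; the function name build_initial_tokens and its parameters are assumptions made for this example, and the layout (prompt before the start-of-transcript token, <|notimestamps|> only when timestamps are disabled, prefix last) reflects the reference implementation as understood here.

```python
def build_initial_tokens(language="en", task="transcribe", timestamps=True,
                         prompt=None, prefix=None):
    """Hypothetical sketch of the decoder's initial token sequence."""
    tokens = []
    if prompt:
        # Previous-context tokens come first, introduced by <|startofprev|>.
        tokens.append("<|startofprev|>")
        tokens.extend(prompt)
    # Core special tokens: start-of-transcript, language, task.
    tokens += ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Only present when timestamp prediction is switched off.
        tokens.append("<|notimestamps|>")
    if prefix:
        # Forced beginning of the output for this segment.
        tokens.extend(prefix)
    return tokens


print(build_initial_tokens(language="de", task="translate", timestamps=False))
```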

Stage 3: Language Detection (Optional)

If no language is specified, the model performs a forward pass with just the start-of-transcript token and examines the probability distribution over all language tokens. The highest-probability language is selected, and its token is inserted into the initial sequence.
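A minimal sketch of the selection step, assuming the logits at the first decoding position are already available. The token IDs, the toy logits, and the function name detect_language are made up for illustration; the point is that the softmax is taken over language tokens only, so non-language tokens cannot win.

```python
import math


def detect_language(logits, language_token_ids):
    """Pick the most probable language from first-step logits.

    logits: dict mapping token id -> raw score
    language_token_ids: dict mapping token id -> language code
    """
    # Restrict attention to language tokens (equivalent to masking the rest).
    lang_logits = {tid: logits[tid] for tid in language_token_ids}
    # Numerically stable softmax over the language tokens only.
    peak = max(lang_logits.values())
    exps = {tid: math.exp(score - peak) for tid, score in lang_logits.items()}
    total = sum(exps.values())
    probs = {language_token_ids[tid]: e / total for tid, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy data: id 0 is a non-language token with a high score that must be ignored.
toy_logits = {0: 9.0, 10: 1.2, 11: 3.4, 12: 0.3}
toy_languages = {10: "en", 11: "de", 12: "fr"}
language, probabilities = detect_language(toy_logits, toy_languages)
print(language)
```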

Stage 4: Autoregressive Token Generation

The main decoding loop generates tokens one at a time:

The decoder receives all previously generated tokens plus the audio features from the encoder.
It produces logits (unnormalized scores) over the full vocabulary for the next token position.
Logit filters are applied to suppress unwanted tokens (blank tokens at the start, non-speech symbols, timestamp constraint violations).
The decoding strategy (greedy, sampling, or beam search) selects the next token(s) from the filtered logits.
The selected token is appended to the sequence.
The loop repeats until an end-of-text token is generated or the maximum sequence length is reached.
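The loop above can be sketched with a greedy strategy and a stand-in decoder. Everything here is a toy under stated assumptions: toy_decoder is a hypothetical function over a five-token vocabulary that simply echoes one "audio feature" per step, EOT is an arbitrary ID, and the sketch assumes a single start token.

```python
EOT = 0  # hypothetical end-of-text token id


def toy_decoder(tokens, audio_features):
    """Stand-in for the real decoder: returns logits over a 5-token vocabulary.

    It emits one audio feature per step and EOT once the features run out.
    Assumes `tokens` begins with exactly one start token.
    """
    step = len(tokens) - 1
    logits = [-10.0] * 5
    if step < len(audio_features):
        logits[audio_features[step]] = 5.0
    else:
        logits[EOT] = 5.0
    return logits


def greedy_decode(initial_tokens, audio_features, max_len=16, logit_filters=()):
    tokens = list(initial_tokens)
    while len(tokens) < max_len:                      # maximum length check
        logits = toy_decoder(tokens, audio_features)  # next-token logits
        for apply_filter in logit_filters:            # logit filtering
            logits = apply_filter(logits)
        # Greedy strategy: pick the highest-scoring token.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == EOT:                         # stop at end-of-text
            break
        tokens.append(next_token)                     # extend the sequence
    return tokens[len(initial_tokens):]


print(greedy_decode([4], [1, 2, 3]))
```

Sampling or beam search would replace only the selection line; the surrounding loop stays the same.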

Key-value caching is used to avoid recomputing attention over all previous tokens at each step. Only the most recent token needs to be processed through the decoder, with cached key-value pairs from prior steps reused.
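The effect of caching can be illustrated with a toy single-head self-attention layer. This is a sketch, not Whisper's implementation: the project function stands in for learned linear projections, and the class name CachedSelfAttention is hypothetical. It shows that processing only the newest token against cached keys and values gives the same result as recomputing over the whole sequence.

```python
import math


def attend(query, keys, values):
    """Dot-product attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def project(x, scale):
    """Hypothetical stand-in for a learned linear projection."""
    return [scale * xi for xi in x]


class CachedSelfAttention:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the newest token is projected; older keys/values are reused.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.5), self.keys, self.values)


def full_recompute(xs):
    """Recompute keys and values for every token, then attend from the last."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.5), keys, values)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer = CachedSelfAttention()
cached_outputs = [layer.step(x) for x in embeddings]
```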

Stage 5: Sequence Ranking

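When multiple candidate sequences are generated (via beam search or n-best sampling), they must be ranked. The maximum likelihood ranker scores each candidate by its summed log probability divided by a length term and selects the highest-scoring one. By default the length term is the sequence length, which gives the average log probability per token; if a length penalty is configured, a length-penalty term is used instead.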
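A minimal sketch of such a ranker is shown below. The function name rank_candidates is hypothetical, and the penalty formula ((5 + length) / 6) ** length_penalty is the Google NMT-style penalty that the reference implementation is understood to use; treat it as an assumption if a different variant applies.

```python
def rank_candidates(sum_logprobs, lengths, length_penalty=None):
    """Return (index of best candidate, list of scores)."""
    scores = []
    for logprob, length in zip(sum_logprobs, lengths):
        if length_penalty is None:
            # Plain length normalization: average log probability per token.
            divisor = length
        else:
            # Assumed Google NMT-style length penalty.
            divisor = ((5 + length) / 6) ** length_penalty
        scores.append(logprob / divisor)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Two candidates: a short one and a longer one with a lower total log probability.
best_index, scores = rank_candidates([-4.0, -6.0], [4, 10])
print(best_index, scores)
```

With plain normalization the longer candidate wins here because its per-token score is higher; with length_penalty=0.0 the divisor is 1 and the raw sums are compared instead.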

Logit Filtering

Several filters are applied to the raw logits before token selection:

SuppressBlank — Prevents blank and end-of-text tokens at the very beginning of decoding.
SuppressTokens — Assigns negative infinity to a predefined list of non-speech token IDs.
ApplyTimestampRules — Enforces constraints on timestamp tokens, such as requiring timestamps to be monotonically increasing and alternating between text and timestamp regions.
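The three filters can be sketched as functions that set disallowed logits to negative infinity. The eight-token vocabulary, the token IDs, and the function names are hypothetical, and the timestamp rule shown covers only the monotonicity constraint, not the full rule set.

```python
NEG_INF = float("-inf")


def suppress_blank(logits, n_generated, blank_ids, eot_id):
    """At the first generated position, forbid blank and end-of-text tokens."""
    logits = list(logits)
    if n_generated == 0:
        for tid in list(blank_ids) + [eot_id]:
            logits[tid] = NEG_INF
    return logits


def suppress_tokens(logits, banned_ids):
    """Forbid a fixed list of token ids (e.g. non-speech symbols)."""
    logits = list(logits)
    for tid in banned_ids:
        logits[tid] = NEG_INF
    return logits


def enforce_monotonic_timestamps(logits, last_timestamp_id, timestamp_begin):
    """Simplified rule: forbid timestamp tokens earlier than the last one emitted."""
    logits = list(logits)
    if last_timestamp_id is not None:
        for tid in range(timestamp_begin, last_timestamp_id):
            logits[tid] = NEG_INF
    return logits


# Toy vocabulary: 0 = end-of-text, 1 = blank, 2 = non-speech symbol,
# 3-4 = text tokens, 5-7 = timestamp tokens.
raw = [2.0, 3.0, 2.5, 1.0, 0.5, 0.2, 0.1, 0.0]
filtered = suppress_blank(raw, n_generated=0, blank_ids=[1], eot_id=0)
filtered = suppress_tokens(filtered, banned_ids=[2])
first_token = max(range(len(filtered)), key=filtered.__getitem__)
print(first_token)
```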

Relationship to the Full Pipeline

Single segment decoding handles one 30-second window. For audio longer than 30 seconds, the higher-level transcription pipeline calls this operation repeatedly over a sliding window, passing the text decoded from earlier segments as prompt tokens to condition the next.
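The outer loop can be sketched as follows. This is a simplified illustration: transcribe_long and fake_decode_segment are hypothetical names, the decoder is replaced by a stub, and the window here always advances by the number of frames the stub reports as consumed.

```python
FRAMES_PER_SEGMENT = 3000  # 30 seconds at 100 mel frames per second


def transcribe_long(n_total_frames, decode_segment):
    """Slide a 30-second window over the audio, conditioning on earlier text."""
    seek = 0
    previous_tokens = []
    results = []
    while seek < n_total_frames:
        window = (seek, min(seek + FRAMES_PER_SEGMENT, n_total_frames))
        tokens, frames_consumed = decode_segment(window, previous_tokens)
        results.append((window, tokens))
        previous_tokens = previous_tokens + tokens  # prompt for the next segment
        seek += max(frames_consumed, 1)             # always make progress
    return results


def fake_decode_segment(window, prompt_tokens):
    """Stub for single segment decoding: one token per window, whole window consumed."""
    start, end = window
    return [f"text@{start}"], end - start


segments = transcribe_long(7000, fake_decode_segment)
for window, tokens in segments:
    print(window, tokens)
```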

References

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356
