
Principle:Speechbrain Speechbrain Whisper Model Loading

From Leeroopedia


Concept: Loading and configuring pretrained Whisper models via the HuggingFace Transformers interface for fine-tuning
Domains: Model_Architecture, ASR, Transfer_Learning
Knowledge Sources: Radford et al. 2023, "Robust Speech Recognition via Large-Scale Weak Supervision" (https://arxiv.org/abs/2212.04356)
Related Implementation: Implementation:Speechbrain_Speechbrain_Whisper_HFTransformersInterface

Overview

OpenAI's Whisper is a large-scale multitask speech model trained on 680,000 hours of weakly supervised web audio. SpeechBrain provides a wrapper class (Whisper) that integrates Whisper models through the HuggingFace Transformers interface, enabling seamless fine-tuning within SpeechBrain's training framework. This wrapper handles model downloading, weight initialization, tokenizer configuration, and provides structured access to the encoder-decoder architecture.

Architecture

Whisper is an encoder-decoder Transformer architecture:

  • Encoder: Processes log-mel spectrogram features (80 mel frequency bins from 30-second audio chunks) through a stack of Transformer encoder layers, converting the spectrogram features into high-dimensional contextual representations.
  • Decoder: An autoregressive Transformer decoder that generates text tokens conditioned on encoder states. It uses special tokens to specify language, task (transcribe or translate), and timestamps.
  • Tokenizer: A byte-level BPE tokenizer with special tokens including <|startoftranscript|>, <|endoftext|>, language tokens (e.g., <|en|>, <|fr|>), task tokens (<|transcribe|>, <|translate|>), and timestamp tokens.
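The special tokens above are consumed by the decoder in a fixed order at the start of generation. The following sketch (an illustrative helper, not SpeechBrain or HuggingFace API) shows how that prompt prefix is assembled; the `<|notimestamps|>` token, which suppresses timestamp prediction, is part of Whisper's token set but not listed above.

```python
def decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Illustrative sketch: Whisper's decoder prompt opens with special tokens
    in the order <|startoftranscript|>, language, task, then optionally
    <|notimestamps|> when timestamp prediction is disabled."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# e.g. French translation to English:
# decoder_prompt("fr", "translate")
```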

Model Variants

Whisper comes in several sizes, all accessible via HuggingFace Hub identifiers:

  • openai/whisper-tiny (39M parameters)
  • openai/whisper-base (74M parameters)
  • openai/whisper-small (244M parameters)
  • openai/whisper-medium (769M parameters)
  • openai/whisper-large-v2 (1550M parameters)

Variants whose identifiers end in ".en" (e.g., openai/whisper-tiny.en) are English-only and must not have language or task tokens set.
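When choosing a variant under a GPU memory budget, the size list above can be encoded in a lookup table. This is a hypothetical convenience helper (the Hub identifiers are real; the function itself is not part of SpeechBrain):

```python
# Parameter counts (in millions) from the variant list above.
WHISPER_VARIANTS = {
    "tiny": ("openai/whisper-tiny", 39),
    "base": ("openai/whisper-base", 74),
    "small": ("openai/whisper-small", 244),
    "medium": ("openai/whisper-medium", 769),
    "large-v2": ("openai/whisper-large-v2", 1550),
}

def pick_variant(max_params_m):
    """Return the Hub id of the largest variant within a parameter budget
    (hypothetical helper; budget is in millions of parameters)."""
    fitting = [(params, hub_id)
               for hub_id, params in WHISPER_VARIANTS.values()
               if params <= max_params_m]
    if not fitting:
        raise ValueError("no Whisper variant fits the budget")
    return max(fitting)[1]
```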

Fine-Tuning Strategies

The wrapper supports several fine-tuning strategies through its configuration parameters:

  • Full fine-tuning (freeze=False, freeze_encoder=False): All parameters are trainable. Most flexible but requires the most GPU memory.
  • Encoder freezing (freeze=False, freeze_encoder=True): Only the decoder is fine-tuned. This is the recommended approach for language-specific adaptation, as the encoder's acoustic representations are already robust, and fine-tuning only the decoder adapts the language model component.
  • Full freezing (freeze=True): The entire model is frozen and used as a feature extractor. Useful for downstream tasks that add new heads on top of Whisper representations.
  • Encoder-only mode (encoder_only=True): The decoder is deleted from memory entirely, and only encoder hidden states are returned. Useful for using Whisper as an audio feature extractor.
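The four strategies above reduce to a small truth table over the three flags. This sketch (a hypothetical helper mirroring the flag semantics described above, not SpeechBrain's API) makes the combinations explicit:

```python
def trainability(freeze=False, freeze_encoder=False, encoder_only=False):
    """Map the wrapper's flags to which components exist and are trainable,
    following the strategy descriptions above (illustrative, not real API)."""
    has_decoder = not encoder_only  # encoder_only deletes the decoder entirely
    return {
        "has_decoder": has_decoder,
        "encoder_trainable": not freeze and not freeze_encoder,
        "decoder_trainable": has_decoder and not freeze,
    }
```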

Multilingual Configuration

For multilingual models, the tokenizer must be configured with the correct language and task prefix tokens:

  • The language parameter (e.g., "fr" for French) sets the language token that tells the decoder which language to generate.
  • The task parameter ("transcribe" or "translate") controls whether the model outputs text in the source language or translates to English.
  • Non-multilingual models (those with vocab_size < 51865) must not have language or task tokens set.
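The vocabulary-size check and prefix construction described above can be sketched as follows (an illustrative validation helper built from the rules in this section, not SpeechBrain's actual code):

```python
MULTILINGUAL_VOCAB_SIZE = 51865  # threshold cited above

def language_task_prefix(vocab_size, language=None, task=None):
    """Illustrative: build the language/task prefix tokens, rejecting them
    for English-only checkpoints (vocab_size < 51865)."""
    multilingual = vocab_size >= MULTILINGUAL_VOCAB_SIZE
    if not multilingual and (language is not None or task is not None):
        raise ValueError("English-only Whisper models must not set language/task")
    if multilingual:
        return [f"<|{language or 'en'}|>", f"<|{task or 'transcribe'}|>"]
    return []
```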

Mel Spectrogram Processing

The wrapper handles mel spectrogram computation internally:

  1. Audio waveforms are padded or trimmed to exactly 480,000 samples (30 seconds at 16kHz).
  2. A Short-Time Fourier Transform (STFT) with n_fft=400 and hop_length=160 is applied.
  3. The magnitude spectrum is projected onto 80 mel filter banks.
  4. Log-mel features are computed: energies are clamped at a floor of 1e-10 before taking log10, the dynamic range is limited to 8 (values more than 8 below the maximum are clipped), and the result is rescaled to approximately [-1, 1].
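Steps 1 and 4 can be sketched in plain Python (the STFT and mel projection of steps 2-3 are omitted; this mirrors the log-compression recipe in Whisper's reference implementation, not SpeechBrain's exact code):

```python
import math

SAMPLE_RATE = 16000
N_SAMPLES = SAMPLE_RATE * 30  # 480,000 samples = 30 seconds

def pad_or_trim(wav):
    """Step 1: force the waveform to exactly 30 s of samples."""
    return (wav[:N_SAMPLES] + [0.0] * max(0, N_SAMPLES - len(wav)))[:N_SAMPLES]

def log_compress(mel_energies):
    """Step 4: clamp at 1e-10, log10, limit dynamic range to 8, rescale."""
    logs = [math.log10(max(e, 1e-10)) for e in mel_energies]
    top = max(logs)
    return [(max(v, top - 8.0) + 4.0) / 4.0 for v in logs]
```

With unit mel energies, `log_compress` returns 1.0 everywhere, the upper edge of the target range.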
