Workflow:Speechbrain Speechbrain Whisper ASR Finetuning
| Knowledge Sources | |
|---|---|
| Domains | ASR, Transfer_Learning, Speech_Processing |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for fine-tuning OpenAI's Whisper speech recognition model on domain-specific or language-specific data using SpeechBrain's Brain framework.
Description
This workflow covers the procedure for adapting OpenAI's pretrained Whisper model to specific languages or domains using SpeechBrain. Whisper is a large-scale multilingual ASR model pretrained on 680,000 hours of web audio. Fine-tuning adapts the encoder-decoder architecture to a target domain while preserving the pretrained knowledge. The process uses SpeechBrain's HuggingFace integration to load the Whisper model, applies the Whisper tokenizer for consistent text processing, and trains with negative log-likelihood loss. The recipe is demonstrated on CommonVoice across multiple languages and supports optional waveform augmentation.
Usage
Execute this workflow when you need to improve Whisper's performance on a specific language, accent, domain, or vocabulary that is underrepresented in its pretraining data. This is appropriate when you have a moderate amount of labeled speech data (tens to hundreds of hours) and want to leverage Whisper's pretrained multilingual knowledge. Fine-tuning is preferable to training from scratch when the target domain has limited data but shares acoustic characteristics with Whisper's pretraining distribution.
Execution Steps
Step 1: Target Dataset Preparation
Prepare the target domain dataset using the appropriate preparation script. For CommonVoice, this involves converting TSV metadata files into SpeechBrain CSV manifests with audio paths, durations, and transcriptions. The preparation handles language-specific filtering, text normalization, and train/dev/test split creation.
Key considerations:
- Text transcriptions must be compatible with Whisper's tokenizer vocabulary
- Audio files may need resampling to 16 kHz to match Whisper's expected input
- Language-specific preparation handles character set normalization
- Dataset statistics (total hours, vocabulary size) inform fine-tuning hyperparameters
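The manifest conversion described above can be sketched as follows. This is a minimal illustration, not the recipe's actual preparation script: the function name `commonvoice_tsv_to_manifest`, the `durations` lookup (assumed to be precomputed from the audio files), and the exact manifest column set are assumptions made for the example. Only the TSV column names `path` and `sentence` are taken from CommonVoice's metadata format.

```python
import csv
import io


def commonvoice_tsv_to_manifest(tsv_text, clips_dir, durations):
    """Convert CommonVoice-style TSV metadata into CSV manifest text.

    durations: dict mapping clip file name -> length in seconds
    (hypothetical input; a real script would read this from the audio).
    """
    # QUOTE_NONE keeps quotation marks inside sentences intact.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "duration", "wav", "wrd"])
    for row in reader:
        clip = row["path"]
        text = " ".join(row["sentence"].split())  # collapse whitespace
        if not text or clip not in durations:
            continue  # skip empty transcripts and clips without audio info
        utt_id = clip.rsplit(".", 1)[0]
        duration = "{:.2f}".format(durations[clip])
        writer.writerow([utt_id, duration, clips_dir + "/" + clip, text])
    return out.getvalue()


sample_tsv = (
    "client_id\tpath\tsentence\n"
    "abc\tclip_1.mp3\tHello   world\n"
    "abc\tclip_2.mp3\t\n"
)
manifest = commonvoice_tsv_to_manifest(
    sample_tsv, "/data/cv/clips", {"clip_1.mp3": 2.5, "clip_2.mp3": 1.0}
)
print(manifest)
```

Summing the `duration` column of the resulting manifest gives the total hours mentioned in the last consideration.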
Step 2: Whisper Model and Tokenizer Loading
Load the pretrained Whisper model and its associated tokenizer through SpeechBrain's HuggingFace interface. The configuration specifies the Whisper variant (tiny, base, small, medium, large) and sets up the encoder-decoder architecture with the appropriate language and task tokens. The tokenizer handles special token insertion, padding, and BOS/EOS management specific to Whisper's format.
Key considerations:
- Whisper model size should match available GPU memory (large models need significant VRAM)
- The tokenizer's language and task tokens must be set correctly for the target language
- Padding token handling requires special attention to avoid training on padding positions
- The model is loaded through HuggingFace's transformers integration
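To make the memory consideration concrete, the sketch below estimates a lower bound on training-state memory for each Whisper size. It assumes full fp32 fine-tuning with an Adam-style optimizer (about 16 bytes per parameter for weights, gradients, and two optimizer moments) and ignores activations, which can dominate in practice. The parameter counts are the approximate published figures; the helper names and the 50% headroom rule are illustrative assumptions, not part of SpeechBrain.

```python
# Approximate parameter counts, in millions, for the published Whisper sizes.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def training_state_gib(variant, bytes_per_param=16):
    """Lower bound on memory for weights + gradients + Adam moments (fp32).

    Activations and framework overhead are NOT included.
    """
    return WHISPER_PARAMS_M[variant] * 1e6 * bytes_per_param / 2**30


def largest_variant_that_fits(gpu_gib, headroom=0.5):
    """Pick the largest variant whose training state fits in a fraction
    (headroom, an arbitrary assumption) of the GPU memory."""
    budget = gpu_gib * headroom
    fitting = [v for v in WHISPER_PARAMS_M if training_state_gib(v) <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda v: WHISPER_PARAMS_M[v])


for name in WHISPER_PARAMS_M:
    print(name, round(training_state_gib(name), 2), "GiB")
print(largest_variant_that_fits(24))
```

Mixed precision, gradient accumulation, or freezing part of the model would change these numbers, so the estimate is only a starting point for choosing a variant.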
Step 3: Data Pipeline With Whisper Tokenization
Construct the data pipeline that processes audio signals and tokenizes text using Whisper's tokenizer. The audio pipeline reads and optionally augments waveforms. The text pipeline applies Whisper's tokenizer to convert transcriptions into token sequences with appropriate special tokens (BOS, EOS, language token, task token). Padding masks are computed to exclude padding positions from loss computation.
Key considerations:
- Whisper expects specific special token sequences at the start of each target sequence
- Padding must be handled carefully to avoid gradient corruption
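- Optional waveform augmentation (speed perturbation, noise addition) requires label duplication: when augmented copies are concatenated with the original batch, the token sequences must be replicated to match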
- The tokenizer's batch_decode() is used to convert predictions back to text
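The special-token prefix and padding mask described above can be sketched in plain Python. The token IDs below are hypothetical placeholders; real IDs come from the Whisper tokenizer. The helper names `build_decoder_pair` and `pad_batch` are also assumptions for this example. The mask is derived from sequence lengths rather than by comparing against the padding ID, which stays correct even if the padding ID coincides with the end-of-text ID.

```python
# Hypothetical token IDs for illustration only.
SOT, LANG, TASK, NO_TS, EOT, PAD = 1, 2, 3, 4, 5, 0


def build_decoder_pair(text_ids):
    """Return (decoder_input, target) for one utterance.

    The full sequence is: start-of-transcript, language, task,
    no-timestamps, text tokens, end-of-text. The decoder input drops the
    last token and the target drops the first (shift by one).
    """
    full = [SOT, LANG, TASK, NO_TS] + list(text_ids) + [EOT]
    return full[:-1], full[1:]


def pad_batch(seqs, pad_id=PAD):
    """Right-pad sequences and return (padded, mask); mask is 1 on real tokens."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


pairs = [build_decoder_pair(ids) for ids in ([10, 11], [12])]
inputs, _ = pad_batch([p[0] for p in pairs])
targets, target_mask = pad_batch([p[1] for p in pairs])
print(inputs)
print(targets)
print(target_mask)
```

Positions where `target_mask` is 0 are the ones to exclude from the loss.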
Step 4: Fine-tuning With Learning Rate Scheduling
Train the Whisper model using the Brain framework with a carefully configured learning rate schedule. The entire encoder-decoder is fine-tuned with a single optimizer at a low learning rate to preserve pretrained knowledge. Learning rate warmup keeps the first updates small, which reduces the risk of catastrophic forgetting early in training, and is followed by gradual decay. The loss is the negative log-likelihood of the target tokens, computed from the decoder's output logits.
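As a reference for what the loss computes, here is a dependency-free sketch of token-level negative log-likelihood with a padding mask. It is written with plain lists for readability; an actual training loop would use tensor operations. The function names are illustrative, not SpeechBrain's API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_nll(batch_logits, batch_targets, batch_mask):
    """Mean negative log-likelihood over positions where mask == 1."""
    total, count = 0.0, 0
    for logits_seq, targets, mask in zip(batch_logits, batch_targets, batch_mask):
        for logits, target, keep in zip(logits_seq, targets, mask):
            if keep:
                total -= log_softmax(logits)[target]
                count += 1
    return total / max(count, 1)


# One utterance, three positions, vocabulary of 4 tokens.
# The last position is padding, so its (extreme) logits are ignored.
logits = [[[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [99.0, -99.0, 0.0, 0.0]]]
targets = [[2, 3, 1]]
mask = [[1, 1, 0]]
loss = masked_nll(logits, targets, mask)
print(loss)
```

With uniform logits over four tokens, each unmasked position contributes log(4), so the masked mean equals log(4) regardless of what the padded position contains.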
Key considerations:
- A low learning rate (1e-5 range) prevents catastrophic forgetting of pretrained knowledge
- Warmup steps gradually increase the learning rate from near-zero
- The entire model is typically fine-tuned (not just the decoder)
- Gradient accumulation enables larger effective batch sizes on limited GPU memory
- Mixed precision training is recommended for larger Whisper variants
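The warmup-then-decay shape and the effect of gradient accumulation can be illustrated as below. This sketch assumes linear warmup followed by linear decay; the recipe may use a different scheduler, and the default values for `peak_lr`, `warmup_steps`, and `total_steps` are placeholders rather than recommended settings.

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=500, total_steps=10000):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of utterances contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


for s in (0, 250, 499, 500, 5000, 10000):
    print(s, lr_at_step(s))
print(effective_batch_size(per_device_batch=4, accumulation_steps=8))
```

When accumulation is used, the schedule should be indexed by optimizer updates rather than by forward passes, otherwise warmup ends earlier than intended.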
Step 5: Decoding With Beam Search
During validation and testing, generate transcription hypotheses using beam search decoding. The decoder autoregressively generates token sequences conditioned on the encoder output, exploring multiple hypotheses in parallel. Separate search configurations can be used for validation (faster, narrower beam) and testing (wider beam for best quality).
Key considerations:
- Beam width trades off between decoding quality and speed
- Validation uses a narrower beam for faster iteration during training
- Test decoding uses a wider beam for optimal final results
- Whisper's built-in language detection and timestamp prediction can optionally be enabled
- Text normalization is applied before computing metrics for fair comparison
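The quality versus speed trade-off of beam width can be seen in a toy example. The sketch below is a generic beam search over a hand-written next-token distribution, not SpeechBrain's searcher: `toy_model` and its probabilities are invented for illustration, and the search omits length normalization and other refinements a production decoder would typically include.

```python
import math

BOS, EOS, A, B = 0, 1, 2, 3  # hypothetical token IDs


def toy_model(tokens):
    """Return log-probabilities of the next token given the prefix."""
    if len(tokens) == 1:
        probs = {A: 0.6, B: 0.4}
    elif len(tokens) == 2 and tokens[1] == A:
        probs = {EOS: 0.55, B: 0.45}
    elif len(tokens) == 2:
        probs = {EOS: 0.9, A: 0.1}
    else:
        probs = {EOS: 1.0}
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(next_log_probs, bos, eos, beam_size, max_len):
    """Keep the beam_size best partial hypotheses at each step."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]


greedy = beam_search(toy_model, BOS, EOS, beam_size=1, max_len=5)
wide = beam_search(toy_model, BOS, EOS, beam_size=2, max_len=5)
print(greedy, wide)
```

Here a beam of 1 commits to the locally best first token and ends with a sequence of probability 0.33, while a beam of 2 recovers the sequence with probability 0.36, at the cost of scoring twice as many prefixes per step.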
Step 6: WER and CER Evaluation
Evaluate the fine-tuned model by computing Word Error Rate (WER) and Character Error Rate (CER) on the test set. Decoded hypotheses are compared against reference transcriptions after optional text normalization. Per-utterance results are written to an output file, and aggregate statistics are reported.
Key considerations:
- Text normalization standardizes casing, punctuation, and number formatting before comparison
- WER measures word-level accuracy; CER measures character-level accuracy
- Results should be compared against both the baseline Whisper model (without fine-tuning) and other methods evaluated on the same test set
- The best checkpoint is selected based on validation WER before final evaluation
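For reference, WER and CER both reduce to an edit distance divided by the reference length, computed over words or characters respectively. The standalone sketch below shows that calculation. The `normalize` function is a deliberately simplified assumption (lowercasing and punctuation removal only) and does not reproduce Whisper's text normalizer, which also handles numbers and other cases.

```python
import re


def normalize(text):
    """Simplified normalization: lowercase, drop punctuation, squeeze spaces."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref), normalize(hyp)
        r_units = r.split() if unit == "word" else list(r)
        h_units = h.split() if unit == "word" else list(h)
        errors += edit_distance(r_units, h_units)
        total += len(r_units)
    return 100.0 * errors / total


refs = ["The cat sat.", "Hello world"]
hyps = ["the cat sat", "hello word"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(wer, cer)
```

In this example one of five reference words is wrong, giving a WER of 20%, while only one of 22 reference characters is missing, giving a much lower CER. The gap between the two is a common sign that errors are near-misses in spelling rather than wholly wrong words.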