Workflow:Speechbrain Speechbrain Speech Enhancement Training
| Knowledge Sources | |
|---|---|
| Domains | Speech_Enhancement, Speech_Processing, Deep_Learning |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for training speech enhancement models using MetricGAN, spectral masking, or waveform mapping approaches to remove noise and improve speech quality within SpeechBrain.
Description
This workflow covers training speech enhancement systems that remove noise or reverberation from degraded speech signals. SpeechBrain provides multiple enhancement approaches: MetricGAN/MetricGAN-U (GAN-based optimization of perceptual metrics such as PESQ), spectral masking (predicting time-frequency masks), waveform mapping (direct waveform-domain processing), SEGAN (adversarial waveform enhancement), and SepFormer (a dual-path transformer also used for separation). MetricGAN is distinctive in that it optimizes perceptual quality metrics directly through a learned discriminator, whereas the spectral masking and waveform methods use conventional regression losses. The workflow is demonstrated on the Voicebank-DEMAND dataset and on DNS Challenge data.
Usage
Use this workflow when you have noisy or reverberant speech recordings and need to train a model that produces clean, enhanced audio. Typical uses are preprocessing noisy audio before ASR, improving audio quality in communication systems, and building enhancement components for hearing aids and conferencing applications. The choice of method depends on the goal: MetricGAN for optimizing PESQ/STOI directly, spectral masking for general denoising, or SepFormer for high-quality enhancement on larger datasets such as DNS.
Execution Steps
Step 1: Noisy Speech Data Preparation
Prepare parallel clean-noisy speech data from the target dataset. For the Voicebank-DEMAND corpus, the preparation script parses the directory structure to create manifest files mapping each noisy utterance to its clean reference. For DNS Challenge data, the preparation involves downloading, decompressing, synthesizing noisy-clean pairs using the noise synthesizer, and optionally creating WebDataset shards for efficient large-scale loading.
Key considerations:
- Parallel data requires matched clean and noisy versions of each utterance
- The Voicebank preparation handles the standard noisy-clean directory structure
- DNS data preparation involves a multi-step pipeline: download, decompress, synthesize, shard
- Duration-based filtering removes very short or very long utterances
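The pairing and duration filtering described above can be sketched in plain Python. This is a minimal illustration, not SpeechBrain's preparation script: the function names, the manifest keys (`noisy_wav`, `clean_wav`, `length`), the duration thresholds, and the assumption that clean and noisy files share a file name are all hypothetical.

```python
import json
import wave
import tempfile
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_manifest(noisy_dir, clean_dir, min_dur=0.5, max_dur=20.0):
    """Map each noisy utterance to the clean file with the same name.

    Assumption: parallel files share a file name across the two directories.
    """
    manifest = {}
    for noisy_path in sorted(Path(noisy_dir).glob("*.wav")):
        clean_path = Path(clean_dir) / noisy_path.name
        if not clean_path.exists():
            continue  # no parallel clean reference, so the pair is unusable
        duration = wav_duration_seconds(noisy_path)
        if not (min_dur <= duration <= max_dur):
            continue  # duration-based filtering
        manifest[noisy_path.stem] = {
            "noisy_wav": str(noisy_path),
            "clean_wav": str(clean_path),
            "length": duration,
        }
    return manifest


def save_manifest(manifest, path):
    """Write the manifest as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2)


def write_silence(path, seconds, rate=16000):
    """Write a silent 16-bit mono WAV file (helper for trying the sketch)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

Calling `build_manifest("noisy/", "clean/")` and passing the result to `save_manifest` would produce one manifest per data split. Utterances without a clean counterpart, or outside the duration limits, are dropped.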
Step 2: Enhancement Architecture Selection
Select and configure the enhancement architecture via HyperPyYAML. SpeechBrain provides several approaches; the four covered in this workflow each use a distinct model architecture:
- MetricGAN/MetricGAN-U: A generator-discriminator pair in which the discriminator learns to predict perceptual metrics. The generator processes spectral features and predicts enhanced magnitude spectra.
- Spectral masking: A neural network (e.g., BLSTM, CNNTransformer, 2D-FCN) predicts a time-frequency mask that is multiplied element-wise with the noisy spectrogram to obtain the enhanced spectrogram.
- Waveform mapping: A fully convolutional network operates directly on the raw waveform, learning a mapping from noisy to clean speech without spectral decomposition.
- SepFormer: The dual-path transformer architecture (also used for separation), applied to single-channel enhancement, particularly on DNS Challenge data.
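To make the spectral masking idea concrete, the sketch below applies a mask to a tiny complex spectrogram stored as nested lists. It assumes that the mask scales only the magnitude and that the noisy phase is reused, which is a common choice but is not stated in this workflow. Real recipes operate on tensors, and `apply_mask` is a hypothetical name.

```python
import cmath


def apply_mask(noisy_stft, mask):
    """Scale each noisy STFT bin's magnitude by the mask, keeping its phase.

    noisy_stft: list of frames, each a list of complex frequency bins.
    mask:       same shape, real values (typically in [0, 1]).
    """
    enhanced = []
    for frame_bins, frame_mask in zip(noisy_stft, mask):
        enhanced.append([
            # New magnitude = mask * |X|; phase is copied from the noisy bin.
            cmath.rect(m * abs(x), cmath.phase(x))
            for x, m in zip(frame_bins, frame_mask)
        ])
    return enhanced


# Two frames with two frequency bins each.
noisy = [[3 + 4j, 1j], [-2 + 0j, 1 + 1j]]
mask = [[0.5, 1.0], [0.0, 0.25]]
enhanced = apply_mask(noisy, mask)
```

A mask value of 1.0 passes a bin unchanged, 0.0 removes it, and intermediate values attenuate it, so the network only has to decide how much of each time-frequency bin is speech.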
Step 3: GAN Training With Sub-stage Management (MetricGAN)
For MetricGAN-based enhancement, training uses a custom Brain subclass with adversarial sub-stage management. Training cycles through three sub-stages, and fit_batch() updates whichever network the active sub-stage selects: the discriminator on historical enhanced samples (HISTORICAL), the discriminator on current batch outputs (CURRENT), or the generator, which is trained to fool the discriminator (GENERATOR). A history buffer stores previously enhanced utterances, which stabilizes discriminator training.
Key considerations:
- The discriminator is trained on both current and historical enhanced samples
- The generator loss combines adversarial loss (fooling the discriminator) with optional reconstruction loss
- MetricGAN-U extends this with non-intrusive (reference-free) metrics, so the discriminator's target scores can be computed without a clean reference
- Using separate optimizers for the generator and the discriminator requires custom init_optimizers() and zero_grad() methods
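The sub-stage bookkeeping can be illustrated with a toy trainer that only records which network each step would update. This is a sketch under stated assumptions, not the SpeechBrain recipe: `ToyMetricGANTrainer`, `fit_epoch`, the order in which sub-stages run, and the fixed-size list used as the history buffer are all hypothetical, and no real model or optimizer is involved.

```python
from enum import Enum, auto


class SubStage(Enum):
    HISTORICAL = auto()  # discriminator on previously enhanced samples
    CURRENT = auto()     # discriminator on current batch outputs
    GENERATOR = auto()   # generator update


class ToyMetricGANTrainer:
    """Toy stand-in for a Brain subclass; logs steps instead of training."""

    def __init__(self, history_size=4):
        self.sub_stage = SubStage.CURRENT
        self.history = []              # IDs of previously enhanced utterances
        self.history_size = history_size
        self.log = []                  # (sub-stage, updated network, utt_id)

    def fit_batch(self, utt_id):
        """Update the network selected by the active sub-stage."""
        if self.sub_stage == SubStage.GENERATOR:
            # A real implementation would step the generator optimizer here.
            self.log.append(("GENERATOR", "generator", utt_id))
            return
        if self.sub_stage == SubStage.CURRENT:
            # Keep this enhanced output for replay in later epochs.
            self.history.append(utt_id)
            self.history = self.history[-self.history_size:]
        # A real implementation would step the discriminator optimizer here.
        self.log.append((self.sub_stage.name, "discriminator", utt_id))

    def fit_epoch(self, utt_ids):
        """Run the three sub-stages in sequence (order is an assumption)."""
        replay = list(self.history)  # snapshot taken before this epoch
        schedule = (
            (SubStage.HISTORICAL, replay),
            (SubStage.CURRENT, utt_ids),
            (SubStage.GENERATOR, utt_ids),
        )
        for stage, ids in schedule:
            self.sub_stage = stage
            for utt_id in ids:
                self.fit_batch(utt_id)


trainer = ToyMetricGANTrainer(history_size=2)
trainer.fit_epoch(["a", "b", "c"])  # history is empty, so no HISTORICAL steps
trainer.fit_epoch(["d"])            # replays "b" and "c" before the new batch
```

Because the two networks are updated in different sub-stages, each needs its own optimizer. In this sketch that would correspond to one optimizer step in each branch of `fit_batch`.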
Step 4: Conventional Training (Non-GAN Methods)
For spectral masking, waveform mapping, and SepFormer, training follows the standard Brain pattern with compute_forward() and compute_objectives(). The forward pass transforms noisy input through the model, and the loss measures the distance between enhanced and clean signals. STFT-domain losses (L1 on magnitude) are used for spectral methods, while time-domain losses (L1 or SI-SNR) are used for waveform methods.
Key considerations:
- Spectral mask methods apply STFT, predict mask, apply mask, then inverse STFT
- Waveform methods process raw audio end-to-end
- Loss functions can combine time-domain and frequency-domain components
- Gradient clipping helps prevent instability during early training
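The two time-domain losses mentioned above can be written out in a few lines. The sketch below uses plain Python lists in place of tensors and follows the standard SI-SNR definition (zero-mean signals, projection of the estimate onto the target). The function names and the `eps` value are illustrative, and a training recipe would use a batched, differentiable implementation instead.

```python
import math


def l1_loss(estimate, target):
    """Mean absolute error between two equal-length sample sequences."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]  # remove DC offset
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    projection = [(dot / tgt_energy) * t for t in tgt]

    # Whatever is left after the projection counts as noise.
    noise = [e - p for e, p in zip(est, projection)]
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((proj_energy + eps) / (noise_energy + eps))


def si_snr_loss(estimate, target):
    """Negated SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    return -si_snr(estimate, target)


clean = [1.0, -1.0, 1.0, -1.0]
residual = [0.1, 0.1, -0.1, -0.1]  # orthogonal to `clean`
enhanced = [c + r for c, r in zip(clean, residual)]
score = si_snr(enhanced, clean)
```

In this example the signal energy is 4 and the residual energy is 0.04, so `score` is about 20 dB. Rescaling `enhanced` by any nonzero constant leaves the score unchanged, which is the property that distinguishes SI-SNR from an L1 loss.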
Step 5: Perceptual Quality Evaluation
Evaluate the trained model using standard speech quality metrics. The enhancement pipeline processes each noisy test utterance and compares the output against the clean reference. Evaluation metrics include PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), composite measures (CSIG, CBAK, COVL), and optionally DNSMOS for non-intrusive quality estimation.
Key considerations:
- PESQ and STOI require clean reference signals (intrusive metrics)
- DNSMOS provides reference-free quality estimation using a neural network
- Composite measures predict ratings for signal distortion (CSIG), background noise intrusiveness (CBAK), and overall quality (COVL)
- Evaluation scripts compute all metrics and write detailed per-utterance results
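The per-utterance reporting can be sketched as a small loop that scores each test pair with a set of metric functions and writes one row per utterance plus an average. The metric used here is a mean-squared-error stand-in so the example runs with the standard library alone. In practice, PESQ, STOI, or composite-measure implementations would be wrapped to the same assumed `(clean, enhanced)` signature. All names and the CSV layout are hypothetical.

```python
import csv
import io
from statistics import mean


def mse(clean, enhanced):
    """Stand-in intrusive metric: mean squared error against the reference."""
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / len(clean)


def evaluate(triples, metrics):
    """Score every (utt_id, enhanced, clean) triple with every metric."""
    rows = []
    for utt_id, enhanced, clean in triples:
        row = {"utt_id": utt_id}
        for name, metric_fn in metrics.items():
            row[name] = metric_fn(clean, enhanced)
        rows.append(row)
    # Average each metric over the whole test set.
    summary = {name: mean(row[name] for row in rows) for name in metrics}
    return rows, summary


def write_results(rows, summary, stream):
    """Write per-utterance scores followed by an AVERAGE row as CSV."""
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    writer.writerow({"utt_id": "AVERAGE", **summary})


test_set = [
    ("u1", [1.0, 0.0], [1.0, 0.0]),  # enhanced output equals the reference
    ("u2", [0.0, 0.0], [1.0, 1.0]),  # enhanced output is silent
]
rows, summary = evaluate(test_set, {"mse": mse})
report = io.StringIO()
write_results(rows, summary, report)
```

Keeping the metrics in a dictionary of callables makes it straightforward to add a reference-free metric later, provided its wrapper ignores the `clean` argument.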