Workflow:Speechbrain Speechbrain Speech Enhancement Training

Knowledge Sources	SpeechBrain SpeechBrain Docs
Domains	Speech_Enhancement, Speech_Processing, Deep_Learning
Last Updated	2026-02-09 19:00 GMT

Overview

End-to-end process for training speech enhancement models in SpeechBrain, using MetricGAN, spectral masking, or waveform mapping to remove noise and improve speech quality.

Description

This workflow covers the procedure for training speech enhancement systems that remove noise or reverberation from degraded speech signals. SpeechBrain provides several enhancement approaches: MetricGAN/MetricGAN-U (GAN-based optimization of perceptual metrics such as PESQ), spectral masking (predicting time-frequency masks), waveform mapping (direct waveform-domain processing), SEGAN (adversarial waveform enhancement), and SepFormer (a dual-path transformer also used for source separation). MetricGAN is distinctive in that its discriminator learns to predict a perceptual quality score, giving the generator a differentiable surrogate for metrics that cannot be optimized directly, while spectral masking and waveform methods use conventional regression losses. The workflow is demonstrated on the Voicebank-DEMAND dataset and on DNS Challenge data.

Usage

Execute this workflow when you have noisy or reverberant speech recordings and need to train a model that produces clean, enhanced audio. Typical uses are preprocessing noisy audio before ASR, improving audio quality in communication systems, and building enhancement components for hearing aids and conferencing applications. The choice of method depends on the goal: MetricGAN for optimizing PESQ or STOI directly, spectral masking for general denoising, or SepFormer for high-quality enhancement on larger datasets such as DNS.

Execution Steps

Step 1: Noisy Speech Data Preparation

Prepare parallel clean-noisy speech data from the target dataset. For the Voicebank-DEMAND corpus, the preparation script parses the directory structure to create manifest files mapping each noisy utterance to its clean reference. For DNS Challenge data, the preparation involves downloading, decompressing, synthesizing noisy-clean pairs using the noise synthesizer, and optionally creating WebDataset shards for efficient large-scale loading.

Key considerations:

Parallel data requires matched clean and noisy versions of each utterance
The Voicebank preparation handles the standard noisy-clean directory structure
DNS data preparation involves a multi-step pipeline: download, decompress, synthesize, shard
Duration-based filtering removes very short or very long utterances
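
The pairing and filtering described above can be sketched in plain Python. This is a minimal illustration, not the actual SpeechBrain preparation script: the function name build_manifest, the manifest keys (noisy_wav, clean_wav, length), the duration thresholds, and the file paths are all assumptions made for the example.

```python
import json


def build_manifest(noisy_files, clean_files, durations, min_dur=1.0, max_dur=15.0):
    """Pair noisy and clean files by utterance ID and filter by duration.

    noisy_files / clean_files: dicts mapping utterance ID -> file path.
    durations: dict mapping utterance ID -> length in seconds.
    All names and thresholds here are hypothetical.
    """
    manifest = {}
    for utt_id in sorted(noisy_files):
        # Parallel data: skip utterances that have no matching clean reference.
        if utt_id not in clean_files:
            continue
        dur = durations[utt_id]
        # Duration-based filtering of very short or very long utterances.
        if not (min_dur <= dur <= max_dur):
            continue
        manifest[utt_id] = {
            "noisy_wav": noisy_files[utt_id],
            "clean_wav": clean_files[utt_id],
            "length": dur,
        }
    return manifest


# Toy inputs: one valid pair, one too-short utterance, one without a clean file.
noisy = {
    "p226_001": "noisy/p226_001.wav",
    "p226_002": "noisy/p226_002.wav",
    "p226_003": "noisy/p226_003.wav",
}
clean = {
    "p226_001": "clean/p226_001.wav",
    "p226_002": "clean/p226_002.wav",
}
durs = {"p226_001": 3.2, "p226_002": 0.4, "p226_003": 5.0}

manifest = build_manifest(noisy, clean, durs)
print(json.dumps(manifest, indent=2))
```

Keying the manifest by utterance ID keeps each noisy file tied to its clean reference, so a missing or filtered utterance drops both halves of the pair together.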

Step 2: Enhancement Architecture Selection

Select and configure the enhancement architecture via HyperPyYAML. This workflow covers four approaches, each with a distinct model architecture:

MetricGAN/MetricGAN-U: Generator-discriminator pair where the discriminator learns to predict perceptual metrics. The generator processes spectral features and predicts enhanced magnitude spectra.

Spectral masking: A neural network (for example a BLSTM, CNNTransformer, or 2D-FCN) predicts a time-frequency mask that is multiplied element-wise with the noisy spectrogram to obtain the enhanced spectrogram.

Waveform mapping: A fully convolutional network operates directly on the raw waveform, learning a mapping from noisy to clean speech without spectral decomposition.

SepFormer: The dual-path transformer architecture (also used for separation) applied to single-channel enhancement, particularly on DNS Challenge data.
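
To make the spectral masking step concrete, the sketch below applies a mask to a toy magnitude spectrogram using nested Python lists. It only illustrates the element-wise multiplication; the function name apply_mask and the toy values are assumptions, and a real system would operate on tensors produced by an STFT and a trained network.

```python
def apply_mask(noisy_mag, mask):
    """Multiply a time-frequency mask element-wise with a magnitude spectrogram.

    Both arguments are lists of frames; each frame is a list of frequency bins.
    """
    if len(noisy_mag) != len(mask):
        raise ValueError("mask and spectrogram must have the same number of frames")
    enhanced = []
    for frame, mask_frame in zip(noisy_mag, mask):
        if len(frame) != len(mask_frame):
            raise ValueError("mask and spectrogram must have the same number of bins")
        # A mask value near 1 keeps the bin; a value near 0 suppresses it.
        enhanced.append([m * x for x, m in zip(frame, mask_frame)])
    return enhanced


# Two frames with three frequency bins each (toy values).
noisy_mag = [[1.0, 4.0, 2.0], [0.5, 3.0, 2.0]]
mask = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.25]]

enhanced = apply_mask(noisy_mag, mask)
print(enhanced)
```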

Step 3: GAN Training With Sub-stage Management (MetricGAN)

For MetricGAN-based enhancement, training involves a custom Brain subclass with adversarial sub-stage management. The fit_batch() method alternates between three sub-stages: training the discriminator on historical enhanced samples (HISTORICAL), training the discriminator on current batch outputs (CURRENT), and training the generator to fool the discriminator (GENERATOR). A history buffer stores previously enhanced utterances for discriminator stabilization.

Key considerations:

The discriminator is trained on both current and historical enhanced samples
The generator loss combines adversarial loss (fooling the discriminator) with optional reconstruction loss
MetricGAN-U extends this by using unsupervised metrics (no clean reference is needed for the discriminator)
Multiple optimizers require custom init_optimizers() and zero_grad() methods
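
The sub-stage scheduling can be sketched as a small state machine. This is a toy illustration under stated assumptions, not the SpeechBrain recipe: the class ToyMetricGANLoop, the run_epoch helper, the history size, and the ordering of the sub-stages within an epoch are all hypothetical, and no networks or optimizers are involved. Each call to fit_batch() only records which model would be updated.

```python
from collections import deque
from enum import Enum, auto


class SubStage(Enum):
    HISTORICAL = auto()
    CURRENT = auto()
    GENERATOR = auto()


class ToyMetricGANLoop:
    """Records which model would be updated in each sub-stage (no real training)."""

    def __init__(self, history_size=4):
        self.sub_stage = SubStage.CURRENT
        # Bounded buffer of previously enhanced utterances.
        self.history = deque(maxlen=history_size)
        self.log = []

    def fit_batch(self, batch):
        if self.sub_stage == SubStage.CURRENT:
            # Discriminator step on freshly enhanced output; also store it.
            self.history.append(f"enhanced({batch})")
            self.log.append(("discriminator", "current", batch))
        elif self.sub_stage == SubStage.HISTORICAL:
            # Discriminator step on a sample replayed from the history buffer.
            self.log.append(("discriminator", "historical", batch))
        else:
            # Generator step: try to raise the discriminator's predicted score.
            self.log.append(("generator", "generator", batch))

    def run_epoch(self, batches):
        # The ordering below is an assumption made for this sketch.
        self.sub_stage = SubStage.CURRENT
        for batch in batches:
            self.fit_batch(batch)
        self.sub_stage = SubStage.HISTORICAL
        for old in list(self.history):
            self.fit_batch(old)
        self.sub_stage = SubStage.GENERATOR
        for batch in batches:
            self.fit_batch(batch)


loop = ToyMetricGANLoop(history_size=2)
loop.run_epoch(["a", "b", "c"])
for entry in loop.log:
    print(entry)
```

Bounding the history buffer keeps the replay cost fixed per epoch; in a real implementation each branch would use its own optimizer, which is why custom init_optimizers() and zero_grad() methods are needed.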

Step 4: Conventional Training (Non-GAN Methods)

For spectral masking, waveform mapping, and SepFormer, training follows the standard Brain pattern with compute_forward() and compute_objectives(). The forward pass transforms noisy input through the model, and the loss measures the distance between enhanced and clean signals. STFT-domain losses (L1 on magnitude) are used for spectral methods, while time-domain losses (L1 or SI-SNR) are used for waveform methods.

Key considerations:

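Spectral masking methods apply the STFT, predict a mask, apply it to the noisy spectrogram, then reconstruct the waveform with the inverse STFT (typically reusing the noisy phase)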
Waveform methods process raw audio end-to-end
Loss functions can combine time-domain and frequency-domain components
Gradient clipping helps prevent instability during early training
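
As an illustration of the time-domain losses mentioned above, the sketch below computes an L1 loss and the scale-invariant signal-to-noise ratio (SI-SNR) on plain Python lists. It follows the standard SI-SNR definition (zero-mean both signals, project the estimate onto the target, compare projection energy to residual energy); the function names and the eps value are assumptions, and SpeechBrain's own loss functions operate on batched tensors instead.

```python
import math


def l1_loss(estimate, target):
    """Mean absolute error between two equal-length waveforms."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def si_snr_db(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB; higher is better."""
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to isolate the target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Whatever remains after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


clean = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
est_good = [c + n for c, n in zip(clean, noise)]
est_bad = [c + 3.0 * n for c, n in zip(clean, noise)]

print("L1 (good):", l1_loss(est_good, clean))
print("SI-SNR (good):", si_snr_db(est_good, clean))
print("SI-SNR (bad):", si_snr_db(est_bad, clean))
# As a training loss, the negative SI-SNR would be minimized.
```

Because of the projection step, rescaling the estimate leaves SI-SNR essentially unchanged, which is why it is often preferred over plain SNR for waveform models whose output gain is not constrained.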

Step 5: Perceptual Quality Evaluation

Evaluate the trained model using standard speech quality metrics. The enhancement pipeline processes each noisy test utterance and compares the output against the clean reference. Evaluation metrics include PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), composite measures (CSIG, CBAK, COVL), and optionally DNSMOS for non-intrusive quality estimation.

Key considerations:

PESQ and STOI require clean reference signals (intrusive metrics)
DNSMOS provides reference-free quality estimation using a neural network
Composite measures aggregate signal quality, background noise, and overall quality
Evaluation scripts compute all metrics and write detailed per-utterance results
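
The evaluation loop can be sketched as follows. This is a minimal illustration with hypothetical names (evaluate, format_report, snr_db): the metric used here is a plain SNR in dB, included only as a dependency-free stand-in for an intrusive metric. It is not PESQ, STOI, or a composite measure, which need dedicated implementations.

```python
import math


def snr_db(enhanced, clean, eps=1e-8):
    """Stand-in intrusive metric: plain SNR in dB (NOT PESQ or STOI)."""
    signal = sum(c * c for c in clean)
    noise = sum((e - c) ** 2 for e, c in zip(enhanced, clean))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def evaluate(test_set, enhance_fn, metrics):
    """Score every test utterance and average each metric.

    test_set: list of (utt_id, noisy, clean) tuples.
    enhance_fn: callable mapping a noisy waveform to an enhanced one.
    metrics: dict mapping metric name -> callable(enhanced, clean).
    """
    per_utt = {}
    for utt_id, noisy, clean in test_set:
        enhanced = enhance_fn(noisy)
        per_utt[utt_id] = {name: fn(enhanced, clean) for name, fn in metrics.items()}
    summary = {
        name: sum(scores[name] for scores in per_utt.values()) / len(per_utt)
        for name in metrics
    }
    return per_utt, summary


def format_report(per_utt, summary):
    """Render per-utterance rows followed by one line of averages."""
    lines = []
    for utt_id in sorted(per_utt):
        row = ", ".join(f"{k}={v:.2f}" for k, v in per_utt[utt_id].items())
        lines.append(f"{utt_id}: {row}")
    avg = ", ".join(f"{k}={v:.2f}" for k, v in summary.items())
    lines.append(f"AVERAGE: {avg}")
    return "\n".join(lines)


test_set = [
    ("utt1", [1.1, -0.9, 1.1, -0.9], [1.0, -1.0, 1.0, -1.0]),
    ("utt2", [1.2, -0.8, 1.2, -0.8], [1.0, -1.0, 1.0, -1.0]),
]
# Identity "model": scores the unprocessed noisy input as a baseline.
per_utt, summary = evaluate(test_set, lambda wav: wav, {"snr_db": snr_db})
print(format_report(per_utt, summary))
```

Passing the metrics as a dictionary of callables means a reference-free metric could be added by wrapping it in a function that ignores the clean argument.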
