Principle:Speechbrain Speechbrain Conventional Enhancement Training
| Property | Value |
|---|---|
| Principle Name | Conventional_Enhancement_Training |
| Workflow | Speech_Enhancement_Training |
| Domains | Speech_Enhancement, Deep_Learning |
| Source Repository | speechbrain/speechbrain |
| Related Implementation | Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward |
Overview
Conventional Enhancement Training refers to training speech enhancement models using standard supervised regression objectives, without adversarial (GAN) components. The model directly minimizes the distance between enhanced and clean speech in either the spectral or waveform domain. This approach is simpler, faster to train, and more stable than GAN-based methods, while still achieving competitive enhancement quality.
Theoretical Background
Spectral Mask Approach
The spectral mask approach is the most common conventional enhancement paradigm. It operates in the Short-Time Fourier Transform (STFT) domain through the following pipeline:
1. Compute STFT of noisy waveform: noisy_spec = STFT(noisy_wav)
2. Extract compressed spectral magnitude: noisy_mag = |noisy_spec|^0.5 (the square root of the magnitude)
3. Apply log compression: noisy_feats = log1p(noisy_mag)
4. Predict mask via neural network: mask = model(noisy_feats)
5. Apply mask to noisy features: enhanced_feats = mask * noisy_feats
6. Reconstruct waveform via ISTFT, reusing the phase of the noisy signal: enhanced_wav = ISTFT(enhanced_feats, noisy_wav)
The key idea is that the model predicts a multiplicative mask with values in [0, 1] (enforced by a Sigmoid activation). A value near 1 in a time-frequency bin means "keep this component" (likely speech), while a value near 0 means "suppress this component" (likely noise).
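The six steps above can be sketched end-to-end in NumPy. The stft/istft helpers and the sigmoid applied to the features (standing in for the mask network) are illustrative assumptions, not SpeechBrain's actual API:

```python
import numpy as np

def stft(wav, n_fft=512, hop=256, win=None):
    """Frame the signal with a Hamming window and take an FFT per frame."""
    if win is None:
        win = np.hamming(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)              # shape: (frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=256, length=None, win=None):
    """Weighted overlap-add inverse of the stft() above."""
    if win is None:
        win = np.hamming(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)
    if length is not None:
        out = out[:length] if len(out) >= length else np.pad(out, (0, length - len(out)))
    return out

# Steps 1-6, with a sigmoid of the features standing in for the mask network
rng = np.random.default_rng(0)
noisy_wav = rng.standard_normal(16000)               # 1 s of synthetic "noisy" audio
noisy_spec = stft(noisy_wav)                         # 1. STFT
noisy_mag = np.abs(noisy_spec) ** 0.5                # 2. compressed magnitude
noisy_feats = np.log1p(noisy_mag)                    # 3. log compression
mask = 1.0 / (1.0 + np.exp(-noisy_feats))            # 4. stand-in for model(noisy_feats), in [0, 1]
enhanced_feats = mask * noisy_feats                  # 5. mask the features
enhanced_mag = np.expm1(enhanced_feats) ** 2         # undo log1p and the 0.5 power
phase = np.angle(noisy_spec)                         # 6. reuse the noisy phase for ISTFT
enhanced_wav = istft(enhanced_mag * np.exp(1j * phase), length=len(noisy_wav))
```

Note that step 6 reconstructs the waveform from the enhanced magnitude and the original noisy phase, which is why the ISTFT step takes the noisy waveform as a second input in the pipeline above.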
Signal Approximation (SA)
The SpeechBrain implementation uses Signal Approximation (SA) rather than the more common Ideal Ratio Mask (IRM). In SA, the mask is applied to the log-compressed spectral magnitude and the loss is computed in the same domain:
SA Loss: L = MSE(mask * log1p(|noisy_spec|^0.5), log1p(|clean_spec|^0.5))
This differs from IRM-based training where the mask is applied to the raw spectrogram and the loss may be computed in the waveform domain. SA has been shown to provide good performance with simpler training dynamics.
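The SA objective can be written directly from the formula above. This is a NumPy sketch with synthetic magnitudes, not the SpeechBrain loss implementation:

```python
import numpy as np

def sa_loss(mask, noisy_spec, clean_spec):
    """Signal Approximation loss: MSE computed in the log1p(|.|^0.5) domain."""
    noisy_feats = np.log1p(np.abs(noisy_spec) ** 0.5)
    clean_feats = np.log1p(np.abs(clean_spec) ** 0.5)
    return np.mean((mask * noisy_feats - clean_feats) ** 2)

# With an oracle mask (clean feats / noisy feats) the loss is numerically zero;
# the spectrogram values below are synthetic, not from real audio.
rng = np.random.default_rng(0)
noisy = np.abs(rng.standard_normal((10, 257))) + 1e-3   # fake noisy magnitudes
clean = 0.5 * noisy                                      # fake clean magnitudes
oracle = np.log1p(clean ** 0.5) / np.log1p(noisy ** 0.5)
loss = sa_loss(oracle, noisy, clean)                     # near machine epsilon
```

Because clean ≤ noisy here, the oracle mask stays in [0, 1], matching the Sigmoid constraint on the model's output.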
Waveform Mapping Approach
The alternative waveform mapping approach bypasses spectral decomposition entirely:
enhanced_wav = model(noisy_wav)
L = MSE(enhanced_wav, clean_wav)
This approach is conceptually simpler and avoids phase estimation issues, but requires the model to implicitly learn the spectral structure of speech. The SpeechBrain implementation supports switching between spectral mask and waveform targets via a single configuration flag:
waveform_target: False # Set to True for waveform-domain loss
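A minimal sketch of how such a flag can switch the regression target between domains; the function and variable names here are illustrative, not the recipe's actual code:

```python
import numpy as np

def enhancement_loss(predict, clean_wav, clean_feats, waveform_target=False):
    """Switch the MSE target between the waveform and spectral-feature domains."""
    if waveform_target:
        # predict is an enhanced waveform, compared sample-by-sample
        return np.mean((predict - clean_wav) ** 2)
    # predict is a masked log-spectral feature map
    return np.mean((predict - clean_feats) ** 2)

clean_wav = np.zeros(160)
clean_feats = np.zeros((10, 257))
spec_loss = enhancement_loss(clean_feats + 0.1, clean_wav, clean_feats)
wav_loss = enhancement_loss(clean_wav + 0.1, clean_wav, clean_feats,
                            waveform_target=True)
```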
Loss Functions
Conventional enhancement training in SpeechBrain primarily uses:
- MSE Loss (speechbrain.nnet.losses.mse_loss): Mean Squared Error between enhanced and clean spectrograms (or waveforms). This is the default and most stable choice.
- STOI Loss (speechbrain.nnet.loss.stoi_loss.stoi_loss): A differentiable approximation of the Short-Time Objective Intelligibility metric. Can be used as an alternative training objective when waveform_target is True.
The MSE spectral loss provides smooth gradients and stable convergence. While it does not directly optimize perceptual quality metrics, it serves as a reliable proxy that generally improves PESQ and STOI scores.
Standard Brain.fit() Training Loop
Conventional enhancement training uses SpeechBrain's standard Brain.fit() training loop, which provides:
- Epoch iteration: Loop over training epochs with early stopping
- Batch processing: For each batch, call compute_forward() then compute_objectives()
- Gradient computation: Automatic backpropagation through the loss
- Optimizer step: Single optimizer updates model parameters
- Validation: Periodic evaluation on validation set with PESQ/STOI metrics
- Checkpointing: Save best model based on validation PESQ
This is significantly simpler than GAN-based training, which requires custom fit_batch() with dual optimizers and multiple sub-stages.
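The loop can be schematized in plain Python. This is a deliberate simplification with illustrative names; the real Brain.fit() also handles logging, checkpointing objects, devices, and data loading:

```python
def fit_sketch(epochs, train_batches, valid_batches,
               compute_forward, compute_objectives, backward_and_step,
               evaluate_pesq):
    """Schematic of the Brain.fit() flow for conventional enhancement."""
    best_pesq = float("-inf")
    for _epoch in range(epochs):
        for batch in train_batches:                        # batch processing
            predictions = compute_forward(batch)           # forward pass
            loss = compute_objectives(predictions, batch)  # single MSE-style loss
            backward_and_step(loss)                        # backprop + one optimizer step
        pesq = evaluate_pesq(valid_batches)                # periodic validation
        if pesq > best_pesq:                               # keep best checkpoint by PESQ
            best_pesq = pesq
    return best_pesq

# Dummy callables, just to exercise the control flow
history = []
best = fit_sketch(
    epochs=3,
    train_batches=[1, 2],
    valid_batches=[3],
    compute_forward=lambda b: b * 2,
    compute_objectives=lambda p, b: p - b,
    backward_and_step=lambda loss: history.append(loss),
    evaluate_pesq=lambda v: len(history) / 10.0,
)
```

The single forward/loss/step sequence per batch is exactly what makes this loop easy to debug compared with the multi-stage GAN variant.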
Validation Monitoring
During validation and testing, perceptual metrics are computed but not used as training objectives:
- PESQ is computed using the external pesq package for monitoring
- STOI is computed using SpeechBrain's differentiable STOI loss
- The best checkpoint is selected based on validation PESQ (max_keys=["pesq"])
This means the model is optimized for spectral MSE but selected for perceptual quality, combining the stability of MSE training with perceptual quality-based model selection.
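The selection rule amounts to taking the maximum over validation PESQ. The checkpoint records below are made-up numbers used only to illustrate max_keys=["pesq"]-style selection:

```python
# Hypothetical validation results (not from a real run)
checkpoints = [
    {"epoch": 10, "loss": 0.021, "pesq": 2.41},
    {"epoch": 20, "loss": 0.018, "pesq": 2.63},
    {"epoch": 30, "loss": 0.016, "pesq": 2.58},  # lowest loss, but worse PESQ
]

# max_keys=["pesq"]-style selection: keep the checkpoint with the highest PESQ,
# even though training itself minimized spectral MSE
best = max(checkpoints, key=lambda ckpt: ckpt["pesq"])
```

Note that the epoch with the lowest MSE is not the one kept, which is precisely the point of selecting on a perceptual metric.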
Key Design Decisions
- Log-compressed spectral features: Using log1p(|STFT|^0.5) compresses the dynamic range of the spectrogram, making the MSE loss more perceptually relevant (since human hearing is approximately logarithmic in amplitude)
- Sigmoid mask constraint: The [0, 1] mask constraint ensures the model can only attenuate spectral components, not amplify them. This prevents the model from introducing artifacts by boosting certain frequencies.
- Phase reuse: The waveform is reconstructed by combining the enhanced magnitude with the original noisy phase. While this introduces some phase mismatch, it avoids the difficulty of phase estimation and works well in practice.
- Ascending sort for efficiency: Training data is sorted by utterance length in ascending order, which minimizes padding waste in batched training.
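Two of these decisions are easy to verify numerically. The magnitudes and mask logits below are illustrative values, not real audio statistics:

```python
import numpy as np

# Log compression: log1p(|x|^0.5) shrinks the spread between strong and weak bins
mags = np.array([1e-4, 1e-2, 1.0, 100.0])   # six orders of magnitude apart
feats = np.log1p(mags ** 0.5)
raw_ratio = mags.max() / mags.min()          # ~1e6
feat_ratio = feats.max() / feats.min()       # a few hundred: far more uniform

# Attenuation-only masking: sigmoid outputs in [0, 1] can only scale bins down
logits = np.array([-5.0, 0.0, 5.0])
mask = 1.0 / (1.0 + np.exp(-logits))
masked = mask * feats[:3]                    # never exceeds the unmasked features
```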
Training Configuration
Typical hyperparameters for conventional enhancement training:
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 50 | Converges faster than GAN-based (750 epochs) |
| Batch size | 8 | Larger batches are feasible without per-sample PESQ computation |
| Learning rate | 0.0001 | Standard Adam learning rate |
| FFT size | 512 | 32 ms window at 16 kHz |
| Hop length | 16 ms | 50% overlap |
| Window | Hamming | Standard choice for speech processing |
| Loss | MSE | Spectral domain MSE (default) |
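The STFT rows of the table are mutually consistent at a 16 kHz sample rate, which a few lines of arithmetic confirm (all numbers come straight from the table):

```python
# Sanity-check the STFT parameters at 16 kHz
sr = 16000                              # sample rate in Hz
n_fft = 512                             # FFT size from the table
window_ms = n_fft / sr * 1000           # 512 samples -> 32.0 ms window
hop_samples = int(sr * 16 / 1000)       # 16 ms hop -> 256 samples
overlap = 1 - hop_samples / n_fft       # 0.5, i.e. 50% overlap
```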
Comparison with GAN-Based Training
| Aspect | Conventional | GAN-Based (MetricGAN+) |
|---|---|---|
| Training stability | High (convex-like loss landscape) | Lower (adversarial dynamics) |
| Training speed | Fast (50 epochs, large batches) | Slow (750 epochs, batch_size=1) |
| Implementation complexity | Low (standard Brain.fit) | High (custom fit_batch, sub-stages) |
| PESQ optimization | Indirect (via MSE proxy) | Direct (discriminator predicts PESQ) |
| Peak PESQ performance | Good | Better (specifically optimized) |
| Debugging ease | Easy (single loss, single optimizer) | Hard (multi-stage, dual optimizer) |
See Also
- Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward -- The concrete implementation of conventional enhancement training
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- The alternative GAN-based training approach
- Principle:Speechbrain_Speechbrain_Enhancement_Architecture_Selection -- How different architectures are selected for conventional training
- Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation -- Metrics used to evaluate trained models