Principle:Speechbrain Speechbrain Conventional Enhancement Training
| Property | Value |
|---|---|
| Principle Name | Conventional_Enhancement_Training |
| Workflow | Speech_Enhancement_Training |
| Domains | Speech_Enhancement, Deep_Learning |
| Source Repository | speechbrain/speechbrain |
| Related Implementation | Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward |
Overview
Conventional Enhancement Training refers to training speech enhancement models using standard supervised regression objectives, without adversarial (GAN) components. The model directly minimizes the distance between enhanced and clean speech in either the spectral or waveform domain. This approach is simpler, faster to train, and more stable than GAN-based methods, while still achieving competitive enhancement quality.
Theoretical Background
Spectral Mask Approach
The spectral mask approach is the most common conventional enhancement paradigm. It operates in the Short-Time Fourier Transform (STFT) domain through the following pipeline:
1. Compute STFT of noisy waveform: noisy_spec = STFT(noisy_wav)
2. Extract compressed spectral magnitude: noisy_mag = |noisy_spec|^0.5 (the square root of the magnitude)
3. Apply log compression: noisy_feats = log1p(noisy_mag)
4. Predict mask via neural network: mask = model(noisy_feats)
5. Apply mask to noisy features: enhanced_feats = mask * noisy_feats
6. Reconstruct waveform via ISTFT, reusing the phase of the noisy signal: enhanced_wav = ISTFT(enhanced_feats, noisy_wav)
The key idea is that the model predicts a multiplicative mask with values in [0, 1] (enforced by a Sigmoid activation). A value near 1 in a time-frequency bin means "keep this component" (likely speech), while a value near 0 means "suppress this component" (likely noise).
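The six steps above can be sketched end-to-end in NumPy. The stft/istft helpers and the sigmoid applied to the features (standing in for the mask network) are illustrative assumptions, not SpeechBrain's actual API:

```python
import numpy as np

def stft(wav, n_fft=512, hop=256, win=None):
    """Frame the signal with a Hamming window and take an FFT per frame."""
    if win is None:
        win = np.hamming(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)              # shape: (frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=256, length=None, win=None):
    """Weighted overlap-add inverse of the stft() above."""
    if win is None:
        win = np.hamming(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)
    if length is not None:
        out = out[:length] if len(out) >= length else np.pad(out, (0, length - len(out)))
    return out

# Steps 1-6, with a sigmoid of the features standing in for the mask network
rng = np.random.default_rng(0)
noisy_wav = rng.standard_normal(16000)               # 1 s of synthetic "noisy" audio
noisy_spec = stft(noisy_wav)                         # 1. STFT
noisy_mag = np.abs(noisy_spec) ** 0.5                # 2. compressed magnitude
noisy_feats = np.log1p(noisy_mag)                    # 3. log compression
mask = 1.0 / (1.0 + np.exp(-noisy_feats))            # 4. stand-in for model(noisy_feats), in [0, 1]
enhanced_feats = mask * noisy_feats                  # 5. mask the features
enhanced_mag = np.expm1(enhanced_feats) ** 2         # undo log1p and the 0.5 power
phase = np.angle(noisy_spec)                         # 6. reuse the noisy phase for ISTFT
enhanced_wav = istft(enhanced_mag * np.exp(1j * phase), length=len(noisy_wav))
```

Note that step 6 reconstructs the waveform from the enhanced magnitude and the original noisy phase, which is why the ISTFT step takes the noisy waveform as a second input in the pipeline above.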
Signal Approximation (SA)
The SpeechBrain implementation uses Signal Approximation (SA) rather than the more common Ideal Ratio Mask (IRM). In SA, the mask is applied to the log-compressed spectral magnitude and the loss is computed in the same domain:
SA Loss: L = MSE(mask * log1p(|noisy_spec|^0.5), log1p(|clean_spec|^0.5))
This differs from IRM-based training where the mask is applied to the raw spectrogram and the loss may be computed in the waveform domain. SA has been shown to provide good performance with simpler training dynamics.
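The SA objective can be written directly from the formula above. This is a NumPy sketch with synthetic magnitudes, not the SpeechBrain loss implementation:

```python
import numpy as np

def sa_loss(mask, noisy_spec, clean_spec):
    """Signal Approximation loss: MSE computed in the log1p(|.|^0.5) domain."""
    noisy_feats = np.log1p(np.abs(noisy_spec) ** 0.5)
    clean_feats = np.log1p(np.abs(clean_spec) ** 0.5)
    return np.mean((mask * noisy_feats - clean_feats) ** 2)

# With an oracle mask (clean feats / noisy feats) the loss is numerically zero;
# the spectrogram values below are synthetic, not from real audio.
rng = np.random.default_rng(0)
noisy = np.abs(rng.standard_normal((10, 257))) + 1e-3   # fake noisy magnitudes
clean = 0.5 * noisy                                      # fake clean magnitudes
oracle = np.log1p(clean ** 0.5) / np.log1p(noisy ** 0.5)
loss = sa_loss(oracle, noisy, clean)                     # near machine epsilon
```

Because clean ≤ noisy here, the oracle mask stays in [0, 1], matching the Sigmoid constraint on the model's output.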
Waveform Mapping Approach
The alternative waveform mapping approach bypasses spectral decomposition entirely:
enhanced_wav = model(noisy_wav)
L = MSE(enhanced_wav, clean_wav)
This approach is conceptually simpler and avoids phase estimation issues, but requires the model to implicitly learn the spectral structure of speech. The SpeechBrain implementation supports switching between spectral mask and waveform targets via a single configuration flag:
waveform_target: False # Set to True for waveform-domain loss
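A minimal sketch of how such a flag can switch the regression target between domains; the function and variable names here are illustrative, not the recipe's actual code:

```python
import numpy as np

def enhancement_loss(predict, clean_wav, clean_feats, waveform_target=False):
    """Switch the MSE target between the waveform and spectral-feature domains."""
    if waveform_target:
        # predict is an enhanced waveform, compared sample-by-sample
        return np.mean((predict - clean_wav) ** 2)
    # predict is a masked log-spectral feature map
    return np.mean((predict - clean_feats) ** 2)

clean_wav = np.zeros(160)
clean_feats = np.zeros((10, 257))
spec_loss = enhancement_loss(clean_feats + 0.1, clean_wav, clean_feats)
wav_loss = enhancement_loss(clean_wav + 0.1, clean_wav, clean_feats,
                            waveform_target=True)
```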
Loss Functions
Conventional enhancement training in SpeechBrain primarily uses:
- MSE Loss (speechbrain.nnet.losses.mse_loss): Mean Squared Error between enhanced and clean spectrograms (or waveforms). This is the default and most stable choice.
- STOI Loss (speechbrain.nnet.loss.stoi_loss.stoi_loss): A differentiable approximation of the Short-Time Objective Intelligibility metric. Can be used as an alternative training objective when waveform_target is True.
The MSE spectral loss provides smooth gradients and stable convergence. While it does not directly optimize perceptual quality metrics, it serves as a reliable proxy that generally improves PESQ and STOI scores.
Standard Brain.fit() Training Loop
Conventional enhancement training uses SpeechBrain's standard Brain.fit() training loop, which provides:
- Epoch iteration: Loop over training epochs with early stopping
- Batch processing: For each batch, call compute_forward() then compute_objectives()
- Gradient computation: Automatic backpropagation through the loss
- Optimizer step: Single optimizer updates model parameters
- Validation: Periodic evaluation on validation set with PESQ/STOI metrics
- Checkpointing: Save best model based on validation PESQ
This is significantly simpler than GAN-based training, which requires custom fit_batch() with dual optimizers and multiple sub-stages.
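The loop can be schematized in plain Python. This is a deliberate simplification with illustrative names; the real Brain.fit() also handles logging, checkpointing objects, devices, and data loading:

```python
def fit_sketch(epochs, train_batches, valid_batches,
               compute_forward, compute_objectives, backward_and_step,
               evaluate_pesq):
    """Schematic of the Brain.fit() flow for conventional enhancement."""
    best_pesq = float("-inf")
    for _epoch in range(epochs):
        for batch in train_batches:                        # batch processing
            predictions = compute_forward(batch)           # forward pass
            loss = compute_objectives(predictions, batch)  # single MSE-style loss
            backward_and_step(loss)                        # backprop + one optimizer step
        pesq = evaluate_pesq(valid_batches)                # periodic validation
        if pesq > best_pesq:                               # keep best checkpoint by PESQ
            best_pesq = pesq
    return best_pesq

# Dummy callables, just to exercise the control flow
history = []
best = fit_sketch(
    epochs=3,
    train_batches=[1, 2],
    valid_batches=[3],
    compute_forward=lambda b: b * 2,
    compute_objectives=lambda p, b: p - b,
    backward_and_step=lambda loss: history.append(loss),
    evaluate_pesq=lambda v: len(history) / 10.0,
)
```

The single forward/loss/step sequence per batch is exactly what makes this loop easy to debug compared with the multi-stage GAN variant.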
Validation Monitoring
During validation and testing, perceptual metrics are computed but not used as training objectives:
- PESQ is computed using the external pesq package for monitoring
- STOI is computed using SpeechBrain's differentiable STOI loss
- The best checkpoint is selected based on validation PESQ (max_keys=["pesq"])
This means the model is optimized for spectral MSE but selected for perceptual quality, combining the stability of MSE training with perceptual quality-based model selection.
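The selection rule amounts to taking the maximum over validation PESQ. The checkpoint records below are made-up numbers used only to illustrate max_keys=["pesq"]-style selection:

```python
# Hypothetical validation results (not from a real run)
checkpoints = [
    {"epoch": 10, "loss": 0.021, "pesq": 2.41},
    {"epoch": 20, "loss": 0.018, "pesq": 2.63},
    {"epoch": 30, "loss": 0.016, "pesq": 2.58},  # lowest loss, but worse PESQ
]

# max_keys=["pesq"]-style selection: keep the checkpoint with the highest PESQ,
# even though training itself minimized spectral MSE
best = max(checkpoints, key=lambda ckpt: ckpt["pesq"])
```

Note that the epoch with the lowest MSE is not the one kept, which is precisely the point of selecting on a perceptual metric.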
Key Design Decisions
- Log-compressed spectral features: Using log1p(|STFT|^0.5) compresses the dynamic range of the spectrogram, making the MSE loss more perceptually relevant (since human hearing is approximately logarithmic in amplitude)
- Sigmoid mask constraint: The [0, 1] mask constraint ensures the model can only attenuate spectral components, not amplify them. This prevents the model from introducing artifacts by boosting certain frequencies.
- Phase reuse: The waveform is reconstructed by combining the enhanced magnitude with the original noisy phase. While this introduces some phase mismatch, it avoids the difficulty of phase estimation and works well in practice.
- Ascending sort for efficiency: Training data is sorted by utterance length in ascending order, which minimizes padding waste in batched training.
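Two of these decisions are easy to verify numerically. The magnitudes and mask logits below are illustrative values, not real audio statistics:

```python
import numpy as np

# Log compression: log1p(|x|^0.5) shrinks the spread between strong and weak bins
mags = np.array([1e-4, 1e-2, 1.0, 100.0])   # six orders of magnitude apart
feats = np.log1p(mags ** 0.5)
raw_ratio = mags.max() / mags.min()          # ~1e6
feat_ratio = feats.max() / feats.min()       # a few hundred: far more uniform

# Attenuation-only masking: sigmoid outputs in [0, 1] can only scale bins down
logits = np.array([-5.0, 0.0, 5.0])
mask = 1.0 / (1.0 + np.exp(-logits))
masked = mask * feats[:3]                    # never exceeds the unmasked features
```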
Training Configuration
Typical hyperparameters for conventional enhancement training:
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 50 | Converges faster than GAN-based (750 epochs) |
| Batch size | 8 | Larger batches are feasible without per-sample PESQ computation |
| Learning rate | 0.0001 | Standard Adam learning rate |
| FFT size | 512 | 32 ms window at 16 kHz |
| Hop length | 16 ms | 50% overlap |
| Window | Hamming | Standard choice for speech processing |
| Loss | MSE | Spectral domain MSE (default) |
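The STFT rows of the table are mutually consistent at a 16 kHz sample rate, which a few lines of arithmetic confirm (all numbers come straight from the table):

```python
# Sanity-check the STFT parameters at 16 kHz
sr = 16000                              # sample rate in Hz
n_fft = 512                             # FFT size from the table
window_ms = n_fft / sr * 1000           # 512 samples -> 32.0 ms window
hop_samples = int(sr * 16 / 1000)       # 16 ms hop -> 256 samples
overlap = 1 - hop_samples / n_fft       # 0.5, i.e. 50% overlap
```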
Comparison with GAN-Based Training
| Aspect | Conventional | GAN-Based (MetricGAN+) |
|---|---|---|
| Training stability | High (convex-like loss landscape) | Lower (adversarial dynamics) |
| Training speed | Fast (50 epochs, large batches) | Slow (750 epochs, batch_size=1) |
| Implementation complexity | Low (standard Brain.fit) | High (custom fit_batch, sub-stages) |
| PESQ optimization | Indirect (via MSE proxy) | Direct (discriminator predicts PESQ) |
| Peak PESQ performance | Good | Better (specifically optimized) |
| Debugging ease | Easy (single loss, single optimizer) | Hard (multi-stage, dual optimizer) |
See Also
- Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward -- The concrete implementation of conventional enhancement training
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- The alternative GAN-based training approach
- Principle:Speechbrain_Speechbrain_Enhancement_Architecture_Selection -- How different architectures are selected for conventional training
- Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation -- Metrics used to evaluate trained models