
Principle:Speechbrain Speechbrain Conventional Enhancement Training

From Leeroopedia


Property                Value
Principle Name          Conventional_Enhancement_Training
Workflow                Speech_Enhancement_Training
Domains                 Speech_Enhancement, Deep_Learning
Source Repository       speechbrain/speechbrain
Related Implementation  Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward

Overview

Conventional Enhancement Training refers to training speech enhancement models using standard supervised regression objectives, without adversarial (GAN) components. The model directly minimizes the distance between enhanced and clean speech in either the spectral or waveform domain. This approach is simpler, faster to train, and more stable than GAN-based methods, while still achieving competitive enhancement quality.

Theoretical Background

Spectral Mask Approach

The spectral mask approach is the most common conventional enhancement paradigm. It operates in the Short-Time Fourier Transform (STFT) domain through the following pipeline:

1. Compute STFT of noisy waveform:    noisy_spec = STFT(noisy_wav)
2. Power-compress the magnitude:      noisy_mag = |noisy_spec|^0.5
3. Apply log compression:             noisy_feats = log1p(noisy_mag)
4. Predict mask via neural network:   mask = model(noisy_feats)
5. Apply mask to noisy features:      enhanced_feats = mask * noisy_feats
6. Reconstruct waveform via ISTFT:    enhanced_wav = ISTFT(enhanced_feats, noisy_wav)   (noisy phase reused)

The key idea is that the model predicts a multiplicative mask with values in [0, 1] (enforced by a Sigmoid activation). A value near 1 in a time-frequency bin means "keep this component" (likely speech), while a value near 0 means "suppress this component" (likely noise).
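Assuming a PyTorch backend, the pipeline above can be sketched as follows. The `enhance` function, its argument names, and the inverse-compression step are illustrative, not SpeechBrain's exact code; `model` is assumed to map log-compressed magnitudes to a same-shaped Sigmoid mask.

```python
import torch

def enhance(noisy_wav, model, n_fft=512, hop=256):
    """Illustrative spectral-mask enhancement pipeline (not SpeechBrain's exact code)."""
    window = torch.hamming_window(n_fft)
    # 1. STFT of the noisy waveform (complex spectrogram)
    noisy_spec = torch.stft(noisy_wav, n_fft, hop_length=hop,
                            window=window, return_complex=True)
    # 2-3. Power-compressed magnitude, then log compression
    noisy_feats = torch.log1p(noisy_spec.abs() ** 0.5)
    # 4-5. Predict a multiplicative [0, 1] mask and apply it
    mask = model(noisy_feats)
    enhanced_feats = mask * noisy_feats
    # 6. Undo the compression, reuse the noisy phase, and invert the STFT
    enhanced_mag = torch.expm1(enhanced_feats) ** 2
    enhanced_spec = torch.polar(enhanced_mag, torch.angle(noisy_spec))
    return torch.istft(enhanced_spec, n_fft, hop_length=hop,
                       window=window, length=noisy_wav.shape[-1])
```

A pass-through mask of 0.5 everywhere (a Sigmoid applied to zero logits) already exercises the full round trip and returns a waveform of the original length.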

Signal Approximation (SA)

The SpeechBrain implementation uses Signal Approximation (SA) rather than the more common Ideal Ratio Mask (IRM). In SA, the mask is applied to the log-compressed spectral magnitude and the loss is computed in the same domain:

SA Loss:  L = MSE(mask * log1p(|noisy_spec|^0.5), log1p(|clean_spec|^0.5))

This differs from mask-approximation training with an Ideal Ratio Mask (IRM) target, where the loss is computed between the predicted mask and the ideal mask itself rather than between enhanced and clean features. SA has been shown to provide good performance with simpler training dynamics.
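The SA loss above can be written in a few lines of NumPy; the function and argument names here are illustrative.

```python
import numpy as np

def sa_loss(mask, noisy_spec, clean_spec):
    """Signal Approximation loss sketch: MSE between the masked noisy
    features and the clean features, both in the log-compressed
    power-spectral-magnitude domain."""
    noisy_feats = np.log1p(np.abs(noisy_spec) ** 0.5)
    clean_feats = np.log1p(np.abs(clean_spec) ** 0.5)
    return np.mean((mask * noisy_feats - clean_feats) ** 2)
```

With a unit mask and identical noisy/clean spectra the loss is exactly zero, which is a quick sanity check when wiring up a new model.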

Waveform Mapping Approach

The alternative waveform mapping approach bypasses spectral decomposition entirely:

enhanced_wav = model(noisy_wav)
L = MSE(enhanced_wav, clean_wav)

This approach is conceptually simpler and avoids phase estimation issues, but requires the model to implicitly learn the spectral structure of speech. The SpeechBrain implementation supports switching between spectral mask and waveform targets via a single configuration flag:

waveform_target: False  # Set to True for waveform-domain loss
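The effect of that flag can be illustrated with a hypothetical helper (the function and argument names below are ours, not SpeechBrain's): the same MSE objective is used either way, but the flag decides which pair of tensors it compares.

```python
import numpy as np

def training_target(mask, noisy_feats, enhanced_wav, clean_feats, clean_wav,
                    waveform_target=False):
    """Hypothetical sketch of the loss-domain switch behind `waveform_target`."""
    if waveform_target:
        # Waveform mapping: compare resynthesized and clean waveforms.
        return np.mean((enhanced_wav - clean_wav) ** 2)
    # Spectral mask (SA): compare masked and clean log-compressed features.
    return np.mean((mask * noisy_feats - clean_feats) ** 2)
```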

Loss Functions

Conventional enhancement training in SpeechBrain primarily uses:

  • MSE Loss (speechbrain.nnet.losses.mse_loss): Mean Squared Error between enhanced and clean spectrograms (or waveforms). This is the default and most stable choice.
  • STOI Loss (speechbrain.nnet.loss.stoi_loss.stoi_loss): A differentiable approximation of the Short-Time Objective Intelligibility metric. Can be used as an alternative training objective when waveform_target is True.

The MSE spectral loss provides smooth gradients and stable convergence. While it does not directly optimize perceptual quality metrics, it serves as a reliable proxy that generally improves PESQ and STOI scores.

Standard Brain.fit() Training Loop

Conventional enhancement training uses SpeechBrain's standard Brain.fit() training loop, which provides:

  1. Epoch iteration: Loop over training epochs with early stopping
  2. Batch processing: For each batch, call compute_forward() then compute_objectives()
  3. Gradient computation: Automatic backpropagation through the loss
  4. Optimizer step: Single optimizer updates model parameters
  5. Validation: Periodic evaluation on validation set with PESQ/STOI metrics
  6. Checkpointing: Save best model based on validation PESQ

This is significantly simpler than GAN-based training, which requires custom fit_batch() with dual optimizers and multiple sub-stages.
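The six steps above can be caricatured in plain Python. The `fit` function and its callback arguments are hypothetical stand-ins for `Brain.fit()`, `compute_forward()`, and `compute_objectives()`, kept library-free to show only the control flow.

```python
def fit(model_step, loss_fn, opt_step, train_batches, valid_fn, epochs):
    """Minimal stand-in for the Brain.fit() loop (names hypothetical)."""
    best_pesq, best_epoch = float("-inf"), None
    for epoch in range(epochs):                 # 1. epoch iteration
        for batch in train_batches:             # 2. batch processing
            predictions = model_step(batch)     #    ~ compute_forward()
            loss = loss_fn(predictions, batch)  #    ~ compute_objectives()
            opt_step(loss)                      # 3-4. backprop + optimizer step
        pesq = valid_fn()                       # 5. validation metrics
        if pesq > best_pesq:                    # 6. checkpoint best PESQ
            best_pesq, best_epoch = pesq, epoch
    return best_pesq, best_epoch
```

Note there is a single loss and a single optimizer step per batch, which is exactly what makes this loop easier to debug than the GAN variant.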

Validation Monitoring

During validation and testing, perceptual metrics are computed but not used as training objectives:

  • PESQ is computed using the external pesq package for monitoring
  • STOI is computed using SpeechBrain's differentiable STOI loss
  • The best checkpoint is selected based on validation PESQ (max_keys=["pesq"])

This means the model is optimized for spectral MSE but selected for perceptual quality, combining the stability of MSE training with perceptual quality-based model selection.
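A toy sketch of that metric-based selection follows; the `best_checkpoint` helper and the checkpoint dictionaries are hypothetical, mirroring the `max_keys` semantics rather than SpeechBrain's Checkpointer API.

```python
def best_checkpoint(checkpoints, max_keys=("pesq",)):
    """Pick the checkpoint whose tracked metric is highest.
    Each checkpoint carries its validation metrics; the one with the
    best PESQ wins even if its MSE loss was not the lowest."""
    key = max_keys[0]
    return max(checkpoints, key=lambda ckpt: ckpt["metrics"][key])
```

In the usage below, epoch 2 is selected on PESQ despite having a slightly higher validation loss, which is the whole point of perceptual model selection.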

Key Design Decisions

  • Log-compressed spectral features: Using log1p(|STFT|^0.5) compresses the dynamic range of the spectrogram, making the MSE loss more perceptually relevant (since human hearing is approximately logarithmic in amplitude)
  • Sigmoid mask constraint: The [0, 1] mask constraint ensures the model can only attenuate spectral components, not amplify them. This prevents the model from introducing artifacts by boosting certain frequencies.
  • Phase reuse: The waveform is reconstructed by combining the enhanced magnitude with the original noisy phase. While this introduces some phase mismatch, it avoids the difficulty of phase estimation and works well in practice.
  • Ascending sort for efficiency: Training data is sorted by utterance length in ascending order, which minimizes padding waste in batched training.
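The first two of these decisions can be checked numerically with a standard-library snippet (illustrative values only): log compression squeezes a 120 dB magnitude range into a few feature units, and a Sigmoid-bounded mask can only attenuate, never amplify.

```python
import math

# Log1p of the power-compressed magnitude: a 1e-3..1e3 (120 dB)
# magnitude range maps to a narrow feature range.
lo = math.log1p(1e-3 ** 0.5)   # quiet time-frequency bin
hi = math.log1p(1e3 ** 0.5)    # loud time-frequency bin

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Whatever the network's logits, the mask stays in (0, 1),
# so a masked feature never exceeds the original.
feat = 2.0
for logit in (-5.0, 0.0, 5.0):
    masked = sigmoid(logit) * feat
    assert 0.0 < masked < feat
```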

Training Configuration

Typical hyperparameters for conventional enhancement training:

Parameter      Value    Rationale
Epochs         50       Converges faster than GAN-based training (750 epochs)
Batch size     8        Larger batches are feasible without per-sample PESQ computation
Learning rate  0.0001   Standard Adam learning rate
FFT size       512      32 ms window at 16 kHz
Hop length     16 ms    50% overlap
Window         Hamming  Standard choice for speech processing
Loss           MSE      Spectral-domain MSE (default)

Comparison with GAN-Based Training

Aspect                     Conventional                          GAN-Based (MetricGAN+)
Training stability         High (convex-like loss landscape)     Lower (adversarial dynamics)
Training speed             Fast (50 epochs, large batches)       Slow (750 epochs, batch_size=1)
Implementation complexity  Low (standard Brain.fit)              High (custom fit_batch, sub-stages)
PESQ optimization          Indirect (via MSE proxy)              Direct (discriminator predicts PESQ)
Peak PESQ performance      Good                                  Better (specifically optimized)
Debugging ease             Easy (single loss, single optimizer)  Hard (multi-stage, dual optimizer)
