Principle:Speechbrain Speechbrain Enhancement Architecture Selection
| Property | Value |
|---|---|
| Principle Name | Enhancement_Architecture_Selection |
| Workflow | Speech_Enhancement_Training |
| Domains | Model_Architecture, Speech_Enhancement |
| Source Repository | speechbrain/speechbrain |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement |
Overview
Enhancement Architecture Selection addresses the design decision of choosing and configuring the neural network architecture for a speech enhancement system. Different architectures offer distinct trade-offs in enhancement quality, computational cost, latency, and model size. SpeechBrain enables rapid experimentation across architectures through its HyperPyYAML configuration system, where the entire model architecture can be swapped by changing a single !include: directive in a YAML file.
Theoretical Background
Architecture Families for Speech Enhancement
Speech enhancement architectures can be broadly categorized into three families, each operating on different signal representations:
1. Spectral Mask Models
Spectral mask models operate in the Short-Time Fourier Transform (STFT) domain. They predict a multiplicative mask M(t,f) that is applied element-wise to the noisy spectrogram:
Enhanced_spec(t,f) = M(t,f) * Noisy_spec(t,f)
The mask values typically range from 0 to 1 (via Sigmoid activation), representing how much of each time-frequency bin to retain. This approach leverages the well-understood spectral structure of speech and noise. Two representative architectures are:
- BLSTM (Bidirectional Long Short-Term Memory): Processes spectral frames sequentially, capturing temporal context in both directions. Uses 2 BLSTM layers with hidden size 200, followed by linear layers and a Sigmoid output.
- 2D-FCN (2D Fully Convolutional Network): Treats the spectrogram as a 2D image and applies 7 convolutional layers with 9x9 kernels and 64 channels, followed by a Sigmoid. Captures local time-frequency patterns through spatial convolutions.
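The mask-multiply step Enhanced_spec(t,f) = M(t,f) * Noisy_spec(t,f) can be sketched with plain arrays (a toy illustration; the magnitude and mask values below are made up, not taken from any trained model):

```python
import numpy as np

# Toy noisy magnitude spectrogram of shape [time, freq] (made-up values)
noisy_spec = np.array([[1.0, 0.5, 0.2],
                       [0.8, 0.4, 0.1]])

# A mask M(t, f) with values in [0, 1], as a Sigmoid output layer would produce
mask = np.array([[0.9, 0.1, 0.0],
                 [1.0, 0.5, 0.2]])

# Element-wise application: Enhanced_spec(t, f) = M(t, f) * Noisy_spec(t, f)
enhanced_spec = mask * noisy_spec
```

Because the mask is bounded in [0, 1], each time-frequency bin of the output is attenuated but never amplified beyond the noisy input.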
2. Waveform Mapping Models
Waveform mapping models operate directly on the raw time-domain signal, bypassing the STFT entirely. They learn an end-to-end mapping:
clean_wav = f(noisy_wav)
This approach avoids phase estimation issues inherent in spectral methods, since the STFT phase is typically discarded or approximated. The trade-off is that the model must implicitly learn both spectral and temporal structure from raw samples.
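As a toy illustration of the time-domain contract (hypothetical; a real waveform model such as SEGAN stacks many learned convolutional layers with nonlinearities), even a single fixed 1D convolution has the right input/output shape: samples in, samples out, no STFT anywhere:

```python
import numpy as np

def toy_waveform_mapper(noisy_wav: np.ndarray) -> np.ndarray:
    # Stand-in for a learned mapping clean_wav = f(noisy_wav); here f is a
    # fixed 3-tap moving average (a crude low-pass filter), chosen only to
    # show the end-to-end shape contract of waveform mapping models.
    kernel = np.ones(3) / 3.0
    return np.convolve(noisy_wav, kernel, mode="same")

noisy_wav = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
clean_estimate = toy_waveform_mapper(noisy_wav)
```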
3. GAN-Based Models
Generative Adversarial Network (GAN) models add an adversarial training objective on top of either spectral or waveform approaches:
- MetricGAN / MetricGAN+: Uses a spectral mask generator paired with a discriminator that learns to predict perceptual quality scores (PESQ). The generator is trained to fool the discriminator into predicting high quality scores.
- SEGAN (Speech Enhancement GAN): Uses an encoder-decoder architecture operating on raw waveforms, trained with both adversarial and reconstruction losses.
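The MetricGAN-style generator objective can be sketched as a least-squares loss against the maximum quality score (a simplified scalar version; in the actual recipe the discriminator predicts a normalized PESQ score per utterance and the loss is averaged over a batch):

```python
def metricgan_generator_loss(d_score_enhanced: float, target: float = 1.0) -> float:
    # The generator is trained so that the discriminator (a learned proxy
    # for PESQ) assigns the enhanced speech the maximum normalized score.
    return (d_score_enhanced - target) ** 2

# If the discriminator already rates the enhanced speech at the maximum,
# the generator loss vanishes; low predicted quality yields a large loss.
loss_perfect = metricgan_generator_loss(1.0)
loss_poor = metricgan_generator_loss(0.2)
```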
The Quality-Latency-Size Trade-off
| Architecture | Quality | Latency | Model Size | Strengths |
|---|---|---|---|---|
| BLSTM | Good | Medium (requires bidirectional context) | Small | Strong temporal modeling |
| 2D-FCN | Good | Low (convolutional, parallelizable) | Medium | Local T-F pattern capture |
| MetricGAN+ | Best (PESQ-optimized) | High (GAN training) | Medium | Directly optimizes perceptual metric |
| SEGAN | Moderate | Low (fully convolutional) | Large | End-to-end waveform processing |
Declarative Architecture Specification with HyperPyYAML
SpeechBrain uses HyperPyYAML as a declarative configuration language that extends standard YAML with Python object instantiation. This enables:
- Architecture swapping: Change the model by modifying a single !include: line
- Hyperparameter tuning: Override any parameter via command-line or YAML overrides
- Compositional design: Models are composed from reusable building blocks (Sequential containers, RNN layers, CNN layers)
The key mechanism is the !include: directive in the training YAML:
```yaml
# Change this import to use a different model
models: !include:models/BLSTM.yaml
    N_fft: !ref <N_fft>
```
Switching to a different architecture requires only changing the included file:
```yaml
models: !include:models/2DFCN.yaml
    N_fft: !ref <N_fft>
```
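Command-line overrides follow SpeechBrain's usual recipe invocation, where any key in the YAML file can be overridden after the file path (file names here follow the enhancement template and may differ per recipe):

```
# Train with the architecture currently included in train.yaml
python train.py train.yaml

# Override a hyperparameter without editing the file
python train.py train.yaml --N_fft 512
```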
Model Specification Patterns
Each model YAML file uses SpeechBrain's !new: and !name: constructors to define the architecture:
```yaml
# BLSTM model specification
model: !new:speechbrain.nnet.containers.Sequential
    input_shape: [null, null, !ref <N_fft> // 2 + 1]
    lstm: !name:speechbrain.nnet.RNN.LSTM
        hidden_size: 200
        num_layers: 2
        bidirectional: True
    linear1: !name:speechbrain.nnet.linear.Linear
        n_neurons: 300
    act1: !new:torch.nn.LeakyReLU
    linear2: !name:speechbrain.nnet.linear.Linear
        n_neurons: !ref <N_fft> // 2 + 1
    act2: !new:torch.nn.Sigmoid
```
The !new: tag creates a new instance; !name: defers instantiation to the Sequential container which infers input shapes. The !ref tag enables cross-referencing parameters.
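The deferred-instantiation behavior of !name: is analogous to functools.partial (a simplified stand-in; the Linear class below is hypothetical, not SpeechBrain's actual speechbrain.nnet.linear.Linear):

```python
from functools import partial

class Linear:
    # Hypothetical simplification of a linear layer that needs its
    # input size before it can be constructed.
    def __init__(self, input_size: int, n_neurons: int):
        self.input_size = input_size
        self.n_neurons = n_neurons

# "!name:...Linear" with "n_neurons: 300" roughly corresponds to:
deferred_linear = partial(Linear, n_neurons=300)

# The Sequential container later infers the input size from the previous
# layer's output shape and completes the construction:
layer = deferred_linear(input_size=257)  # 257 = N_fft // 2 + 1 for N_fft = 512
```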
Key Design Decisions
- Separation of model from training logic: The model architecture is defined in a separate YAML file from the training hyperparameters, enabling orthogonal experimentation
- Sequential container pattern: Models are built as sequences of named layers inside speechbrain.nnet.containers.Sequential, which handles automatic shape inference
- Consistent I/O contract: All spectral mask models accept a spectrogram tensor of shape [batch, time, freq] and output a mask of the same shape, regardless of internal architecture
- Sigmoid output for masks: All spectral mask models use Sigmoid as the final activation, constraining mask values to [0, 1]
Relationship to Other Principles
The architecture selection principle connects to the broader training workflow:
- Data Preparation provides the input/target pairs that the selected architecture will process
- Architecture Selection (this principle) determines what model processes the data
- Training Strategy (GAN or Conventional) determines how the model is optimized
- Evaluation Metrics assess the quality of the trained model
See Also
- Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement -- The concrete mechanism for loading and instantiating architectures
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- GAN-specific architecture considerations (generator + discriminator)
- Principle:Speechbrain_Speechbrain_Conventional_Enhancement_Training -- How spectral mask and waveform models are trained conventionally