Principle:Speechbrain Speechbrain Enhancement Architecture Selection
| Property | Value |
|---|---|
| Principle Name | Enhancement_Architecture_Selection |
| Workflow | Speech_Enhancement_Training |
| Domains | Model_Architecture, Speech_Enhancement |
| Source Repository | speechbrain/speechbrain |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement |
Overview
Enhancement Architecture Selection addresses the design decision of choosing and configuring the neural network architecture for a speech enhancement system. Different architectures offer distinct trade-offs in enhancement quality, computational cost, latency, and model size. SpeechBrain enables rapid experimentation across architectures through its HyperPyYAML configuration system, where the entire model architecture can be swapped by changing a single !include: directive in a YAML file.
Theoretical Background
Architecture Families for Speech Enhancement
Speech enhancement architectures can be broadly categorized into three families, each operating on different signal representations:
1. Spectral Mask Models
Spectral mask models operate in the Short-Time Fourier Transform (STFT) domain. They predict a multiplicative mask M(t,f) that is applied element-wise to the noisy spectrogram:
Enhanced_spec(t,f) = M(t,f) * Noisy_spec(t,f)
The mask values typically range from 0 to 1 (via Sigmoid activation), representing how much of each time-frequency bin to retain. This approach leverages the well-understood spectral structure of speech and noise. Two representative architectures are:
- BLSTM (Bidirectional Long Short-Term Memory): Processes spectral frames sequentially, capturing temporal context in both directions. Uses 2 BLSTM layers with hidden size 200, followed by linear layers and a Sigmoid output.
- 2D-FCN (2D Fully Convolutional Network): Treats the spectrogram as a 2D image and applies 7 convolutional layers with 9x9 kernels and 64 channels, followed by a Sigmoid. Captures local time-frequency patterns through spatial convolutions.
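The mask-multiply step Enhanced_spec(t,f) = M(t,f) * Noisy_spec(t,f) can be sketched with plain arrays (a toy illustration; the magnitude and mask values below are made up, not taken from any trained model):

```python
import numpy as np

# Toy noisy magnitude spectrogram of shape [time, freq] (made-up values)
noisy_spec = np.array([[1.0, 0.5, 0.2],
                       [0.8, 0.4, 0.1]])

# A mask M(t, f) with values in [0, 1], as a Sigmoid output layer would produce
mask = np.array([[0.9, 0.1, 0.0],
                 [1.0, 0.5, 0.2]])

# Element-wise application: Enhanced_spec(t, f) = M(t, f) * Noisy_spec(t, f)
enhanced_spec = mask * noisy_spec
```

Because the mask is bounded in [0, 1], each time-frequency bin of the output is attenuated but never amplified beyond the noisy input.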
2. Waveform Mapping Models
Waveform mapping models operate directly on the raw time-domain signal, bypassing the STFT entirely. They learn an end-to-end mapping:
clean_wav = f(noisy_wav)
This approach avoids phase estimation issues inherent in spectral methods, since the STFT phase is typically discarded or approximated. The trade-off is that the model must implicitly learn both spectral and temporal structure from raw samples.
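As a toy illustration of the time-domain contract (hypothetical; a real waveform model such as SEGAN stacks many learned convolutional layers with nonlinearities), even a single fixed 1D convolution has the right input/output shape: samples in, samples out, no STFT anywhere:

```python
import numpy as np

def toy_waveform_mapper(noisy_wav: np.ndarray) -> np.ndarray:
    # Stand-in for a learned mapping clean_wav = f(noisy_wav); here f is a
    # fixed 3-tap moving average (a crude low-pass filter), chosen only to
    # show the end-to-end shape contract of waveform mapping models.
    kernel = np.ones(3) / 3.0
    return np.convolve(noisy_wav, kernel, mode="same")

noisy_wav = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
clean_estimate = toy_waveform_mapper(noisy_wav)
```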
3. GAN-Based Models
Generative Adversarial Network (GAN) models add an adversarial training objective on top of either spectral or waveform approaches:
- MetricGAN / MetricGAN+: Uses a spectral mask generator paired with a discriminator that learns to predict perceptual quality scores (PESQ). The generator is trained to fool the discriminator into predicting high quality scores.
- SEGAN (Speech Enhancement GAN): Uses an encoder-decoder architecture operating on raw waveforms, trained with both adversarial and reconstruction losses.
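The MetricGAN-style generator objective can be sketched as a least-squares loss against the maximum quality score (a simplified scalar version; in the actual recipe the discriminator predicts a normalized PESQ score per utterance and the loss is averaged over a batch):

```python
def metricgan_generator_loss(d_score_enhanced: float, target: float = 1.0) -> float:
    # The generator is trained so that the discriminator (a learned proxy
    # for PESQ) assigns the enhanced speech the maximum normalized score.
    return (d_score_enhanced - target) ** 2

# If the discriminator already rates the enhanced speech at the maximum,
# the generator loss vanishes; low predicted quality yields a large loss.
loss_perfect = metricgan_generator_loss(1.0)
loss_poor = metricgan_generator_loss(0.2)
```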
The Quality-Latency-Size Trade-off
| Architecture | Quality | Latency | Model Size | Strengths |
|---|---|---|---|---|
| BLSTM | Good | Medium (requires bidirectional context) | Small | Strong temporal modeling |
| 2D-FCN | Good | Low (convolutional, parallelizable) | Medium | Local T-F pattern capture |
| MetricGAN+ | Best (PESQ-optimized) | High (GAN training) | Medium | Directly optimizes perceptual metric |
| SEGAN | Moderate | Low (fully convolutional) | Large | End-to-end waveform processing |
Declarative Architecture Specification with HyperPyYAML
SpeechBrain uses HyperPyYAML as a declarative configuration language that extends standard YAML with Python object instantiation. This enables:
- Architecture swapping: Change the model by modifying a single !include: line
- Hyperparameter tuning: Override any parameter via command-line or YAML overrides
- Compositional design: Models are composed from reusable building blocks (Sequential containers, RNN layers, CNN layers)
The key mechanism is the !include: directive in the training YAML:
```yaml
# Change this import to use a different model
models: !include:models/BLSTM.yaml
    N_fft: !ref <N_fft>
```
Switching to a different architecture requires only changing the included file:
```yaml
models: !include:models/2DFCN.yaml
    N_fft: !ref <N_fft>
```
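Command-line overrides follow SpeechBrain's usual recipe invocation, where any key in the YAML file can be overridden after the file path (file names here follow the enhancement template and may differ per recipe):

```
# Train with the architecture currently included in train.yaml
python train.py train.yaml

# Override a hyperparameter without editing the file
python train.py train.yaml --N_fft 512
```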
Model Specification Patterns
Each model YAML file uses SpeechBrain's !new: and !name: constructors to define the architecture:
```yaml
# BLSTM model specification
model: !new:speechbrain.nnet.containers.Sequential
    input_shape: [null, null, !ref <N_fft> // 2 + 1]
    lstm: !name:speechbrain.nnet.RNN.LSTM
        hidden_size: 200
        num_layers: 2
        bidirectional: True
    linear1: !name:speechbrain.nnet.linear.Linear
        n_neurons: 300
    act1: !new:torch.nn.LeakyReLU
    linear2: !name:speechbrain.nnet.linear.Linear
        n_neurons: !ref <N_fft> // 2 + 1
    act2: !new:torch.nn.Sigmoid
```
The !new: tag creates a new instance; !name: defers instantiation to the Sequential container which infers input shapes. The !ref tag enables cross-referencing parameters.
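The deferred-instantiation behavior of !name: is analogous to functools.partial (a simplified stand-in; the Linear class below is hypothetical, not SpeechBrain's actual speechbrain.nnet.linear.Linear):

```python
from functools import partial

class Linear:
    # Hypothetical simplification of a linear layer that needs its
    # input size before it can be constructed.
    def __init__(self, input_size: int, n_neurons: int):
        self.input_size = input_size
        self.n_neurons = n_neurons

# "!name:...Linear" with "n_neurons: 300" roughly corresponds to:
deferred_linear = partial(Linear, n_neurons=300)

# The Sequential container later infers the input size from the previous
# layer's output shape and completes the construction:
layer = deferred_linear(input_size=257)  # 257 = N_fft // 2 + 1 for N_fft = 512
```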
Key Design Decisions
- Separation of model from training logic: The model architecture is defined in a separate YAML file from the training hyperparameters, enabling orthogonal experimentation
- Sequential container pattern: Models are built as sequences of named layers inside speechbrain.nnet.containers.Sequential, which handles automatic shape inference
- Consistent I/O contract: All spectral mask models accept a spectrogram tensor of shape [batch, time, freq] and output a mask of the same shape, regardless of internal architecture
- Sigmoid output for masks: All spectral mask models use Sigmoid as the final activation, constraining mask values to [0, 1]
Relationship to Other Principles
The architecture selection principle connects to the broader training workflow:
- Data Preparation provides the input/target pairs that the selected architecture will process
- Architecture Selection (this principle) determines what model processes the data
- Training Strategy (GAN or Conventional) determines how the model is optimized
- Evaluation Metrics assess the quality of the trained model
See Also
- Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement -- The concrete mechanism for loading and instantiating architectures
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- GAN-specific architecture considerations (generator + discriminator)
- Principle:Speechbrain_Speechbrain_Conventional_Enhancement_Training -- How spectral mask and waveform models are trained conventionally