
Principle:Speechbrain Speechbrain Enhancement Architecture Selection

From Leeroopedia


Property | Value
Principle Name | Enhancement_Architecture_Selection
Workflow | Speech_Enhancement_Training
Domains | Model_Architecture, Speech_Enhancement
Source Repository | speechbrain/speechbrain
Related Implementation | Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement

Overview

Enhancement Architecture Selection addresses the design decision of choosing and configuring the neural network architecture for a speech enhancement system. Different architectures offer distinct trade-offs in enhancement quality, computational cost, latency, and model size. SpeechBrain enables rapid experimentation across architectures through its HyperPyYAML configuration system, where the entire model architecture can be swapped by changing a single !include: directive in a YAML file.

Theoretical Background

Architecture Families for Speech Enhancement

Speech enhancement architectures can be broadly categorized into three families, each operating on different signal representations:

1. Spectral Mask Models

Spectral mask models operate in the Short-Time Fourier Transform (STFT) domain. They predict a multiplicative mask M(t,f) that is applied element-wise to the noisy spectrogram:

Enhanced_spec(t,f) = M(t,f) * Noisy_spec(t,f)

The mask values typically range from 0 to 1 (via Sigmoid activation), representing how much of each time-frequency bin to retain. This approach leverages the well-understood spectral structure of speech and noise. Two representative architectures are:

  • BLSTM (Bidirectional Long Short-Term Memory): Processes spectral frames sequentially, capturing temporal context in both directions. Uses 2 BLSTM layers with hidden size 200, followed by linear layers and a Sigmoid output.
  • 2D-FCN (2D Fully Convolutional Network): Treats the spectrogram as a 2D image and applies 7 convolutional layers with 9x9 kernels and 64 channels, followed by a Sigmoid. Captures local time-frequency patterns through spatial convolutions.
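The masking operation above can be sketched with NumPy (a toy illustration: the random logits stand in for the output of a trained BLSTM or 2D-FCN mask predictor):

```python
import numpy as np

# Toy magnitude spectrogram: 50 frames x 257 frequency bins (N_fft = 512).
rng = np.random.default_rng(0)
noisy_spec = np.abs(rng.standard_normal((50, 257)))

# Stand-in for the mask predictor: random logits squashed by a sigmoid,
# so the mask M(t, f) stays in (0, 1), as with the models' Sigmoid output.
logits = rng.standard_normal((50, 257))
mask = 1.0 / (1.0 + np.exp(-logits))

# Element-wise masking: Enhanced_spec(t, f) = M(t, f) * Noisy_spec(t, f)
enhanced_spec = mask * noisy_spec

# A mask in (0, 1) can only attenuate each time-frequency bin.
assert enhanced_spec.shape == noisy_spec.shape
assert np.all(enhanced_spec <= noisy_spec)
```

In a real system the mask is applied to the noisy STFT magnitude and the enhanced waveform is resynthesized using the noisy phase.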

2. Waveform Mapping Models

Waveform mapping models operate directly on the raw time-domain signal, bypassing the STFT entirely. They learn an end-to-end mapping:

clean_wav = f(noisy_wav)

This approach avoids phase estimation issues inherent in spectral methods, since the STFT phase is typically discarded or approximated. The trade-off is that the model must implicitly learn both spectral and temporal structure from raw samples.
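As a toy illustration of time-domain mapping, a single fixed convolution can stand in for a learned encoder-decoder such as SEGAN (the kernel here is illustrative, not a trained model):

```python
import numpy as np

def waveform_mapper(noisy_wav, kernel):
    # Stand-in for a learned end-to-end model f: one 1-D convolution
    # mapping raw samples to raw samples. Real waveform models stack
    # many such layers (e.g. SEGAN's encoder-decoder).
    return np.convolve(noisy_wav, kernel, mode="same")

rng = np.random.default_rng(0)
noisy_wav = rng.standard_normal(16000)   # 1 s of audio at 16 kHz
kernel = np.array([0.25, 0.5, 0.25])     # toy smoothing "model"
clean_est = waveform_mapper(noisy_wav, kernel)

# The mapping is sample-to-sample: output length equals input length,
# and no STFT or phase reconstruction is involved.
assert clean_est.shape == noisy_wav.shape
```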

3. GAN-Based Models

Generative Adversarial Network (GAN) models add an adversarial training objective on top of either spectral or waveform approaches:

  • MetricGAN / MetricGAN+: Uses a spectral mask generator paired with a discriminator that learns to predict perceptual quality scores (PESQ). The generator is trained to fool the discriminator into predicting high quality scores.
  • SEGAN (Speech Enhancement GAN): Uses an encoder-decoder architecture operating on raw waveforms, trained with both adversarial and reconstruction losses.
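A minimal NumPy sketch of MetricGAN-style least-squares objectives follows (function names and toy scores are hypothetical, not SpeechBrain's API; scores represent PESQ normalized to [0, 1]):

```python
import numpy as np

def generator_loss(d_enhanced, target=1.0):
    # The generator (a spectral mask model) is trained so that the
    # discriminator's predicted quality score for its output approaches
    # the maximum (target = 1.0, i.e. best normalized PESQ).
    return float(np.mean((d_enhanced - target) ** 2))

def discriminator_loss(d_enhanced, true_pesq, d_clean):
    # The discriminator regresses the true normalized PESQ of enhanced
    # speech and should assign the maximum score to clean speech.
    return float(np.mean((d_enhanced - true_pesq) ** 2)
                 + np.mean((d_clean - 1.0) ** 2))

# Toy discriminator outputs for a batch of 4 utterances
d_enh = np.array([0.6, 0.7, 0.5, 0.8])
true_pesq = np.array([0.55, 0.72, 0.48, 0.81])
d_clean = np.array([0.95, 0.97, 0.99, 0.96])

g_loss = generator_loss(d_enh)
d_loss = discriminator_loss(d_enh, true_pesq, d_clean)
```

The generator loss reaches zero only when the discriminator rates the enhanced output at the maximum quality score, which is what "fooling the discriminator" means here.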

The Quality-Latency-Size Trade-off

Architecture | Quality | Latency | Model Size | Strengths
BLSTM | Good | Medium (requires bidirectional context) | Small | Strong temporal modeling
2D-FCN | Good | Low (convolutional, parallelizable) | Medium | Local T-F pattern capture
MetricGAN+ | Best (PESQ-optimized) | High (GAN training) | Medium | Directly optimizes perceptual metric
SEGAN | Moderate | Low (fully convolutional) | Large | End-to-end waveform processing

Declarative Architecture Specification with HyperPyYAML

SpeechBrain uses HyperPyYAML as a declarative configuration language that extends standard YAML with Python object instantiation. This enables:

  • Architecture swapping: Change the model by modifying a single !include: line
  • Hyperparameter tuning: Override any parameter via command-line or YAML overrides
  • Compositional design: Models are composed from reusable building blocks (Sequential containers, RNN layers, CNN layers)

The key mechanism is the !include: directive in the training YAML:

# Change this import to use a different model
models: !include:models/BLSTM.yaml
    N_fft: !ref <N_fft>

Switching to a different architecture requires only changing the included file:

models: !include:models/2DFCN.yaml
    N_fft: !ref <N_fft>

Model Specification Patterns

Each model YAML file uses SpeechBrain's !new: and !name: constructors to define the architecture:

# BLSTM model specification
model: !new:speechbrain.nnet.containers.Sequential
    input_shape: [null, null, !ref <N_fft> // 2 + 1]
    lstm: !name:speechbrain.nnet.RNN.LSTM
        hidden_size: 200
        num_layers: 2
        bidirectional: True
    linear1: !name:speechbrain.nnet.linear.Linear
        n_neurons: 300
    act1: !new:torch.nn.LeakyReLU
    linear2: !name:speechbrain.nnet.linear.Linear
        n_neurons: !ref <N_fft> // 2 + 1
    act2: !new:torch.nn.Sigmoid

The !new: tag instantiates an object immediately; !name: defers instantiation to the Sequential container, which completes construction once it has inferred the layer's input shape from the preceding layers. The !ref tag cross-references parameters defined elsewhere in the YAML and supports simple arithmetic, as in !ref <N_fft> // 2 + 1 above.
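The deferred behavior of !name: can be approximated in plain Python with functools.partial (the Linear class below is a hypothetical stand-in for speechbrain.nnet.linear.Linear, not the real implementation):

```python
from functools import partial

class Linear:
    # Hypothetical stand-in for speechbrain.nnet.linear.Linear:
    # construction needs both the configured width and the input size.
    def __init__(self, n_neurons, input_size):
        self.n_neurons = n_neurons
        self.input_size = input_size

# What "!name:...Linear  n_neurons: 300" roughly produces: the class and
# its keyword arguments are stored, but nothing is constructed yet.
deferred = partial(Linear, n_neurons=300)

# The Sequential container later completes construction once it knows
# the shape flowing into this layer (257 = N_fft // 2 + 1 for N_fft=512).
layer = deferred(input_size=257)
```

This is why !name: layers can omit their input dimensions in the model YAML, while !new: objects must be fully specified at parse time.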

Key Design Decisions

  • Separation of model from training logic: The model architecture is defined in a separate YAML file from the training hyperparameters, enabling orthogonal experimentation
  • Sequential container pattern: Models are built as sequences of named layers inside speechbrain.nnet.containers.Sequential, which handles automatic shape inference
  • Consistent I/O contract: All spectral mask models accept a spectrogram tensor of shape [batch, time, freq] and output a mask of the same shape, regardless of internal architecture
  • Sigmoid output for masks: All spectral mask models use Sigmoid as the final activation, constraining mask values to [0, 1]

Relationship to Other Principles

The architecture selection principle connects to the broader training workflow:

  1. Data Preparation provides the input/target pairs that the selected architecture will process
  2. Architecture Selection (this principle) determines what model processes the data
  3. Training Strategy (GAN or Conventional) determines how the model is optimized
  4. Evaluation Metrics assess the quality of the trained model
