Principle: SpeechBrain SepFormer Model Configuration
| Field | Value |
|---|---|
| Principle Name | SepFormer_Model_Configuration |
| Domain(s) | Model_Architecture, Speech_Separation |
| Description | Configuring dual-path transformer architectures for time-domain speech separation |
| Knowledge Sources | Subakan et al. 2021 "Attention is All You Need in Speech Separation" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_SepFormer |
Overview
The SepFormer (Separation Transformer) is a dual-path transformer architecture for time-domain speech separation. It replaces the recurrent layers used in earlier dual-path models (e.g., DPRNN) with multi-head self-attention blocks, achieving state-of-the-art separation performance. Configuration of this architecture in SpeechBrain is managed entirely through HyperPyYAML configuration files, enabling reproducible and modular experimentation.
Theoretical Foundation
Time-Domain Separation Framework
SepFormer follows the encode-mask-decode paradigm:
- Encoder: A 1D convolutional layer converts the raw waveform into a latent representation
- Mask Network: A dual-path model predicts separation masks (one per source) in the latent space
- Decoder: A transposed 1D convolution reconstructs the separated waveforms from the masked latent representations
The key innovation is in the mask network, which uses Transformer blocks instead of RNNs for both local (intra-chunk) and global (inter-chunk) processing.
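The encode-mask-decode pipeline can be traced at the shape level with a small sketch. This is illustrative numpy code, not SpeechBrain's implementation: the random matrices stand in for the learned Conv1d encoder, the mask network, and the transposed-Conv1d decoder, and the overlap-add loop plays the role of the transposed convolution.

```python
import numpy as np

rng = np.random.default_rng(0)

kernel_size, stride, latent_dim, num_spks = 16, 8, 256, 2
waveform = rng.standard_normal(8000)            # 0.5 s of 16 kHz audio

# Encoder: frame the waveform and project each frame to a latent vector
num_frames = (len(waveform) - kernel_size) // stride + 1
frames = np.stack([waveform[i * stride : i * stride + kernel_size]
                   for i in range(num_frames)])
W_enc = rng.standard_normal((kernel_size, latent_dim))
latent = np.maximum(frames @ W_enc, 0.0)        # (num_frames, latent_dim)

# Mask network (stand-in): one mask per source, values in [0, 1]
masks = 1.0 / (1.0 + np.exp(-rng.standard_normal((num_spks, *latent.shape))))

# Decoder: project masked latents back to frames and overlap-add
W_dec = rng.standard_normal((latent_dim, kernel_size))
separated = np.zeros((num_spks, len(waveform)))
for s in range(num_spks):
    out_frames = (masks[s] * latent) @ W_dec    # (num_frames, kernel_size)
    for i in range(num_frames):
        separated[s, i * stride : i * stride + kernel_size] += out_frames[i]

print(separated.shape)   # one waveform-length signal per source → (2, 8000)
```

Note that the encoder's stride (8) is half its kernel size (16), matching the kernel_size/kernel_stride pair in the SepFormer configuration below.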
Dual-Path Processing
The latent representation from the encoder is segmented into overlapping chunks of size K. Processing alternates between:
- Intra-chunk (local) attention: Self-attention within each chunk captures fine-grained temporal patterns. Each chunk of K frames attends to all other frames in the same chunk.
- Inter-chunk (global) attention: Self-attention across chunks at the same position captures long-range dependencies. Each position across all chunks attends to the same position in every other chunk.
This dual-path design allows the model to handle long sequences efficiently. Instead of applying self-attention over the entire sequence (which scales quadratically), it operates on chunks of manageable size while still capturing global context.
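The chunking can be made concrete with a tensor-shape sketch (hypothetical sizes; 50% overlap as in the SepFormer paper). Intra- and inter-chunk attention are just self-attention applied along the two short axes of the chunked tensor:

```python
import numpy as np

T, K, d_model = 1000, 250, 256
hop = K // 2                                   # 50% chunk overlap
latent = np.zeros((T, d_model))                # encoder output, time-major

# Segment into overlapping chunks of K frames
starts = range(0, T - K + 1, hop)
chunks = np.stack([latent[s:s + K] for s in starts])  # (num_chunks, K, d_model)

# Intra-chunk attention: each transformer sees sequences of length K (axis 1)
intra_view = chunks                                   # (num_chunks, K, d_model)

# Inter-chunk attention: transpose so each sequence collects the same position
# from every chunk (sequence length becomes num_chunks)
inter_view = chunks.transpose(1, 0, 2)                # (K, num_chunks, d_model)

# Self-attention cost: full-sequence attention is O(T^2); dual-path pays
# O(K^2) in the intra pass plus O(num_chunks^2) in the inter pass
num_chunks = chunks.shape[0]
print(num_chunks, T * T, K * K + num_chunks * num_chunks)
```

Even at this modest sequence length the attention cost drops from 1,000,000 pairwise scores to roughly 62,500; the gap widens quickly as T grows.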
Architecture Components
| Component | Class | Description |
|---|---|---|
| Encoder | speechbrain.lobes.models.dual_path.Encoder | Conv1d layer: kernel_size=16, out_channels=256 |
| Intra Transformer | speechbrain.lobes.models.dual_path.SBTransformerBlock | 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding |
| Inter Transformer | speechbrain.lobes.models.dual_path.SBTransformerBlock | 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding |
| Dual Path Model | speechbrain.lobes.models.dual_path.Dual_Path_Model | 2 dual-path layers, K=250, layer norm, skip connections around intra |
| MaskNet | (integrated in Dual_Path_Model) | Predicts num_spks masks via PReLU + Conv1d |
| Decoder | speechbrain.lobes.models.dual_path.Decoder | Transposed Conv1d: kernel_size=16, stride=8 |
Configuration via HyperPyYAML
SpeechBrain uses HyperPyYAML, an extension of standard YAML that supports object instantiation, cross-referencing, and arithmetic. The SepFormer configuration file defines all model components, training hyperparameters, and data paths in a single declarative file:
```yaml
# Key architectural parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8
d_ffn: 1024
num_spks: 2

# Model component instantiation
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: !ref <kernel_size>
    out_channels: !ref <N_encoder_out>

SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: !ref <N_encoder_out>
    out_channels: !ref <out_channels>
    num_layers: 2
    K: 250
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
```
Training Configuration
The YAML file also specifies the training regime:
- Optimizer: Adam with learning rate 0.00015 and no weight decay
- Loss function: `get_si_snr_with_pitwrapper` (SI-SNR with Permutation Invariant Training)
- Learning rate scheduler: ReduceLROnPlateau with factor 0.5 and patience 2
- Gradient clipping: Max gradient norm of 5
- Loss thresholding: examples whose loss is already below -30 dB (i.e., nearly perfectly separated) are excluded from the batch loss, focusing training on harder examples
- Precision: Mixed precision (fp16) for faster training
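The permutation-invariant SI-SNR loss can be sketched in numpy. This is a simplified two-speaker version for illustration, not SpeechBrain's `get_si_snr_with_pitwrapper`, which handles batches and arbitrary speaker counts:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between one estimate and one reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref  # scaled target
    noise = est - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def pit_si_snr_loss(est, ref):
    """Negative SI-SNR under the best of the two speaker orderings (PIT)."""
    perm_a = si_snr(est[0], ref[0]) + si_snr(est[1], ref[1])
    perm_b = si_snr(est[0], ref[1]) + si_snr(est[1], ref[0])
    return -max(perm_a, perm_b) / 2

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 1600))
# Estimates equal the references with speakers swapped: PIT still finds the match
loss = pit_si_snr_loss(ref[::-1].copy(), ref)
print(loss < -30)   # near-perfect separation gives a strongly negative loss
```

Because the model cannot know which output slot corresponds to which speaker, PIT evaluates every ordering and back-propagates only through the best one; the -30 dB case here is exactly the "easy example" regime that the loss thresholding above filters out.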
Design Principles
- Modularity: Each component (encoder, transformer blocks, decoder) is independently configurable and replaceable
- Declarative specification: The YAML file fully describes the experiment, enabling reproducibility without code changes
- Cross-referencing: parameters like `N_encoder_out` are defined once and referenced throughout, preventing inconsistencies
- Pretrained model support: the configuration supports loading pretrained weights via the `Pretrainer` class
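As a sketch of the last point, a pretrainer block in the same recipe style maps component names to checkpoint files; the structure below follows SpeechBrain's published SepFormer hyperparameter files, but `pretrained_path` and the checkpoint filenames are placeholders assumed to be defined elsewhere in the recipe:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        encoder: !ref <Encoder>
        masknet: !ref <MaskNet>
        decoder: !ref <Decoder>
    paths:
        encoder: !ref <pretrained_path>/encoder.ckpt
        masknet: !ref <pretrained_path>/masknet.ckpt
        decoder: !ref <pretrained_path>/decoder.ckpt
```

Because the loadables reference the same `Encoder`, `MaskNet`, and `Decoder` objects defined earlier in the file, pretrained weights are injected into the exact components the experiment instantiates.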