Principle: SpeechBrain SepFormer Model Configuration
| Field | Value |
|---|---|
| Principle Name | SepFormer_Model_Configuration |
| Domain(s) | Model_Architecture, Speech_Separation |
| Description | Configuring dual-path transformer architectures for time-domain speech separation |
| Knowledge Sources | Subakan et al. 2021 "Attention is All You Need in Speech Separation" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_SepFormer |
Overview
The SepFormer (Separation Transformer) is a dual-path transformer architecture for time-domain speech separation. It replaces the recurrent layers used in earlier dual-path models (e.g., DPRNN) with multi-head self-attention blocks, achieving state-of-the-art separation performance. Configuration of this architecture in SpeechBrain is managed entirely through HyperPyYAML configuration files, enabling reproducible and modular experimentation.
Theoretical Foundation
Time-Domain Separation Framework
SepFormer follows the encode-mask-decode paradigm:
- Encoder: A 1D convolutional layer converts the raw waveform into a latent representation
- Mask Network: A dual-path model predicts separation masks (one per source) in the latent space
- Decoder: A transposed 1D convolution reconstructs the separated waveforms from the masked latent representations
The key innovation is in the mask network, which uses Transformer blocks instead of RNNs for both local (intra-chunk) and global (inter-chunk) processing.
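The encode-mask-decode pipeline can be traced at the shape level with a small sketch. This is illustrative numpy code, not SpeechBrain's implementation: the random matrices stand in for the learned Conv1d encoder, the mask network, and the transposed-Conv1d decoder, and the overlap-add loop plays the role of the transposed convolution.

```python
import numpy as np

rng = np.random.default_rng(0)

kernel_size, stride, latent_dim, num_spks = 16, 8, 256, 2
waveform = rng.standard_normal(8000)            # 0.5 s of 16 kHz audio

# Encoder: frame the waveform and project each frame to a latent vector
num_frames = (len(waveform) - kernel_size) // stride + 1
frames = np.stack([waveform[i * stride : i * stride + kernel_size]
                   for i in range(num_frames)])
W_enc = rng.standard_normal((kernel_size, latent_dim))
latent = np.maximum(frames @ W_enc, 0.0)        # (num_frames, latent_dim)

# Mask network (stand-in): one mask per source, values in [0, 1]
masks = 1.0 / (1.0 + np.exp(-rng.standard_normal((num_spks, *latent.shape))))

# Decoder: project masked latents back to frames and overlap-add
W_dec = rng.standard_normal((latent_dim, kernel_size))
separated = np.zeros((num_spks, len(waveform)))
for s in range(num_spks):
    out_frames = (masks[s] * latent) @ W_dec    # (num_frames, kernel_size)
    for i in range(num_frames):
        separated[s, i * stride : i * stride + kernel_size] += out_frames[i]

print(separated.shape)   # one waveform-length signal per source → (2, 8000)
```

Note that the encoder's stride (8) is half its kernel size (16), matching the kernel_size/kernel_stride pair in the SepFormer configuration below.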
Dual-Path Processing
The latent representation from the encoder is segmented into overlapping chunks of size K. Processing alternates between:
- Intra-chunk (local) attention: Self-attention within each chunk captures fine-grained temporal patterns. Each chunk of K frames attends to all other frames in the same chunk.
- Inter-chunk (global) attention: Self-attention across chunks at the same position captures long-range dependencies. Each position across all chunks attends to the same position in every other chunk.
This dual-path design allows the model to handle long sequences efficiently. Instead of applying self-attention over the entire sequence (which scales quadratically), it operates on chunks of manageable size while still capturing global context.
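The chunking can be made concrete with a tensor-shape sketch (hypothetical sizes; 50% overlap as in the SepFormer paper). Intra- and inter-chunk attention are just self-attention applied along the two short axes of the chunked tensor:

```python
import numpy as np

T, K, d_model = 1000, 250, 256
hop = K // 2                                   # 50% chunk overlap
latent = np.zeros((T, d_model))                # encoder output, time-major

# Segment into overlapping chunks of K frames
starts = range(0, T - K + 1, hop)
chunks = np.stack([latent[s:s + K] for s in starts])  # (num_chunks, K, d_model)

# Intra-chunk attention: each transformer sees sequences of length K (axis 1)
intra_view = chunks                                   # (num_chunks, K, d_model)

# Inter-chunk attention: transpose so each sequence collects the same position
# from every chunk (sequence length becomes num_chunks)
inter_view = chunks.transpose(1, 0, 2)                # (K, num_chunks, d_model)

# Self-attention cost: full-sequence attention is O(T^2); dual-path pays
# O(K^2) in the intra pass plus O(num_chunks^2) in the inter pass
num_chunks = chunks.shape[0]
print(num_chunks, T * T, K * K + num_chunks * num_chunks)
```

Even at this modest sequence length the attention cost drops from 1,000,000 pairwise scores to roughly 62,500; the gap widens quickly as T grows.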
Architecture Components
| Component | Class | Description |
|---|---|---|
| Encoder | speechbrain.lobes.models.dual_path.Encoder | Conv1d layer: kernel_size=16, out_channels=256 |
| Intra Transformer | speechbrain.lobes.models.dual_path.SBTransformerBlock | 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding |
| Inter Transformer | speechbrain.lobes.models.dual_path.SBTransformerBlock | 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding |
| Dual Path Model | speechbrain.lobes.models.dual_path.Dual_Path_Model | 2 dual-path layers, K=250, layer norm, skip connections around intra |
| MaskNet | (integrated in Dual_Path_Model) | Predicts num_spks masks via PReLU + Conv1d |
| Decoder | speechbrain.lobes.models.dual_path.Decoder | Transposed Conv1d: kernel_size=16, stride=8 |
Configuration via HyperPyYAML
SpeechBrain uses HyperPyYAML, an extension of standard YAML that supports object instantiation, cross-referencing, and arithmetic. The SepFormer configuration file defines all model components, training hyperparameters, and data paths in a single declarative file:
```yaml
# Key architectural parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8
d_ffn: 1024
num_spks: 2

# Model component instantiation
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: !ref <kernel_size>
    out_channels: !ref <N_encoder_out>

SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: !ref <N_encoder_out>
    out_channels: !ref <out_channels>
    num_layers: 2
    K: 250
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
```
Training Configuration
The YAML file also specifies the training regime:
- Optimizer: Adam with learning rate 0.00015 and no weight decay
- Loss function: `get_si_snr_with_pitwrapper` (SI-SNR with Permutation Invariant Training)
- Learning rate scheduler: ReduceLROnPlateau with factor 0.5 and patience 2
- Gradient clipping: Max gradient norm of 5
- Loss thresholding: examples whose loss is already below -30 dB (i.e., nearly perfectly separated) are excluded from the batch loss, focusing training on harder examples
- Precision: Mixed precision (fp16) for faster training
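The permutation-invariant SI-SNR loss can be sketched in numpy. This is a simplified two-speaker version for illustration, not SpeechBrain's `get_si_snr_with_pitwrapper`, which handles batches and arbitrary speaker counts:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between one estimate and one reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref  # scaled target
    noise = est - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def pit_si_snr_loss(est, ref):
    """Negative SI-SNR under the best of the two speaker orderings (PIT)."""
    perm_a = si_snr(est[0], ref[0]) + si_snr(est[1], ref[1])
    perm_b = si_snr(est[0], ref[1]) + si_snr(est[1], ref[0])
    return -max(perm_a, perm_b) / 2

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 1600))
# Estimates equal the references with speakers swapped: PIT still finds the match
loss = pit_si_snr_loss(ref[::-1].copy(), ref)
print(loss < -30)   # near-perfect separation gives a strongly negative loss
```

Because the model cannot know which output slot corresponds to which speaker, PIT evaluates every ordering and back-propagates only through the best one; the -30 dB case here is exactly the "easy example" regime that the loss thresholding above filters out.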
Design Principles
- Modularity: Each component (encoder, transformer blocks, decoder) is independently configurable and replaceable
- Declarative specification: The YAML file fully describes the experiment, enabling reproducibility without code changes
- Cross-referencing: parameters like `N_encoder_out` are defined once and referenced throughout, preventing inconsistencies
- Pretrained model support: the configuration supports loading pretrained weights via the `Pretrainer` class
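As a sketch of the last point, a pretrainer block in the same recipe style maps component names to checkpoint files; the structure below follows SpeechBrain's published SepFormer hyperparameter files, but `pretrained_path` and the checkpoint filenames are placeholders assumed to be defined elsewhere in the recipe:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        encoder: !ref <Encoder>
        masknet: !ref <MaskNet>
        decoder: !ref <Decoder>
    paths:
        encoder: !ref <pretrained_path>/encoder.ckpt
        masknet: !ref <pretrained_path>/masknet.ckpt
        decoder: !ref <pretrained_path>/decoder.ckpt
```

Because the loadables reference the same `Encoder`, `MaskNet`, and `Decoder` objects defined earlier in the file, pretrained weights are injected into the exact components the experiment instantiates.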