
Principle:Speechbrain Speechbrain SepFormer Model Configuration

From Leeroopedia


Principle Name: SepFormer_Model_Configuration
Domain(s): Model_Architecture, Speech_Separation
Description: Configuring dual-path transformer architectures for time-domain speech separation
Knowledge Sources: Subakan et al. 2021, "Attention Is All You Need in Speech Separation"
Related Implementation: Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_SepFormer

Overview

The SepFormer (Separation Transformer) is a dual-path transformer architecture for time-domain speech separation. It replaces the recurrent layers used in earlier dual-path models (e.g., DPRNN) with multi-head self-attention blocks, achieving state-of-the-art separation performance. Configuration of this architecture in SpeechBrain is managed entirely through HyperPyYAML configuration files, enabling reproducible and modular experimentation.

Theoretical Foundation

Time-Domain Separation Framework

SepFormer follows the encode-mask-decode paradigm:

  1. Encoder: A 1D convolutional layer converts the raw waveform into a latent representation
  2. Mask Network: A dual-path model predicts separation masks (one per source) in the latent space
  3. Decoder: A transposed 1D convolution reconstructs the separated waveforms from the masked latent representations

The key innovation is in the mask network, which uses Transformer blocks instead of RNNs for both local (intra-chunk) and global (inter-chunk) processing.
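The encode-mask-decode pipeline can be sketched numerically. The following is a toy illustration in pure Python with shrunken dimensions (the real SepFormer encoder uses kernel_size=16, stride=8, and 256 filters) and a random sigmoid mask standing in for the dual-path transformer; it is not the SpeechBrain implementation, only the paradigm:

```python
import math
import random

random.seed(0)

# Toy dimensions (the real model: kernel=16, stride=8, 256 filters).
kernel, stride, n_filters, num_spks = 4, 2, 8, 2
T = 32
wav = [random.gauss(0, 1) for _ in range(T)]       # raw waveform, T samples

# Encoder: a 1D conv is a dot product of each overlapping frame with each filter.
basis = [[random.gauss(0, 1) for _ in range(kernel)] for _ in range(n_filters)]
n_frames = (T - kernel) // stride + 1
latent = [[sum(wav[i * stride + k] * f[k] for k in range(kernel)) for f in basis]
          for i in range(n_frames)]                # n_frames x n_filters

# Mask network stand-in: one sigmoid mask per source (random here, in place
# of the dual-path transformer that predicts them).
masks = [[[1 / (1 + math.exp(-random.gauss(0, 1))) for _ in range(n_filters)]
          for _ in range(n_frames)] for _ in range(num_spks)]

# Decoder: a transposed conv maps masked latent frames back through the
# basis, then overlap-adds them into a waveform.
def decode(masked):
    out = [0.0] * T
    for i, frame in enumerate(masked):
        for k in range(kernel):
            out[i * stride + k] += sum(frame[f] * basis[f][k]
                                       for f in range(n_filters))
    return out

separated = [decode([[latent[i][f] * masks[s][i][f] for f in range(n_filters)]
                     for i in range(n_frames)])
             for s in range(num_spks)]
print(len(separated), len(separated[0]))           # num_spks waveforms of length T
```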

Dual-Path Processing

The latent representation from the encoder is segmented into overlapping chunks of size K. Processing alternates between:

  • Intra-chunk (local) attention: Self-attention within each chunk captures fine-grained temporal patterns. Each chunk of K frames attends to all other frames in the same chunk.
  • Inter-chunk (global) attention: Self-attention across chunks at the same position captures long-range dependencies. Each position across all chunks attends to the same position in every other chunk.

This dual-path design allows the model to handle long sequences efficiently. Instead of applying self-attention over the entire sequence (which scales quadratically), it operates on chunks of manageable size while still capturing global context.
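The segmentation and the two attention views can be made concrete with a small shape-bookkeeping sketch (toy sizes L=20, K=8, 50% overlap are assumptions for illustration; the real configuration uses K=250):

```python
# Latent sequence of L frames with N features, chunk size K, 50% overlap.
L, N, K = 20, 3, 8
hop = K // 2
seq = [[float(t)] * N for t in range(L)]

# Pad so the last chunk is full, then segment into overlapping chunks.
n_chunks = -(-(L - K) // hop) + 1 if L > K else 1
pad = (n_chunks - 1) * hop + K - L
seq = seq + [[0.0] * N for _ in range(pad)]
chunks = [seq[c * hop : c * hop + K] for c in range(n_chunks)]
# chunks now has shape (n_chunks, K, N).

# Intra-chunk attention operates over the K frames of one chunk at a time:
intra_view = chunks[0]                                  # K x N
# Inter-chunk attention operates over the same position across all chunks:
inter_view = [chunks[c][0] for c in range(n_chunks)]    # n_chunks x N

# Cost intuition: full self-attention is O(L^2); the dual path pays roughly
# O(K^2) per chunk plus O(n_chunks^2) per position, both much smaller.
print(n_chunks, len(intra_view), len(inter_view))
```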

Architecture Components

  • Encoder (speechbrain.lobes.models.dual_path.Encoder): Conv1d layer with kernel_size=16, out_channels=256
  • Intra Transformer (speechbrain.lobes.models.dual_path.SBTransformerBlock): 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding
  • Inter Transformer (speechbrain.lobes.models.dual_path.SBTransformerBlock): 8 layers, 8 heads, d_model=256, d_ffn=1024, pre-norm, positional encoding
  • Dual Path Model (speechbrain.lobes.models.dual_path.Dual_Path_Model): 2 dual-path layers, chunk size K=250, layer norm, skip connections around the intra model
  • MaskNet (integrated in Dual_Path_Model): predicts num_spks masks via PReLU + Conv1d
  • Decoder (speechbrain.lobes.models.dual_path.Decoder): transposed Conv1d with kernel_size=16, stride=8

Configuration via HyperPyYAML

SpeechBrain uses HyperPyYAML, an extension of standard YAML that supports object instantiation, cross-referencing, and arithmetic. The SepFormer configuration file defines all model components, training hyperparameters, and data paths in a single declarative file:

# Key architectural parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8
d_ffn: 1024
num_spks: 2

# Model component instantiation
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: !ref <kernel_size>
    out_channels: !ref <N_encoder_out>

SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>
    use_positional_encoding: True
    norm_before: True

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: !ref <out_channels>
    nhead: 8
    d_ffn: !ref <d_ffn>
    use_positional_encoding: True
    norm_before: True

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: !ref <N_encoder_out>
    out_channels: !ref <out_channels>
    num_layers: 2
    K: 250
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
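The cross-referencing mechanism can be illustrated with a toy resolver. This sketch mimics only the `!ref <key>` substitution with a hypothetical flat mapping; the real HyperPyYAML library also handles `!new:` object instantiation, nesting, and arithmetic inside references:

```python
import re

# Toy stand-in for HyperPyYAML's !ref mechanism: replace "!ref <key>" with
# the value defined earlier in the same mapping. NOT the real implementation.
raw = {
    "N_encoder_out": 256,
    "kernel_size": 16,
    "Encoder.out_channels": "!ref <N_encoder_out>",
    "Encoder.kernel_size": "!ref <kernel_size>",
}

def resolve(mapping):
    out = {}
    for key, val in mapping.items():
        if isinstance(val, str) and val.startswith("!ref"):
            target = re.search(r"<(\w+)>", val).group(1)
            val = out[target]          # reuse the already-defined value
        out[key] = val
    return out

config = resolve(raw)
print(config["Encoder.out_channels"])  # same value as N_encoder_out
```

Defining `N_encoder_out` once and referencing it everywhere is what prevents the encoder and mask network from silently drifting out of agreement on the latent dimension.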

Training Configuration

The YAML file also specifies the training regime:

  • Optimizer: Adam with learning rate 0.00015 and no weight decay
  • Loss function: get_si_snr_with_pitwrapper (SI-SNR with Permutation Invariant Training)
  • Learning rate scheduler: ReduceLROnPlateau with factor 0.5 and patience 2
  • Gradient clipping: Max gradient norm of 5
  • Loss thresholding: Skip easy examples whose loss falls below -30 dB
  • Precision: Mixed precision (fp16) for faster training
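The loss-thresholding step above can be sketched as follows. The per-example loss values are hypothetical; the point is that negative SI-SNR serves as the loss, so very negative values mark already-easy examples that are filtered out before averaging:

```python
# Negative SI-SNR is the loss, so lower (more negative) = easier example.
# With threshold = -30, examples already below -30 dB contribute nothing,
# keeping the gradient focused on still-difficult mixtures.
threshold = -30.0
per_example_loss = [-42.0, -12.5, -31.1, -8.3]   # hypothetical batch values

kept = [l for l in per_example_loss if l > threshold]
batch_loss = sum(kept) / len(kept) if kept else 0.0
print(kept, batch_loss)
```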

Design Principles

  • Modularity: Each component (encoder, transformer blocks, decoder) is independently configurable and replaceable
  • Declarative specification: The YAML file fully describes the experiment, enabling reproducibility without code changes
  • Cross-referencing: Parameters like N_encoder_out are defined once and referenced throughout, preventing inconsistencies
  • Pretrained model support: The configuration supports loading pretrained weights via the Pretrainer class
