Principle:Speechbrain Speechbrain Transformer ASR Training

Knowledge Sources	Vaswani et al. 2017 "Attention Is All You Need"; Gulati et al. 2020 "Conformer: Convolution-augmented Transformer for Speech Recognition"; Karita et al. 2019 "A Comparative Study on Transformer vs RNN in Speech Applications"; SpeechBrain
Domains	ASR, Transformer_Architecture, Deep_Learning
Last Updated	2026-02-09 00:00 GMT

Overview

Transformer-based automatic speech recognition replaces recurrent encoder-decoder architectures with self-attention, which allows parallel computation over the full input sequence. Its main components are multi-head attention, positional encoding, and, in the Conformer variant, convolution augmentation.

Description

The Transformer architecture, originally developed for natural language processing, has been adapted for speech recognition by combining CNN-based front-end feature extraction with stacked self-attention encoder and decoder layers. Unlike RNN-based Seq2Seq models, which process the input one step at a time, Transformers compute attention over all positions simultaneously, enabling efficient parallel training on GPUs. The Conformer variant augments Transformer blocks with depthwise separable convolution modules, capturing global dependencies through self-attention and local patterns through convolution. Training uses the same joint CTC + attention loss as Seq2Seq ASR, but the encoder and decoder are both built from Transformer layers rather than recurrent networks. Positional encoding (sinusoidal or learned) is added to preserve sequence ordering: self-attention has no built-in notion of position (it is permutation-equivariant), so without it the model could not distinguish one frame order from another.

Usage

Use this principle when building high-accuracy ASR systems where training throughput and parallelism are important. Transformer and Conformer encoders typically outperform recurrent encoders on large-scale datasets because self-attention connects any two frames directly, avoiding the long gradient paths that make recurrent networks prone to vanishing gradients. This approach is particularly effective when combined with pretrained representations from wav2vec 2.0 or similar self-supervised models. Choose this over Seq2Seq RNN when sufficient GPU memory and training data are available.

Theoretical Basis

CNN Front-End Feature Extraction

Raw audio or filterbank features are first processed by a CNN front-end that performs subsampling:

x_sub = CNN_Frontend(x)

where:
  x     = (x_1, ..., x_T)     -- input feature frames (e.g., 80-dim log-mel filterbanks)
  x_sub = (x'_1, ..., x'_{T'}) -- subsampled frames, T' ≈ T/4 (two conv layers with stride 2)

The CNN front-end reduces the sequence length, making self-attention computationally tractable for long audio inputs.
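To make the length reduction concrete, the sketch below counts the time steps left after two stride-2 convolutions. The kernel size of 3 and zero padding are illustrative assumptions, not values taken from any particular recipe, and the function names are hypothetical.

```python
def conv_out_len(length, kernel=3, stride=2, padding=0):
    """Output length of one 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def frontend_out_len(num_frames, num_layers=2):
    """Time steps remaining after `num_layers` stride-2 convolutions.

    Assumes every layer uses the default kernel/stride/padding above.
    """
    for _ in range(num_layers):
        num_frames = conv_out_len(num_frames)
    return num_frames


# A 10-second utterance at a 10 ms frame shift has 1000 frames.
print(frontend_out_len(1000))  # 249
```

Under these assumptions the attention score matrix shrinks from 1000 x 1000 to 249 x 249 entries, roughly a 16-fold reduction, since self-attention cost grows quadratically with sequence length.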

Positional Encoding

Since self-attention has no built-in notion of position (permuting the input frames simply permutes the outputs), positional information must be injected explicitly:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

x_pos = x_sub + PE

where pos is the position index and i indexes the sine/cosine dimension pairs (0 <= i < d_model/2), so dimensions 2i and 2i+1 share one frequency.
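The two formulas above can be sketched in plain Python as follows. This is a minimal illustration using nested lists; the function names are hypothetical, and real implementations typically precompute the table as a tensor.

```python
import math


def sinusoidal_pe(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)
            row[2 * i + 1] = math.cos(angle)
        table.append(row)
    return table


def add_pe(x_sub, pe):
    """x_pos = x_sub + PE, applied frame by frame."""
    return [[v + p for v, p in zip(frame, pe_row)]
            for frame, pe_row in zip(x_sub, pe)]
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on any implementation.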

Multi-Head Self-Attention

Each Transformer layer applies multi-head self-attention followed by a feed-forward network:

MultiHead(Q, K, V) = Concat(head_1, ..., head_H) * W_O

where each head_i:
  head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)

Attention(Q, K, V) = Softmax(Q * K^T / sqrt(d_k)) * V

Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension; unscaled, large scores would push the softmax into saturated regions where gradients are very small.
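As an illustration of the Attention formula, here is a single-head sketch in plain Python, with matrices represented as lists of rows. It omits the learned projections W_Q, W_K, W_V, masking, and batching, so it should be read as a sketch of the arithmetic rather than a practical implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Softmax(Q * K^T / sqrt(d_k)) * V for a single head."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q in Q:
        # One scaled score per key position.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query scores every key equally, so the output is the mean of V.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```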

Conformer Block

The Conformer variant modifies the standard Transformer block by sandwiching self-attention between two feed-forward modules and adding a convolution module:

Conformer Block:
  1. x = x + 0.5 * FFN_1(x)              -- first half-step feed-forward
  2. x = x + MultiHeadSelfAttention(x)    -- self-attention module
  3. x = x + ConvModule(x)                -- depthwise separable convolution
  4. x = x + 0.5 * FFN_2(x)              -- second half-step feed-forward
  5. x = LayerNorm(x)

The convolution module captures local acoustic patterns (formant transitions, phoneme boundaries) that self-attention alone may underweight due to its global receptive field.
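The residual structure of the five steps above can be sketched as below. The four sub-modules are passed in as callables (hypothetical placeholders that map a list of frames to a list of frames of the same shape), and the layer norm has no learned gain or bias, so this shows only the wiring, not a trainable block.

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one frame to zero mean and unit variance (no learned scale)."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def conformer_block(frames, ffn_1, self_attention, conv_module, ffn_2):
    """Apply the four residual sub-modules, then a final layer norm."""

    def residual(x, module, weight=1.0):
        y = module(x)
        return [[a + weight * b for a, b in zip(xa, ya)]
                for xa, ya in zip(x, y)]

    x = residual(frames, ffn_1, 0.5)      # 1. half-step feed-forward
    x = residual(x, self_attention)       # 2. self-attention
    x = residual(x, conv_module)          # 3. convolution module
    x = residual(x, ffn_2, 0.5)           # 4. half-step feed-forward
    return [layer_norm(f) for f in x]     # 5. final layer norm
```

Passing modules that return all zeros reduces the block to a plain layer norm of the input, which is a convenient check that the residual paths are wired correctly.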

Joint CTC-Attention Training

Training follows the same joint loss formulation as Seq2Seq ASR:

L = alpha * L_CTC + (1 - alpha) * L_NLL

where:
  L_CTC applies to the encoder output via a linear + log-softmax projection
  L_NLL applies to the decoder output using cross-entropy with label smoothing
  alpha is typically 0.2-0.3 for Transformer models

Label smoothing is commonly applied to the NLL loss to prevent the model from becoming overconfident: it takes a small probability mass away from the target token and distributes it uniformly across all tokens.

Learning Rate Schedule

Transformer ASR training typically uses the Noam learning rate schedule:

lr(step) = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))

This schedule increases the learning rate linearly for the first warmup_steps steps, peaks at step = warmup_steps, and then decays in proportion to step^(-0.5). The warmup phase is critical for stable training of deep Transformer models.
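The schedule formula translates directly into a few lines of Python. The default values d_model = 256 and warmup_steps = 10000 are illustrative assumptions only; actual recipes set their own values and often multiply the result by an extra constant factor.

```python
def noam_lr(step, d_model=256, warmup_steps=10000):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The two arguments of min() meet at step == warmup_steps, where the rate peaks.
for step in (1, 5000, 10000, 40000):
    print(step, noam_lr(step))
```

Under these assumed values the peak rate is 256^(-0.5) * 10000^(-0.5) = 0.000625, reached at step 10000; by step 40000 the rate has halved.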
