Principle:Speechbrain Speechbrain Seq2Seq ASR Training
| Knowledge Sources | |
|---|---|
| Domains | ASR, Sequence_to_Sequence, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Attention-based encoder-decoder automatic speech recognition (ASR) trains a sequence-to-sequence model with a joint CTC and negative log-likelihood (NLL) objective, mapping variable-length acoustic feature sequences to token sequences without requiring explicit frame-level alignments.
Description
Sequence-to-sequence (Seq2Seq) ASR follows the encoder-decoder paradigm originally proposed for machine translation: an encoder compresses the input speech signal into a sequence of hidden representations, and an autoregressive decoder generates output tokens one at a time, attending to the encoder outputs through an attention mechanism. The key challenge in speech is the large length ratio between input frames and output tokens, which makes attention alignment difficult. To address this, modern Seq2Seq ASR systems employ a joint CTC + attention objective: the CTC loss provides a monotonic alignment prior that regularizes the attention mechanism, while the attention-based negative log-likelihood (NLL) loss enables the decoder to model rich output dependencies. Encoder architectures range from CRDNN (convolutional-recurrent-DNN) to pretrained self-supervised models such as wav2vec 2.0, paired with GRU or LSTM decoders.
Usage
Use this principle when building an end-to-end ASR system that needs to generate token sequences autoregressively, particularly when language model integration via shallow fusion is desired at inference time. This approach is preferred over pure CTC when output token dependencies matter (e.g., word-piece or character models where the decoder can learn spelling patterns) and over Transducer models when streaming is not a requirement.
Theoretical Basis
Encoder-Decoder Architecture
The encoder transforms input features into a high-level representation:
h = Encoder(x)
where:
x = (x_1, ..., x_T) -- input acoustic frames (MFCC, filterbank, or wav2vec2 embeddings)
h = (h_1, ..., h_T') -- encoder output sequence (T' <= T after subsampling)
The decoder generates tokens autoregressively using attention:
For each output step u = 1, ..., U:
    c_u = Attention(s_{u-1}, h) -- context vector from attending to encoder outputs
    s_u = DecoderRNN(s_{u-1}, y_{u-1}, c_u) -- decoder hidden state update
    p(y_u | y_{<u}, x) = Softmax(Linear(s_u)) -- output distribution over vocabulary
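The context-vector step of the loop above can be sketched in plain Python. This is only an illustration: the text does not say which attention scoring function is used, so dot-product scoring is an assumption here, and the names `softmax` and `attention_context` are hypothetical helpers rather than SpeechBrain APIs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(prev_state, encoder_outputs):
    """Compute c_u from s_{u-1} and h (assumed dot-product attention).

    prev_state:      decoder state s_{u-1}, a list of floats
    encoder_outputs: h = (h_1, ..., h_T'), a list of equally sized lists
    Returns (context, weights), where weights sum to 1 over the T' frames.
    """
    # One scalar score per encoder frame.
    scores = [sum(s * v for s, v in zip(prev_state, h_t)) for h_t in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # The context vector is the attention-weighted sum of encoder frames.
    context = [
        sum(w * h_t[d] for w, h_t in zip(weights, encoder_outputs))
        for d in range(dim)
    ]
    return context, weights

# Toy example: three encoder frames of dimension 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [2.0, 0.0]
c_u, a_u = attention_context(s_prev, h)
```

In this toy input the first and third frames receive equal scores, so they share most of the attention weight and the second frame contributes little to `c_u`.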
Joint CTC-Attention Loss
The total training objective combines CTC and attention-based NLL losses:
L = alpha * L_CTC + (1 - alpha) * L_NLL
where:
L_CTC = -log p_CTC(y | x) -- negative log of the probability summed over all valid alignments of y to the encoder frames
L_NLL = -sum_{u=1}^{U} log p(y_u | y_{<u}, x) -- cross-entropy at each decoder step
alpha in [0, 1] -- interpolation weight (typically 0.2-0.5)
The CTC branch operates on the encoder output directly through a linear projection and log-softmax, while the NLL branch operates through the full encoder-decoder pathway.
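As a minimal sketch of how the two terms combine, the following pure-Python code computes L_CTC with the standard CTC forward recursion, L_NLL as a sum of per-step cross-entropies, and their interpolation. The function names, the blank index of 0, and the default weight of 0.3 are assumptions made for illustration; a real recipe would use batched tensor losses instead.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """L_CTC via the forward recursion.

    log_probs: T' x V per-frame log-probabilities from the CTC branch
    target:    list of token ids (without blanks)
    """
    # Interleave blanks: y_1 y_2 -> blank y_1 blank y_2 blank
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)
    fwd = [-math.inf] * size
    fwd[0] = log_probs[0][ext[0]]
    if size > 1:
        fwd[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            cands = [fwd[s]]
            if s >= 1:
                cands.append(fwd[s - 1])
            # Skipping a blank is allowed only between different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(fwd[s - 2])
            new[s] = logsumexp(cands) + log_probs[t][ext[s]]
        fwd = new
    tails = [fwd[-1]] + ([fwd[-2]] if size > 1 else [])
    return -logsumexp(tails)

def attention_nll(step_log_probs, target):
    """L_NLL: sum over decoder steps of -log p(y_u | y_{<u}, x)."""
    return -sum(lp[tok] for lp, tok in zip(step_log_probs, target))

def joint_loss(l_ctc, l_nll, ctc_weight=0.3):
    """L = alpha * L_CTC + (1 - alpha) * L_NLL, with alpha = ctc_weight."""
    return ctc_weight * l_ctc + (1 - ctc_weight) * l_nll

# Toy example: 2 encoder frames, vocabulary {0: blank, 1: token}, target [1].
# With uniform frame posteriors the valid alignments are (1,1), (1,blank),
# (blank,1), each with probability 0.25, so p_CTC(y | x) = 0.75.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
l_ctc = ctc_neg_log_likelihood(uniform, [1])
l_nll = attention_nll([[math.log(0.1), math.log(0.9)]], [1])
loss = joint_loss(l_ctc, l_nll, ctc_weight=0.3)
```

Setting `ctc_weight` to 0 recovers pure attention training, and setting it to 1 recovers pure CTC training.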
Teacher Forcing and Scheduled Sampling
During training, the decoder is typically trained with teacher forcing: the ground-truth previous token y_{u-1} is fed as input at each step rather than the model's own prediction. Some recipes employ scheduled sampling, where the model's own predictions replace the ground truth with increasing probability as training progresses, reducing the mismatch between training and inference (often called exposure bias).
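The choice between the two input sources can be sketched as follows. The linear ramp, the cap of 0.5, and both function names are hypothetical and chosen for illustration only; the text does not specify which schedule any given recipe uses.

```python
import random

def sampling_probability(epoch, total_epochs, max_prob=0.5):
    """Probability of feeding the model's own prediction at this epoch.

    Assumed schedule: a linear ramp from 0 at epoch 0 to max_prob at the
    final epoch.
    """
    if total_epochs <= 1:
        return max_prob
    return max_prob * min(1.0, epoch / (total_epochs - 1))

def choose_decoder_input(gold_prev, predicted_prev, sample_prob, rng):
    """Pick the token fed to the decoder as y_{u-1}.

    sample_prob = 0 is pure teacher forcing (always the ground truth);
    larger values mix in the model's own previous prediction.
    """
    return predicted_prev if rng.random() < sample_prob else gold_prev

rng = random.Random(0)  # seeded so the sketch is reproducible
p_start = sampling_probability(0, 10)
p_end = sampling_probability(9, 10)
teacher_forced = choose_decoder_input("gold", "pred", 0.0, rng)
```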
Beam Search Decoding
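At inference, beam search keeps the K highest-scoring partial hypotheses at each decoding step, ranking them by a weighted combination of attention, CTC, and language model log-probabilities:
Score(y) = (1 - lambda) * log p_attn(y | x) + lambda * log p_CTC(y | x) + beta * log p_LM(y)
where:
p_attn -- attention decoder probability, i.e. the product over u of p(y_u | y_{<u}, x)
lambda -- CTC weight during decoding, in [0, 1]
beta -- language model weight for shallow fusion
p_LM -- external language model probability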
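The scoring rule and the top-K pruning step can be sketched as below. This is a simplified illustration that assumes the three log-probabilities of each candidate hypothesis are already available; it does not implement CTC prefix scoring or the decoder itself. The names `combined_score` and `prune_beam` and the default weights are hypothetical.

```python
import heapq

def combined_score(log_p_attn, log_p_ctc, log_p_lm, ctc_weight=0.3, lm_weight=0.5):
    """Score = (1 - lambda) * log p_attn + lambda * log p_CTC + beta * log p_LM."""
    return (1 - ctc_weight) * log_p_attn + ctc_weight * log_p_ctc + lm_weight * log_p_lm

def prune_beam(candidates, beam_size, ctc_weight=0.3, lm_weight=0.5):
    """Keep the beam_size best candidates, best first.

    candidates: list of (tokens, log_p_attn, log_p_ctc, log_p_lm) tuples
    """
    return heapq.nlargest(
        beam_size,
        candidates,
        key=lambda c: combined_score(c[1], c[2], c[3], ctc_weight, lm_weight),
    )

# Toy candidates: the attention decoder alone prefers ("a", "c"),
# but CTC and the language model both penalize it.
candidates = [
    (("a", "b"), -1.0, -2.0, -1.5),
    (("a", "c"), -0.8, -4.0, -3.0),
    (("b", "a"), -2.5, -1.0, -1.0),
]
kept = prune_beam(candidates, beam_size=2)
```

With these weights, the hypothesis that the attention decoder ranks first is pruned, which illustrates how the CTC and language model terms can overrule the attention score.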