Principle:Speechbrain Speechbrain Seq2Seq ASR Training

Knowledge Sources	Chan et al. 2016 "Listen, Attend and Spell"; Kim et al. 2017 "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning"; SpeechBrain
Domains	ASR, Sequence_to_Sequence, Deep_Learning
Last Updated	2026-02-09 00:00 GMT

Overview

Attention-based encoder-decoder automatic speech recognition (ASR) trains a sequence-to-sequence model with a joint CTC and negative log-likelihood (NLL) objective, mapping variable-length acoustic feature sequences to token sequences without requiring an explicit frame-level alignment.

Description

Sequence-to-sequence (Seq2Seq) ASR follows the encoder-decoder paradigm originally proposed for machine translation: an encoder maps the input speech features to a sequence of hidden representations, and an autoregressive decoder generates output tokens one at a time, attending to the encoder outputs through an attention mechanism. The key challenge in speech is that input frames far outnumber output tokens, which makes the attention alignment hard to learn. To address this, modern Seq2Seq ASR systems employ a joint CTC + attention objective: the CTC loss provides a monotonic alignment prior that regularizes the attention mechanism, while the attention-based negative log-likelihood (NLL) loss enables the decoder to model rich output dependencies. Encoder architectures range from CRDNN (convolutional-recurrent-DNN) to pretrained self-supervised models such as wav2vec 2.0, paired with GRU or LSTM decoders.

Usage

Use this principle when building an end-to-end ASR system that needs to generate token sequences autoregressively, particularly when language model integration via shallow fusion is desired at inference time. This approach is preferred over pure CTC when output token dependencies matter (e.g., word-piece or character models where the decoder can learn spelling patterns) and over Transducer models when streaming is not a requirement.

Theoretical Basis

Encoder-Decoder Architecture

The encoder transforms input features into a high-level representation:

h = Encoder(x)

where:
  x = (x_1, ..., x_T)  -- input acoustic frames (MFCC, filterbank, or wav2vec2 embeddings)
  h = (h_1, ..., h_T')  -- encoder output sequence (T' <= T after subsampling)

The decoder generates tokens autoregressively using attention:

For each output step u = 1, ..., U:
  c_u   = Attention(s_{u-1}, h)         -- context vector from attending to encoder outputs
  s_u   = DecoderRNN(s_{u-1}, y_{u-1}, c_u)  -- decoder hidden state update
  p(y_u | y_{<u}, x) = Softmax(Linear(s_u))  -- output distribution over vocabulary
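
The three decoder equations above can be traced by hand on a toy example. The following is a minimal sketch in plain Python, not SpeechBrain code. It assumes dot-product attention scoring and a tanh recurrence with identity weights in place of a GRU or LSTM cell, since the text does not fix either choice. The names attention, decoder_step, embed, and w_out are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(s_prev, h):
    """c_u = Attention(s_{u-1}, h), assuming dot-product scoring."""
    scores = [sum(a * b for a, b in zip(s_prev, h_t)) for h_t in h]
    weights = softmax(scores)
    dim = len(h[0])
    context = [sum(w * h_t[d] for w, h_t in zip(weights, h)) for d in range(dim)]
    return weights, context

def decoder_step(s_prev, y_prev, h, embed, w_out):
    """One output step u: context vector, state update, output distribution."""
    weights, c_u = attention(s_prev, h)
    # Toy recurrence standing in for DecoderRNN (assumption: no learned gates).
    s_u = [math.tanh(s + e + c) for s, e, c in zip(s_prev, embed[y_prev], c_u)]
    # Softmax(Linear(s_u)) with a bias-free linear layer.
    logits = [sum(w * s for w, s in zip(row, s_u)) for row in w_out]
    return s_u, softmax(logits), weights

h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # T' = 3 encoder outputs, dim 2
embed = [[0.1, -0.1], [0.3, 0.2], [-0.2, 0.4]]  # embeddings for a 3-token vocabulary
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # output projection, one row per token
s_u, probs, weights = decoder_step([0.0, 0.0], 0, h, embed, w_out)
```

With a zero initial state every attention score is zero, so the first step attends uniformly over the encoder outputs. Later steps sharpen as s_u moves away from zero.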

Joint CTC-Attention Loss

The total training objective combines CTC and attention-based NLL losses:

L = alpha * L_CTC + (1 - alpha) * L_NLL

where:
  L_CTC = -log p_CTC(y | x)          -- CTC loss over all valid alignments
  L_NLL = -sum_{u=1}^{U} log p(y_u | y_{<u}, x)  -- cross-entropy at each decoder step
  alpha in [0, 1]                     -- interpolation weight (typically 0.2-0.5)

The CTC branch operates on the encoder output directly through a linear projection and log-softmax, while the NLL branch operates through the full encoder-decoder pathway.
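
To make the objective concrete, the sketch below computes both terms on a tiny example in plain Python. It is an illustration under stated assumptions, not the SpeechBrain implementation. The CTC term uses the standard forward algorithm over the blank-extended target. The function names and the parameter ctc_weight (playing the role of alpha) are hypothetical.

```python
import math

def logsumexp(vals):
    """log(sum(exp(v))) that tolerates -inf entries."""
    m = max(vals)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in vals))

def ctc_nll(log_probs, target, blank=0):
    """L_CTC = -log p_CTC(y | x), summed over all valid alignments.

    log_probs[t][k] is the frame-level log-probability of symbol k at frame t.
    """
    ext = [blank]
    for tok in target:
        ext += [tok, blank]          # blank-extended target: -, y1, -, y2, -, ...
    size = len(ext)
    fwd = [-math.inf] * size
    fwd[0] = log_probs[0][ext[0]]
    if size > 1:
        fwd[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [fwd[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(fwd[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(fwd[s - 2])          # skip the blank between distinct tokens
            new[s] = logsumexp(terms) + log_probs[t][ext[s]]
        fwd = new
    tail = [fwd[-1], fwd[-2]] if size > 1 else [fwd[-1]]
    return -logsumexp(tail)

def attention_nll(step_probs):
    """L_NLL = -sum_u log p(y_u | y_{<u}, x), given the probability of each correct token."""
    return -sum(math.log(p) for p in step_probs)

def joint_loss(l_ctc, l_nll, ctc_weight):
    """L = alpha * L_CTC + (1 - alpha) * L_NLL."""
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_nll

# Two frames, vocabulary {blank, a}, uniform frame posteriors, target "a".
# Valid alignments are "aa", "a-", "-a", so p_CTC = 3 * 0.25 = 0.75.
frames = [[math.log(0.5), math.log(0.5)]] * 2
l_ctc = ctc_nll(frames, [1])
l_nll = attention_nll([0.5, 0.25])   # decoder probabilities of the correct tokens
loss = joint_loss(l_ctc, l_nll, ctc_weight=0.3)
```

Setting ctc_weight to 0 recovers a pure attention model and setting it to 1 recovers pure CTC training, which is why the weight is described as an interpolation.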

Teacher Forcing and Scheduled Sampling

During training, the decoder typically uses teacher forcing: the ground-truth previous token y_{u-1} is fed as input at each step rather than the model's own prediction. Some recipes employ scheduled sampling, in which the model's own predictions are fed back with increasing probability as training progresses, reducing the mismatch between training and inference.
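
The difference between the two regimes comes down to which token is fed back at each step. The sketch below illustrates that choice in plain Python. The linear ramp in sampling_probability, the max_prob ceiling, and the predict_fn callback are assumptions made for illustration. The text only states that the probability increases over training.

```python
import random

def sampling_probability(epoch, num_epochs, max_prob=0.5):
    """Hypothetical linear ramp from 0 (pure teacher forcing) up to max_prob."""
    if num_epochs <= 1:
        return max_prob
    return max_prob * min(epoch, num_epochs - 1) / (num_epochs - 1)

def build_decoder_inputs(targets, predict_fn, sample_prob, rng, bos=0):
    """Choose the decoder input y_{u-1} for every step u.

    predict_fn(inputs_so_far) stands in for the model's prediction of the next token.
    """
    inputs = [bos]                        # step 1 always starts from the BOS token
    for u in range(1, len(targets)):
        truth = targets[u - 1]            # ground-truth previous token
        guess = predict_fn(inputs)        # model's own previous prediction
        use_model = rng.random() < sample_prob
        inputs.append(guess if use_model else truth)
    return inputs

rng = random.Random(0)
targets = [5, 7, 3, 8]
always_nine = lambda prefix: 9            # toy stand-in for a model
teacher_forced = build_decoder_inputs(targets, always_nine, 0.0, rng)
fully_sampled = build_decoder_inputs(targets, always_nine, 1.0, rng)
```

With sample_prob set to 0 the inputs are the shifted ground truth, which is exactly teacher forcing. With sample_prob set to 1 the decoder sees only its own outputs, as it does at inference.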

Beam Search Decoding

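At inference, beam search keeps the top-K partial hypotheses at each decoding step, ranked by a score that combines the attention decoder, the CTC branch, and an optional external language model: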

Score(Y) = (1 - lambda) * log p_attn(Y | X) + lambda * log p_CTC(Y | X) + beta * log p_LM(Y)

where:
  lambda  -- CTC weight during decoding
  beta    -- language model weight for shallow fusion
  p_LM    -- external language model probability
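
A small numeric example shows how the CTC and language model terms can change the ranking. This is a sketch in plain Python with made-up log-probabilities. The function names and the hypothesis dictionaries are hypothetical and do not mirror any SpeechBrain API.

```python
def joint_score(logp_attn, logp_ctc, logp_lm, ctc_weight, lm_weight):
    """Score(Y) = (1 - lambda) * log p_attn + lambda * log p_CTC + beta * log p_LM."""
    return (1.0 - ctc_weight) * logp_attn + ctc_weight * logp_ctc + lm_weight * logp_lm

def prune(hyps, beam_size, ctc_weight, lm_weight):
    """Keep the beam_size best hypotheses under the joint score."""
    ranked = sorted(
        hyps,
        key=lambda hyp: joint_score(hyp["attn"], hyp["ctc"], hyp["lm"], ctc_weight, lm_weight),
        reverse=True,
    )
    return ranked[:beam_size]

# Illustrative log-probabilities for three candidate hypotheses.
hyps = [
    {"text": "A", "attn": -1.0, "ctc": -4.0, "lm": -2.0},
    {"text": "B", "attn": -1.5, "ctc": -1.5, "lm": -1.0},
    {"text": "C", "attn": -3.0, "ctc": -3.0, "lm": -3.0},
]
attention_only = prune(hyps, beam_size=2, ctc_weight=0.0, lm_weight=0.0)
joint = prune(hyps, beam_size=2, ctc_weight=0.4, lm_weight=0.5)
```

Under the attention score alone hypothesis A ranks first. With lambda = 0.4 and beta = 0.5, hypothesis B overtakes it because A scores poorly under the CTC branch and the language model.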
