Principle:Speechbrain Speechbrain Transducer ASR Training
| Knowledge Sources | |
|---|---|
| Domains | ASR, Transducer_Models, Streaming_ASR, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Transducer-based automatic speech recognition models the conditional probability of an output token sequence given an input acoustic sequence through a factorized architecture of encoder, prediction network, and joint network, enabling both streaming and offline recognition.
Description
The RNN-Transducer (RNN-T) and its modern variant, the Conformer-Transducer, extend the CTC framework by adding a prediction network (analogous to a language model) that conditions each output token on previously emitted tokens, not just on the acoustic input. Unlike attention-based encoder-decoder models, the Transducer does not use a global attention mechanism over the entire encoder output. Instead, it combines encoder and prediction network outputs through a joint network at each (time, label) position in a two-dimensional lattice. This factored structure lends itself to streaming inference: provided the encoder itself is causal or restricted to a bounded chunk of context, audio can be processed incrementally, because nothing downstream of the encoder needs to attend to future frames. The transducer loss, computed via a forward-backward algorithm over the lattice, marginalizes over all valid monotonic alignments between input frames and output tokens.
Usage
Use this principle when building ASR systems that require streaming (real-time, low-latency) decoding, or when a monotonic alignment constraint between audio and text is desirable. Transducer models are widely used for on-device and production streaming ASR because they support chunk-wise or frame-by-frame processing. Choose this over attention-based Seq2Seq when latency constraints exist, and over pure CTC when dependencies between output tokens need to be modeled.
Theoretical Basis
Three-Component Architecture
The Transducer decomposes the sequence-to-sequence mapping into three components:
1. Encoder Network: h_enc = Encoder(x_1, ..., x_T)
- Maps acoustic input to encoder representations
- Architecture: CRDNN, LSTM, Conformer, or wav2vec2-based
2. Prediction Network: h_pred = PredictionNet(y_0, y_1, ..., y_{u-1})
- Models output label history (analogous to a language model)
- Architecture: typically LSTM or embedding + LSTM layers
- Operates autoregressively on previously emitted non-blank tokens; y_0 is a start-of-sequence symbol (commonly the blank index)
3. Joint Network: z(t,u) = JointNet(h_enc_t, h_pred_u)
- Combines encoder and prediction network outputs at each lattice position
- Produces logits over vocabulary + blank token
- Typically: z(t,u) = Linear(tanh(Linear(h_enc_t) + Linear(h_pred_u)))
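The joint network formula above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SpeechBrain's implementation: the helper names (`linear`, `random_matrix`, `joint_logits`), the dimensions, and the random weights are all hypothetical, and bias terms are omitted for brevity.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Hypothetical weight initialisation, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def joint_logits(h_enc, h_pred, w_enc, w_pred, w_out):
    """Return z[t][u] = w_out @ tanh(w_enc @ h_enc[t] + w_pred @ h_pred[u])."""
    enc_proj = [linear(w_enc, h) for h in h_enc]     # T vectors of size joint_dim
    pred_proj = [linear(w_pred, h) for h in h_pred]  # U+1 vectors of size joint_dim
    lattice = []
    for e in enc_proj:
        row = []
        for p in pred_proj:
            hidden = [math.tanh(a + b) for a, b in zip(e, p)]
            row.append(linear(w_out, hidden))        # logits over vocab + blank
        lattice.append(row)
    return lattice

rng = random.Random(0)
T, U = 4, 2                                   # encoder frames, output tokens
enc_dim, pred_dim, joint_dim, vocab = 3, 5, 6, 7  # vocab size includes blank
h_enc = [[rng.gauss(0, 1) for _ in range(enc_dim)] for _ in range(T)]
h_pred = [[rng.gauss(0, 1) for _ in range(pred_dim)] for _ in range(U + 1)]
w_enc = random_matrix(joint_dim, enc_dim, rng)
w_pred = random_matrix(joint_dim, pred_dim, rng)
w_out = random_matrix(vocab, joint_dim, rng)

z = joint_logits(h_enc, h_pred, w_enc, w_pred, w_out)
print(len(z), len(z[0]), len(z[0][0]))  # 4 3 7
```

The output has one logit vector per (frame, label-history) pair, which is why the memory cost of the joint network grows with T * (U+1) * vocabulary size.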
Transducer Lattice and Alignment
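The Transducer defines a lattice of (T+1) x (U+1) nodes, where T is the number of encoder frames and U is the number of output tokens. Node (t, u) means that t frames have been consumed and u tokens have been emitted. Each path through this lattice from (0,0) to (T,U) represents a valid alignment, with the restriction that no token can be emitted once all T frames are consumed, so every path ends with a blank: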
At lattice position (t, u):
- Emitting blank: move right to (t+1, u) -- consume an input frame
- Emitting token y_{u+1}: move up to (t, u+1) -- produce the next output token
The probability of emitting symbol k at position (t,u):
p(k | t, u) = Softmax(z(t, u))_k
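To make the two moves concrete, the sketch below enumerates every valid alignment for a tiny example. It assumes frames are indexed 0..T-1 and that each path ends with a blank on the last frame; the function name `enumerate_alignments` and the `_` notation for blank are illustrative choices, not part of any library.

```python
from math import comb

def enumerate_alignments(T, labels):
    """Yield every valid alignment as a list of (t, u, symbol) steps.

    t indexes frames 0..T-1, u counts labels emitted so far, and None
    stands for blank. Only blank advances t, so each path ends in a blank.
    """
    U = len(labels)

    def walk(t, u, path):
        if t == T:                      # all frames consumed
            if u == U:                  # keep only paths that emitted every label
                yield list(path)
            return
        # Blank: consume frame t, keep u.
        path.append((t, u, None))
        yield from walk(t + 1, u, path)
        path.pop()
        # Label: emit labels[u] while staying on frame t.
        if u < U:
            path.append((t, u, labels[u]))
            yield from walk(t, u + 1, path)
            path.pop()

    yield from walk(0, 0, [])

T, labels = 3, ["a", "b"]
paths = list(enumerate_alignments(T, labels))
for path in paths:
    print(" ".join("_" if sym is None else sym for _, _, sym in path))
print(len(paths) == comb(T - 1 + len(labels), len(labels)))  # True
```

The number of alignments grows combinatorially with T and U, which is why training sums over them with dynamic programming rather than by enumeration.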
Transducer Loss
The transducer loss marginalizes over all valid paths through the lattice using a forward-backward algorithm:
p(y | x) = sum over all valid paths pi of: product of p(pi_step | t, u)
L_transducer = -log p(y | x)
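The forward variable alpha(t, u) is the total probability of reaching (t, u), i.e. of having emitted y_1..y_u while consuming t frames:
alpha(t, u) = alpha(t-1, u) * p(blank | t-1, u)
+ alpha(t, u-1) * p(y_u | t, u-1)
Base case: alpha(0, 0) = 1; terms with a negative index are zero
Final column: at t = T no frame is left to emit from, so only the blank term applies
Answer: p(y | x) = alpha(T, U) = alpha(T-1, U) * p(blank | T-1, U)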
This computation runs in O(T * U) time. It is analogous to the CTC forward-backward algorithm, except that the output distribution at each lattice node depends on both the frame index t and the label position u.
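The forward recursion can be checked on a toy lattice. The sketch below computes the loss in log space and compares it with an explicit sum over all alignments. It is a minimal illustration, assuming frames indexed 0..T-1, blank at index 0, and random logits in place of a real joint network; the function names are hypothetical and this is not SpeechBrain's loss implementation.

```python
import math
import random

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def logaddexp(a, b):
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def transducer_loss(log_probs, labels, blank=0):
    """log_probs[t][u][k] = log p(k | t, u) for t in 0..T-1 and u in 0..U."""
    T, U = len(log_probs), len(labels)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0  # log(1)
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # reached by a blank from (t-1, u)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # reached by emitting labels[u-1] from (t, u-1)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]])
    # The final blank consumes the last frame.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])

def brute_force_loss(log_probs, labels, blank=0):
    """Sum the probability of every alignment by plain recursion (exponential cost)."""
    T, U = len(log_probs), len(labels)

    def total(t, u):
        if t == T:
            return 1.0 if u == U else 0.0
        p = math.exp(log_probs[t][u][blank]) * total(t + 1, u)
        if u < U:
            p += math.exp(log_probs[t][u][labels[u]]) * total(t, u + 1)
        return p

    return -math.log(total(0, 0))

rng = random.Random(0)
T, vocab, labels = 4, 5, [1, 3, 2]
log_probs = [[log_softmax([rng.gauss(0, 1) for _ in range(vocab)])
              for _ in range(len(labels) + 1)] for _ in range(T)]
fast = transducer_loss(log_probs, labels)
slow = brute_force_loss(log_probs, labels)
print(abs(fast - slow) < 1e-9)  # True
```

Working in log space avoids underflow when many small probabilities are multiplied along long paths.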
Greedy and Beam Search Decoding
At inference, the Transducer supports multiple decoding strategies:
Greedy decoding:
u = 0
For each encoder frame t, in order:
    Repeat:
        k = argmax_k p(k | t, u)
        If k is blank: stop repeating
        Emit k, advance u
    Advance t (next frame)
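The greedy procedure can be sketched as a short Python function. Everything here is illustrative: `logits_fn` is a hypothetical callback standing in for the prediction and joint networks, `SCRIPT` and `toy_logits` form a hand-written toy model, and the `max_symbols_per_frame` cap is an added assumption (a common practical safeguard against emitting forever on one frame) that does not appear in the pseudocode above.

```python
def greedy_decode(num_frames, logits_fn, blank=0, max_symbols_per_frame=5):
    """Greedy transducer decoding.

    logits_fn(t, history) returns joint-network logits for frame t given the
    tokens emitted so far (hypothetical interface for this sketch).
    """
    hyp = []
    for t in range(num_frames):
        for _ in range(max_symbols_per_frame):
            logits = logits_fn(t, hyp)
            k = max(range(len(logits)), key=logits.__getitem__)  # argmax
            if k == blank:
                break          # blank: move on to the next frame
            hyp.append(k)      # non-blank: emit and stay on the same frame
    return hyp

# Toy "model": the tokens it wants to emit on each frame (0 is blank).
SCRIPT = {0: [2], 1: [], 2: [1, 3]}

def toy_logits(t, history):
    emitted_before = sum(len(SCRIPT[f]) for f in range(t))
    pos = len(history) - emitted_before          # tokens already emitted on frame t
    target = SCRIPT[t][pos] if pos < len(SCRIPT[t]) else 0
    logits = [0.0] * 4
    logits[target] = 5.0
    return logits

decoded = greedy_decode(3, toy_logits)
print(decoded)  # [2, 1, 3]
```

Note that frame 2 emits two tokens before a blank is predicted, which is the behavior that distinguishes Transducer decoding from CTC's one-symbol-per-frame greedy decoding.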
Beam search:
    Maintain top-K hypotheses scored by:
    Score(Y) = log p_transducer(Y | X) + beta * log p_LM(Y)
    where beta weights an optional external language model (beta = 0 disables it)
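The scoring and pruning step can be illustrated in isolation. The sketch below is not a full transducer beam search: it only applies the score formula to a hand-made list of hypotheses and keeps the top K. The function name `fuse_and_prune` and the example log-probabilities are hypothetical.

```python
import heapq

def fuse_and_prune(hypotheses, beta, beam_size):
    """Score hypotheses with LM fusion and keep the best `beam_size`.

    hypotheses: list of (tokens, log_p_transducer, log_p_lm) tuples.
    Returns a list of (score, tokens), best first.
    """
    scored = [(lp_t + beta * lp_lm, tokens) for tokens, lp_t, lp_lm in hypotheses]
    return heapq.nlargest(beam_size, scored, key=lambda item: item[0])

# Made-up numbers, for illustration only.
hyps = [
    (("the", "cat"), -2.0, -3.0),
    (("the", "cap"), -1.8, -6.0),
    (("a", "cat"), -2.5, -4.0),
]

best_no_lm = fuse_and_prune(hyps, beta=0.0, beam_size=1)
best_with_lm = fuse_and_prune(hyps, beta=0.5, beam_size=2)
print(best_no_lm[0][1])    # ('the', 'cap')
print(best_with_lm[0][1])  # ('the', 'cat')
```

With beta = 0 the acoustically preferred hypothesis wins; with beta = 0.5 the language model term is enough to reorder the beam in this toy case.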