Principle:Speechbrain Speechbrain Transducer ASR Training

Knowledge Sources	Graves 2012, "Sequence Transduction with Recurrent Neural Networks"; He et al. 2019, "Streaming End-to-End Speech Recognition for Mobile Devices"; Gulati et al. 2020, "Conformer: Convolution-augmented Transformer for Speech Recognition"; SpeechBrain
Domains	ASR, Transducer_Models, Streaming_ASR, Deep_Learning
Last Updated	2026-02-09 00:00 GMT

Overview

Transducer-based automatic speech recognition models the conditional probability of an output token sequence given an input acoustic sequence through a factorized architecture of encoder, prediction network, and joint network, enabling both streaming and offline recognition.

Description

The RNN-Transducer (RNN-T) and its modern variant, the Conformer-Transducer, extend the CTC framework by adding a prediction network (analogous to a language model) that conditions each output token on previously emitted tokens, not just on the acoustic input. Unlike attention-based encoder-decoder models, the Transducer does not use a global attention mechanism over the entire encoder output. Instead, a joint network combines the encoder and prediction network outputs at each (time, label) position of a two-dimensional lattice. This factored structure lends itself to streaming inference: emission decisions are made frame by frame, so audio can be processed incrementally as long as the encoder itself is causal or limited to a bounded look-ahead (for example, chunked attention). The transducer loss, computed via a forward-backward algorithm over the lattice, marginalizes over all valid monotonic alignments between input frames and output tokens.

Usage

Use this principle when building ASR systems that require streaming (real-time, low-latency) decoding, or when a monotonic alignment constraint between audio and text is desirable. Transducer models are widely used for on-device and production ASR systems because they support chunk-wise or frame-by-frame processing. Choose this over attention-based Seq2Seq when latency constraints exist, and over pure CTC when dependencies between output tokens need to be modeled.

Theoretical Basis

Three-Component Architecture

The Transducer decomposes the sequence-to-sequence mapping into three components:

1. Encoder Network:    h_enc = Encoder(x_1, ..., x_T)
   - Maps acoustic input to encoder representations
   - Architecture: CRDNN, LSTM, Conformer, or wav2vec2-based

2. Prediction Network: h_pred_u = PredictionNet(y_0, y_1, ..., y_u)
   - Models output label history (analogous to a language model)
   - y_0 is a start-of-sequence symbol (often the blank), so h_pred_0 corresponds to the empty history
   - Architecture: typically LSTM or embedding + LSTM layers
   - Operates autoregressively on previously emitted non-blank tokens

3. Joint Network:      z(t,u) = JointNet(h_enc_t, h_pred_u)
   - Combines encoder and prediction network outputs at each lattice position
   - Produces logits over vocabulary + blank token
   - Typically: z(t,u) = Linear(tanh(Linear(h_enc_t) + Linear(h_pred_u)))
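
The joint network formula above can be sketched in plain Python. This is a minimal illustration, not SpeechBrain's implementation: the helper names (`linear`, `init_linear`, `joint_network`), the toy dimensions, and the random initialisation are all assumptions made for the example. It shows that the joint network yields one logit vector per lattice position, i.e. a T x (U+1) x (vocabulary + blank) array.

```python
import math
import random

def linear(weights, bias, vec):
    """Affine map y = W v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_linear(out_dim, in_dim, rng):
    """Small random weights and zero bias (toy initialisation, an assumption)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def joint_network(h_enc, h_pred, enc_proj, pred_proj, out_proj):
    """Return z[t][u] = Linear(tanh(Linear(h_enc[t]) + Linear(h_pred[u])))."""
    enc = [linear(*enc_proj, h) for h in h_enc]     # T vectors of size joint_dim
    pred = [linear(*pred_proj, h) for h in h_pred]  # U+1 vectors of size joint_dim
    return [[linear(*out_proj, [math.tanh(a + b) for a, b in zip(e, p)])
             for p in pred]
            for e in enc]

rng = random.Random(0)
T, U = 4, 2                              # encoder frames, output tokens
enc_dim, pred_dim, joint_dim = 3, 2, 5   # hypothetical feature sizes
vocab_size = 6                           # 5 labels + 1 blank

# Stand-ins for real encoder / prediction network outputs.
h_enc = [[rng.gauss(0, 1) for _ in range(enc_dim)] for _ in range(T)]
h_pred = [[rng.gauss(0, 1) for _ in range(pred_dim)] for _ in range(U + 1)]

z = joint_network(
    h_enc, h_pred,
    init_linear(joint_dim, enc_dim, rng),
    init_linear(joint_dim, pred_dim, rng),
    init_linear(vocab_size, joint_dim, rng),
)
print(len(z), len(z[0]), len(z[0][0]))  # 4 3 6
```

Projecting the encoder and prediction outputs once and then broadcasting the sum keeps the cost of the two input projections at O(T + U); only the tanh and output projection are evaluated at all T * (U+1) positions.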

Transducer Lattice and Alignment

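The Transducer defines a lattice of size (T+1) x (U+1), where T is the number of encoder frames (indexed t = 0, ..., T-1) and U is the number of output tokens. Node (t, u) means that t frames have been consumed and u tokens have been emitted. Each path through this lattice from (0,0) to (T,U) represents a valid alignment. Because a token can only be emitted while a frame is still available (t < T), every path ends with a blank move:

At lattice position (t, u), with t < T:
  - Emitting blank:         move right to (t+1, u)   -- consume an input frame
  - Emitting token y_{u+1}: move up to (t, u+1)      -- produce an output token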

The probability of emitting symbol k at position (t,u):
  p(k | t, u) = Softmax(z(t, u))_k
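
As a hedged illustration of how these per-position probabilities combine along one alignment, the sketch below walks a single path through the lattice and sums the log-probabilities of its moves. The function names, the convention that index 0 is the blank, and the move encoding ('b' for blank, 'e' for emitting the next label) are assumptions for the example; frames are taken to be 0-indexed, so logits exist only for t < T.

```python
import math

BLANK = 0  # assumption: index 0 is the blank symbol

def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def path_log_prob(z, labels, moves):
    """Sum log p(step | t, u) along one alignment path.

    z[t][u] holds the logits for frame t (0..T-1) and history length u (0..U).
    moves is a string of 'b' (blank: t += 1) and 'e' (emit next label: u += 1).
    """
    t = u = 0
    total = 0.0
    for move in moves:
        logp = log_softmax(z[t][u])
        if move == "b":
            total += logp[BLANK]
            t += 1
        else:
            total += logp[labels[u]]
            u += 1
    if t != len(z) or u != len(labels):
        raise ValueError("path must end at (T, U)")
    return total

# Toy lattice: T = 2 frames, U = 1 label, 3 symbols, all logits zero,
# so every move has probability 1/3.
z_uniform = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
print(path_log_prob(z_uniform, [1], "ebb"))  # emit at frame 0, then two blanks
print(path_log_prob(z_uniform, [1], "beb"))  # blank, emit at frame 1, blank
```

Both paths consist of three moves, so in this uniform toy case each has log-probability 3 * log(1/3).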

Transducer Loss

The transducer loss marginalizes over all valid paths through the lattice using a forward-backward algorithm:

p(y | x) = sum over all valid paths pi of: product of p(pi_step | t, u)

L_transducer = -log p(y | x)

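The forward variable alpha(t, u) = probability of reaching (t, u) having emitted y_1..y_u
  alpha(t, u) = alpha(t-1, u) * p(blank | t-1, u)
              + alpha(t, u-1) * p(y_u | t, u-1)

Terms that would need t-1 < 0 or u-1 < 0 are dropped, and the second term is dropped when t = T (no frame is left to emit from).

Base case: alpha(0, 0) = 1
Answer:    p(y | x) = alpha(T, U)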

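Given the joint-network outputs, this recursion runs in O(T * U) time. It is analogous to the CTC forward-backward algorithm, except that the emission probabilities depend on both the frame index t and the label position u. Implementations usually work in the log domain to avoid numerical underflow.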
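
The forward recursion can be sketched in log space and checked against brute-force enumeration of all alignments on a toy lattice. This is an illustrative reference version, not an optimized or SpeechBrain-specific loss: the function names, the blank index 0, and the random toy logits are assumptions. Frames are taken to be 0-indexed (logits exist for t = 0..T-1), so labels can only be emitted while t < T and every path ends with a blank.

```python
import math
import random
from itertools import combinations

BLANK = 0                 # assumption: index 0 is the blank symbol
NEG_INF = float("-inf")

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def logaddexp(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def transducer_nll(z, labels):
    """-log p(y | x) via the forward variable alpha over a (T+1) x (U+1) lattice."""
    T, U = len(z), len(labels)
    logp = [[log_softmax(z[t][u]) for u in range(U + 1)] for t in range(T)]
    alpha = [[NEG_INF] * (U + 1) for _ in range(T + 1)]
    alpha[0][0] = 0.0  # log 1
    for t in range(T + 1):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by a blank emitted at (t-1, u).
            via_blank = alpha[t - 1][u] + logp[t - 1][u][BLANK] if t > 0 else NEG_INF
            # Arrive by emitting y_u at (t, u-1); impossible once all frames are used.
            via_label = (alpha[t][u - 1] + logp[t][u - 1][labels[u - 1]]
                         if u > 0 and t < T else NEG_INF)
            alpha[t][u] = logaddexp(via_blank, via_label)
    return -alpha[T][U]

def brute_force_nll(z, labels):
    """Sum path probabilities explicitly (exponential cost, for checking only)."""
    T, U = len(z), len(labels)
    total = 0.0
    # Choose which of the first T+U-1 steps emit labels; the last step is a blank.
    for emit_steps in combinations(range(T + U - 1), U):
        t = u = 0
        prob = 1.0
        for step in range(T + U):
            logp = log_softmax(z[t][u])
            if step in emit_steps:
                prob *= math.exp(logp[labels[u]])
                u += 1
            else:
                prob *= math.exp(logp[BLANK])
                t += 1
        total += prob
    return -math.log(total)

rng = random.Random(0)
T, U, V = 3, 2, 4
z = [[[rng.gauss(0, 1) for _ in range(V)] for _ in range(U + 1)] for _ in range(T)]
labels = [2, 1]
print(transducer_nll(z, labels), brute_force_nll(z, labels))  # the two values agree
```

The dynamic program visits each lattice node once, whereas the brute-force check enumerates every alignment, which is why it is only practical for very small T and U.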

Greedy and Beam Search Decoding

At inference, the Transducer supports multiple decoding strategies:

Greedy decoding:
  u = 0
  For t = 0 to T-1:
    Repeat (up to a maximum number of symbols per frame):
      k = argmax_k p(k | t, u)
      If k is blank: stop and advance t (next frame)
      Otherwise: emit k, advance u
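
A runnable sketch of greedy transducer decoding follows. The model is replaced by a hypothetical scripted function, `toy_step_logits`, that stands in for "run the prediction network on the history, then the joint network at frame t"; that function, the blank index 0, and the `max_symbols_per_frame` cap are assumptions for the example rather than any library's API.

```python
BLANK = 0  # assumption: index 0 is the blank symbol

def greedy_decode(num_frames, step_logits, max_symbols_per_frame=5):
    """Greedy transducer search.

    step_logits(t, history) must return the joint-network logits for frame t
    given the tuple of tokens emitted so far.
    """
    hyp = []
    for t in range(num_frames):
        # Emit as many non-blank tokens as the model wants at this frame,
        # capped so a degenerate model cannot loop forever.
        for _ in range(max_symbols_per_frame):
            logits = step_logits(t, tuple(hyp))
            k = max(range(len(logits)), key=logits.__getitem__)
            if k == BLANK:
                break          # blank: move on to the next frame
            hyp.append(k)      # non-blank: extend the history, stay on frame t
    return hyp

def toy_step_logits(t, history):
    """Scripted stand-in for a model: emits token 2 at frame 0 and
    tokens 1 then 3 at frame 2, blank everywhere else."""
    schedule = {0: [2], 2: [1, 3]}
    emitted_before = sum(len(v) for f, v in schedule.items() if f < t)
    pending = schedule.get(t, [])
    idx = len(history) - emitted_before
    target = pending[idx] if 0 <= idx < len(pending) else BLANK
    logits = [0.0] * 4
    logits[target] = 1.0
    return logits

print(greedy_decode(4, toy_step_logits))  # [2, 1, 3]
```

Note that the frame index only advances on blank, so several tokens can be emitted from the same frame (tokens 1 and 3 at frame 2 here).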

Beam search:
  Maintain top-K hypotheses scored by:
  Score(Y) = log p_transducer(Y | X) + beta * log p_LM(Y)
  where p_LM is an optional external language model and beta is its weight (beta = 0 disables it)
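
The scoring rule can be illustrated with the pruning step alone. The sketch below is not a full transducer beam search: the hypothesis dictionaries, the field names `am` and `lm`, and the log-probability values are made up for the example. It only shows how the weight beta changes which hypotheses survive in a beam of size K.

```python
import heapq

def fused_score(log_p_transducer, log_p_lm, beta):
    """Score(Y) = log p_transducer(Y | X) + beta * log p_LM(Y)."""
    return log_p_transducer + beta * log_p_lm

def prune(hyps, beam_size, beta):
    """Keep the beam_size best hypotheses under the fused score."""
    return heapq.nlargest(
        beam_size, hyps, key=lambda h: fused_score(h["am"], h["lm"], beta)
    )

# Hypothetical partial hypotheses with made-up log-probabilities.
hyps = [
    {"text": "the cat", "am": -1.0, "lm": -2.0},
    {"text": "the cap", "am": -0.9, "lm": -5.0},
    {"text": "a cat",   "am": -2.5, "lm": -2.5},
]

print([h["text"] for h in prune(hyps, 2, beta=0.0)])  # ['the cap', 'the cat']
print([h["text"] for h in prune(hyps, 2, beta=0.5)])  # ['the cat', 'the cap']
```

With beta = 0 the ranking follows the transducer score alone; with beta = 0.5 the language model score is enough to move "the cat" ahead of "the cap".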
