Principle:Speechbrain Speechbrain Speech Translation Training

Knowledge Sources	SAMU: Semantic Alignment for Multilingual Understanding; SpeechBrain
Domains	Speech_Translation, Multilingual_Processing
Last Updated	2026-02-09 00:00 GMT

Overview

End-to-end speech translation maps speech in a source language directly to text in a target language. A single model encodes acoustic and linguistic information jointly, so no intermediate transcription step is needed.

Description

Speech translation (ST) is the task of converting spoken utterances in one language into written text in another language. Cascaded approaches first transcribe speech to text (ASR) and then translate the text (MT). End-to-end ST models instead learn a direct mapping from source-language audio to target-language text. This avoids error propagation between pipeline stages and gives the model access to prosodic and paralinguistic cues that a transcript discards. The architecture typically combines a self-supervised speech encoder (such as wav2vec2) with a transformer decoder that generates target-language tokens via cross-attention over the encoded speech representations.

Usage

Apply this principle when building systems that translate spoken content across languages, particularly in low-resource settings where parallel speech-text data is scarce and pretrained encoders and decoders must carry most of the load. End-to-end approaches are especially useful when errors compounding across cascade stages are unacceptable or when the source language lacks a robust ASR system.

Theoretical Basis

Architecture: Encoder-Decoder with Cross-Attention

The end-to-end ST model has two main components, a speech encoder and a text decoder, joined by a linear projection:

Source Audio (waveform)
  -> wav2vec2 Encoder (self-supervised, possibly frozen or fine-tuned)
    -> Dimensionality Reduction (linear projection)
      -> Transformer Decoder (cross-attention over encoder output)
        -> Linear + Softmax -> Target token probabilities

The wav2vec2 encoder processes raw waveforms into contextualized speech representations. A linear layer reduces dimensionality before the transformer decoder attends to these representations using cross-attention. The decoder autoregressively generates target-language tokens.
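To make the data flow concrete, the sketch below traces tensor shapes through the pipeline for one utterance. It does not run a model. The convolutional layout (kernel, stride) is that of the standard wav2vec 2.0 feature encoder; the encoder width of 1024, the projection width of 256, the vocabulary size of 8000, and the function names are illustrative assumptions, not values taken from a particular recipe.

```python
# Shape trace for: waveform -> wav2vec2 encoder -> linear projection
# -> transformer decoder -> token probabilities. No model is executed.

# (kernel, stride) of each conv layer in the wav2vec 2.0 feature encoder.
W2V2_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_encoder_frames(num_samples: int) -> int:
    """Number of frames the conv stack emits for num_samples audio samples."""
    length = num_samples
    for kernel, stride in W2V2_CONV_LAYERS:
        # Unpadded 1-D convolution output length.
        length = (length - kernel) // stride + 1
    return length


def trace_shapes(num_samples, target_len,
                 enc_dim=1024, proj_dim=256, vocab_size=8000):
    """Return (stage name, shape) pairs; dimension defaults are assumptions."""
    frames = num_encoder_frames(num_samples)
    return [
        ("waveform", (num_samples,)),
        ("encoder output", (frames, enc_dim)),
        ("projected encoder output", (frames, proj_dim)),
        ("decoder states", (target_len, proj_dim)),
        ("token probabilities", (target_len, vocab_size)),
    ]


if __name__ == "__main__":
    # One second of 16 kHz audio and a 12-token target sentence.
    for name, shape in trace_shapes(num_samples=16000, target_len=12):
        print(f"{name:26s} {shape}")
```

One second of 16 kHz audio yields 49 encoder frames, so the decoder's cross-attention operates over a sequence far shorter than the raw waveform but usually much longer than the target token sequence.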

Training Objective

The model is trained with negative log-likelihood (NLL) loss on the target token sequence:

L_ST = -sum_{t=1}^{T} log p(y_t | y_{<t}, X_src)

where y_t are the target tokens and X_src is the source speech input. Teacher forcing is used during training, where the ground-truth previous tokens are provided as decoder input.
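The objective and the teacher-forcing shift can be illustrated with a few lines of plain Python. This is a minimal sketch, not a training loop: the per-step probability tables stand in for decoder outputs, and the `<s>` start symbol, the function names, and the toy numbers are assumptions made for the example.

```python
import math

BOS = "<s>"  # hypothetical start-of-sequence symbol


def teacher_forcing_inputs(target_tokens):
    """Decoder input at step t is the ground-truth token from step t-1."""
    return [BOS] + list(target_tokens[:-1])


def nll_loss(step_distributions, target_tokens):
    """Compute -sum_t log p(y_t | y_<t, X_src).

    step_distributions[t] maps each candidate token to the probability the
    decoder assigns it at step t, given the teacher-forced inputs.
    """
    assert len(step_distributions) == len(target_tokens)
    return -sum(
        math.log(dist[token])
        for dist, token in zip(step_distributions, target_tokens)
    )


if __name__ == "__main__":
    target = ["guten", "Tag", "</s>"]
    # Toy decoder outputs for the three steps (made-up numbers).
    dists = [
        {"guten": 0.5, "Tag": 0.3, "</s>": 0.2},
        {"guten": 0.25, "Tag": 0.25, "</s>": 0.5},
        {"guten": 0.1, "Tag": 0.1, "</s>": 0.8},
    ]
    print("decoder inputs:", teacher_forcing_inputs(target))
    print("NLL:", nll_loss(dists, target))
```

Because the probabilities of the correct tokens multiply to 0.5 * 0.25 * 0.8 = 0.1, the loss equals -log(0.1), which is log(10).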

SAMU: Semantic Alignment for Multilingual Understanding

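SAMU extends the basic ST approach by aligning speech and text representations in a shared semantic space. The core idea is to train a speech encoder so that its output, pooled into a single utterance-level vector, is close to the sentence embedding that a pretrained multilingual text encoder (such as LaBSE) produces for the transcript:

L_SAMU = MSE(speech_encoder(audio), text_encoder(transcript))

The distance is written here as a mean squared error; a cosine distance between the two embeddings can serve the same purpose. Because the text embedding space is shared across languages, this alignment enables zero-shot cross-lingual transfer.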

By aligning the speech encoder to a multilingual text embedding space, SAMU enables the model to leverage linguistic knowledge from the text encoder without requiring parallel speech-translation pairs for every language pair.
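The alignment loss can be sketched in plain Python as follows. Frame-level speech features have to be reduced to one vector before they can be compared with a sentence embedding; mean pooling is used here purely as an assumption for illustration, and the function names and toy vectors are hypothetical.

```python
def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def alignment_loss(speech_frames, text_embedding):
    """MSE between the pooled speech embedding and the text embedding."""
    return mse(mean_pool(speech_frames), text_embedding)


if __name__ == "__main__":
    # Two frames of 2-dimensional speech features (toy values).
    speech_frames = [[1.0, 2.0], [3.0, 4.0]]
    # Sentence embedding of the transcript from a frozen text encoder (toy values).
    text_embedding = [2.0, 1.0]
    print(alignment_loss(speech_frames, text_embedding))
```

In training, only the speech encoder would receive gradients from this loss; the text encoder is treated as a fixed target.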

Integration with Pretrained Language Models

For improved target-language generation, the transformer decoder can be initialized from a pretrained multilingual sequence-to-sequence model such as mBART. This provides the decoder with strong target-language priors, reducing the amount of parallel ST data needed. In this setup the linear projection maps the encoder outputs to the hidden size the pretrained decoder expects. The full architecture becomes:

wav2vec2 Encoder -> Linear Projection -> mBART Decoder -> Target Text

Decoding

During inference, beam search approximates the most likely target token sequence by keeping a fixed number of partial hypotheses at each step. SentencePiece handles subword segmentation of the target text, and Moses detokenization converts the decoded tokens back into readable target-language text.
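A minimal beam search over a toy next-token model is sketched below. It is not SpeechBrain's searcher: there is no length normalization or batching, and `toy_model`, its probability table, and the `</s>` end symbol are assumptions chosen so that greedy decoding and beam search disagree.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, log-probability) pair found.

    next_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple of tokens generated so far.
    """
    beams = [((), 0.0)]  # unfinished hypotheses
    finished = []        # hypotheses that ended with eos
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp in candidates[:beam_size]:
            (finished if hyp[0][-1] == eos else beams).append(hyp)
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off at max_len
    return max(finished, key=lambda h: h[1])


def toy_model(prefix):
    """Hypothetical decoder: the locally best first token leads to a worse sentence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


if __name__ == "__main__":
    print(beam_search(toy_model, beam_size=1, max_len=5))  # greedy decoding
    print(beam_search(toy_model, beam_size=2, max_len=5))
```

With a beam of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while a beam of 2 keeps "b" alive and finds the sequence "b z" with probability 0.36.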
