Principle:Speechbrain Speechbrain Speech To Unit Translation

Knowledge Sources	Textless Speech-to-Speech Translation on Real Data (Lee et al.); Direct Speech-to-Speech Translation with Discrete Units; SpeechBrain
Domains	Speech_Translation, Speech_Synthesis
Last Updated	2026-02-09 00:00 GMT

Overview

Speech-to-unit translation converts speech in a source language into a sequence of discrete acoustic units in a target language, enabling textless speech-to-speech translation through learned codebook representations.

Description

Speech-to-unit translation (S2UT) is a framework for direct speech-to-speech translation that eliminates the need for text as an intermediate representation. Instead of translating speech to text and then synthesizing speech from text, S2UT maps source-language speech directly to a sequence of discrete acoustic units in the target language. These discrete units are derived by applying k-means clustering to the hidden representations of a self-supervised speech model (such as HuBERT or wav2vec2), creating a finite codebook whose entries capture mainly phonetic content. A unit-based vocoder (such as Unit HiFi-GAN) then converts the predicted unit sequence back into a waveform.

Usage

Apply this principle when building speech-to-speech translation systems that must operate without text transcriptions, particularly for unwritten languages or when preserving acoustic characteristics is important. S2UT is also useful when cascaded ASR-MT-TTS pipelines introduce unacceptable latency or error propagation.

Theoretical Basis

Discrete Unit Extraction

The first step is to learn a discrete codebook from target-language speech:

Target Speech (waveform)
  -> Self-Supervised Model (e.g., HuBERT layer 6)
    -> Hidden Representations h_1, h_2, ..., h_T
      -> K-Means Clustering (K = 100 or 200)
        -> Discrete Unit Sequence: u_1, u_2, ..., u_T
          -> Deduplicate consecutive repeats
            -> Compressed Unit Sequence: c_1, c_2, ..., c_N  (N <= T)

The k-means model is trained on a large corpus of target-language speech. Each hidden representation is assigned to its nearest cluster centroid, producing a discrete unit index. Consecutive duplicate units are removed (deduplication) to create a compressed representation that is more amenable to sequence-to-sequence modeling.
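
The assignment and deduplication steps can be sketched in a few lines. This is a minimal illustration, not the SpeechBrain implementation: the helper names assign_units and deduplicate, the 2-D toy frames, and the 3-entry codebook are all assumptions made for the example. In practice the frames would be high-dimensional hidden states from the self-supervised model and the centroids would come from a trained k-means model.

```python
from itertools import groupby


def assign_units(frames, centroids):
    """Map each frame to the index of its nearest centroid (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def deduplicate(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Toy data (assumption): 2-D "hidden representations" and a 3-entry codebook.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1), (0.0, 0.1)]

units = assign_units(frames, centroids)
compressed = deduplicate(units)
print(units)       # [0, 0, 1, 1, 2, 0]
print(compressed)  # [0, 1, 2, 0]
```

Note that deduplication only merges adjacent repeats: the final 0 survives because it is separated from the earlier run of 0s.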

S2UT Model Architecture

The translation model follows an encoder-decoder architecture:

Source Speech (waveform)
  -> wav2vec2 Encoder (pretrained, fine-tuned)
    -> Linear Dimensionality Reduction
      -> Transformer Decoder (cross-attention over encoder output)
        -> Linear + Softmax -> Unit vocabulary probabilities

The encoder processes source-language waveforms through wav2vec2 to produce contextualized representations. The transformer decoder generates target-language discrete units autoregressively, attending to the source representations via cross-attention.
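
Autoregressive generation can be illustrated with a small decoding loop. The sketch below assumes greedy search (beam search is a common alternative) and uses a hypothetical step_fn callback in place of the transformer decoder; toy_step, the vocabulary size, and the BOS/EOS ids are made up for the example and do not correspond to any SpeechBrain API.

```python
def greedy_decode(step_fn, encoder_out, bos_id, eos_id, max_len=50):
    """Generate units one at a time until EOS is predicted or max_len is reached.

    step_fn(prefix, encoder_out) returns a list of scores over the unit
    vocabulary for the next position. It stands in for the decoder, which
    would attend to encoder_out via cross-attention.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        scores = step_fn(prefix, encoder_out)
        next_unit = max(range(len(scores)), key=scores.__getitem__)
        if next_unit == eos_id:
            break
        prefix.append(next_unit)
    return prefix[1:]  # drop the BOS token


# Hypothetical stand-in decoder: copies encoder_out item by item, then emits EOS.
VOCAB_SIZE, BOS, EOS = 6, 4, 5


def toy_step(prefix, encoder_out):
    pos = len(prefix) - 1
    target = encoder_out[pos] if pos < len(encoder_out) else EOS
    return [1.0 if k == target else 0.0 for k in range(VOCAB_SIZE)]


decoded = greedy_decode(toy_step, encoder_out=[2, 0, 3], bos_id=BOS, eos_id=EOS)
print(decoded)  # [2, 0, 3]
```

Each step conditions on everything generated so far (the prefix) and on the encoder output, which is the dependency structure that the cross-attention decoder implements.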

Training Objective

The model is trained with a standard sequence-to-sequence cross-entropy loss over the discrete unit vocabulary:

L_S2UT = -sum_{n=1}^{N} log p(c_n | c_{<n}, X_src)

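where c_n is the n-th target discrete unit, c_{<n} denotes the units before position n, and X_src is the source speech. Teacher forcing is used during training: the decoder input is the target unit sequence prefixed with a beginning-of-sequence (BOS) token, and the prediction target is the same sequence followed by an end-of-sequence (EOS) token.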
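
As a worked example of the objective, the sketch below builds a teacher-forcing input/target pair and evaluates the summed negative log-likelihood on hand-written probabilities. The function names, the token ids, and the probability rows are assumptions chosen for illustration; a real system would obtain the probabilities from the decoder's softmax output. EOS is treated here as one extra target position.

```python
import math


def teacher_forcing_pair(units, bos_id, eos_id):
    """Decoder input is BOS + units; prediction target is units + EOS."""
    return [bos_id] + units, units + [eos_id]


def s2ut_loss(step_probs, targets):
    """Negative log-likelihood summed over target positions.

    step_probs[n][k] is the model's probability of unit k at step n,
    conditioned on the gold prefix and the source speech.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


BOS, EOS = 3, 4  # hypothetical special-token ids placed after units 0..2
dec_in, dec_tgt = teacher_forcing_pair([2, 0], BOS, EOS)
print(dec_in, dec_tgt)  # [3, 2, 0] [2, 0, 4]

# Made-up softmax outputs, one row per target position (each row sums to 1).
step_probs = [
    [0.1, 0.1, 0.5, 0.1, 0.2],  # gold unit 2 has p = 0.5
    [0.5, 0.2, 0.1, 0.1, 0.1],  # gold unit 0 has p = 0.5
    [0.1, 0.1, 0.1, 0.2, 0.5],  # gold EOS    has p = 0.5
]
loss = s2ut_loss(step_probs, dec_tgt)
print(round(loss, 4))  # 2.0794
```

The result equals 3 * ln 2, since each of the three target positions is assigned probability 0.5.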

Unit-Based Vocoding

At inference time, the predicted discrete unit sequence is converted to a waveform using a unit-conditioned vocoder (Unit HiFi-GAN). The vocoder takes unit indices as input and produces a time-domain waveform:

Predicted Units c_1, ..., c_N
  -> Unit Embedding Layer
    -> HiFi-GAN Generator
      -> Target Speech Waveform

Evaluation

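S2UT quality is evaluated using ASR-BLEU: the generated target-language speech is transcribed by an ASR system, and a BLEU score is computed between the ASR transcript and the reference translation. This provides an automatic measure of translation quality without requiring human evaluation. Because recognition errors also lower the score, ASR-BLEU values depend on the ASR system used and are best compared across systems evaluated with the same ASR model.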
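
The scoring half of ASR-BLEU can be sketched as follows. This is a simplified, unsmoothed corpus BLEU with whitespace tokenization and a single reference per hypothesis, written only to show the mechanics; an established tool such as sacreBLEU would normally be used for reported numbers. The transcripts are hard-coded strings standing in for real ASR output, and the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0 to 100), one reference per hypothesis."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.lower().split(), ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_counts & r_counts).values())  # clipped counts
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Stand-ins for ASR transcripts of the generated speech (assumption).
references = ["the cat sat on the mat"]
perfect = corpus_bleu(["the cat sat on the mat"], references)
one_error = corpus_bleu(["the cat sat on a mat"], references)
print(perfect)  # 100.0
print(one_error < perfect)  # True
```

A single substituted word lowers the score noticeably, regardless of whether the error came from the translation model or from the ASR system.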
