Implementation:Speechbrain Speechbrain Train CVSS S2UT

From Leeroopedia


Knowledge Sources
Domains: Speech_Translation, Training
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for training speech-to-unit translation (S2UT) models on the CVSS dataset.

Description

This recipe defines the S2UT class (a subclass of sb.core.Brain) for training a direct speech-to-speech translation system based on discrete units. The model uses a wav2vec2 encoder to extract features from the source speech, passes them through a dimensionality reduction layer, and then decodes them with a Transformer decoder that predicts discrete unit tokens. The implementation follows the papers "Direct speech-to-speech translation with discrete units" and "Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation."
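To make the data flow described above concrete, the following sketch traces tensor shapes through the three stages (wav2vec2 features, dimensionality reduction, unit prediction). It is a shape calculation only, not the recipe's code. The kernel sizes and strides are those of the standard wav2vec 2.0 convolutional front end, which is assumed here; `ShapeConfig`, `trace_shapes`, and all three dimension values are hypothetical placeholders rather than values taken from the recipe's hparams.

```python
from dataclasses import dataclass

# Kernel sizes and strides of the standard wav2vec 2.0 convolutional front end
# (assumed; the checkpoint used by the recipe may differ).
W2V2_KERNELS = (10, 3, 3, 3, 3, 2, 2)
W2V2_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def wav2vec2_num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(W2V2_KERNELS, W2V2_STRIDES):
        length = (length - kernel) // stride + 1
    return length


@dataclass(frozen=True)
class ShapeConfig:
    # Illustrative placeholders, not values from the recipe.
    encoder_dim: int = 1024   # wav2vec2 feature size
    reduced_dim: int = 512    # size after the dimensionality reduction layer
    num_units: int = 100      # size of the discrete unit vocabulary


def trace_shapes(batch_size, num_samples, target_len, cfg=ShapeConfig()):
    """Return the shape of each intermediate tensor in the S2UT forward pass."""
    frames = wav2vec2_num_frames(num_samples)
    return {
        "src_sig": (batch_size, num_samples),
        "wav2vec2_features": (batch_size, frames, cfg.encoder_dim),
        "reduced_features": (batch_size, frames, cfg.reduced_dim),
        # The decoder emits one distribution over units per target position.
        "unit_log_probs": (batch_size, target_len, cfg.num_units),
    }


if __name__ == "__main__":
    for name, shape in trace_shapes(2, 16000, 30).items():
        print(f"{name}: {shape}")
```

Under these assumptions the encoder shortens the time axis by a factor of roughly 320, so the decoder attends over far fewer positions than there are input samples.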

Usage

Use this recipe to train a speech-to-unit translation model on the CVSS corpus. It requires source audio data (e.g., CommonVoice) and target CVSS data with pre-extracted discrete unit codes. It supports evaluation with ASR-BLEU metrics and optional vocoder-based waveform synthesis via UnitHIFIGAN.

Code Reference

Source Location

Signature

class S2UT(sb.core.Brain):
    def compute_forward(self, batch, stage):
        ...
    def compute_objectives(self, predictions, batch, stage):
        ...
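The two methods in the signature form a contract: the output of compute_forward is passed unchanged as the predictions argument of compute_objectives. The sketch below illustrates that contract with stdlib-only stand-ins. `Stage`, `ToyBrain`, `ToyS2UT`, and `run_batch` are hypothetical names written for this example; they are not SpeechBrain classes, and the method bodies are dummies rather than the recipe's logic.

```python
from enum import Enum, auto


class Stage(Enum):
    """Stand-in for sb.Stage with the three stages used by the recipe."""
    TRAIN = auto()
    VALID = auto()
    TEST = auto()


class ToyBrain:
    """Hypothetical stand-in for sb.core.Brain (not the SpeechBrain class)."""

    def compute_forward(self, batch, stage):
        raise NotImplementedError

    def compute_objectives(self, predictions, batch, stage):
        raise NotImplementedError

    def run_batch(self, batch, stage):
        # The forward output feeds directly into the loss computation.
        predictions = self.compute_forward(batch, stage)
        return self.compute_objectives(predictions, batch, stage)


class ToyS2UT(ToyBrain):
    def compute_forward(self, batch, stage):
        # Dummy "scores": one value per target position.
        scores = [0.0 for _ in batch["code_bos"]]
        # Hypotheses are only produced outside of training in this sketch.
        hyps = list(batch["code_bos"]) if stage != Stage.TRAIN else None
        return scores, hyps

    def compute_objectives(self, predictions, batch, stage):
        scores, _hyps = predictions
        # Dummy loss: the number of scored positions.
        return float(len(scores))


if __name__ == "__main__":
    print(ToyS2UT().run_batch({"code_bos": [0, 5, 7]}, Stage.TRAIN))
```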

Import

The recipe is run as a script rather than imported. From the repository root:

python recipes/CVSS/S2ST/train.py recipes/CVSS/S2ST/hparams/train_fr-en.yaml --src_data_folder=/corpus/CommonVoice/fr --tgt_data_folder=/corpus/CVSS/fr

I/O Contract

Inputs

Name | Type | Required | Description
batch | PaddedBatch | Yes | Batch containing src_sig (source waveforms) and code_bos (target unit codes with BOS)
stage | sb.Stage | Yes | TRAIN, VALID, or TEST
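The code_bos field is the decoder input for teacher forcing: the unit sequence with a BOS token prepended. A minimal sketch of how such a field could be built and padded is shown below. The matching EOS-suffixed target, the index values, and the helper names `make_teacher_forcing_pair` and `pad_batch` are assumptions made for illustration, not the recipe's data pipeline. Lengths are returned as fractions of the longest sequence, mirroring the relative-length convention of SpeechBrain's padded batches.

```python
def make_teacher_forcing_pair(units, bos_index, eos_index):
    """Return (decoder input, decoder target) for one unit sequence."""
    code_bos = [bos_index] + list(units)   # what the decoder reads
    code_eos = list(units) + [eos_index]   # what the decoder should predict
    return code_bos, code_eos


def pad_batch(sequences, pad_index=0):
    """Right-pad sequences to equal length and return relative lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    rel_lens = [len(seq) / max_len for seq in sequences]
    return padded, rel_lens


if __name__ == "__main__":
    # Hypothetical indices: 100 for BOS, 101 for EOS.
    bos, eos = make_teacher_forcing_pair([7, 7, 42], bos_index=100, eos_index=101)
    print(bos, eos)
    print(pad_batch([[100, 7, 7, 42], [100, 3]]))
```

Input and target are the same sequence shifted by one position, so at every step the decoder sees the units up to position t and is scored on the unit at position t + 1.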

Outputs

Name | Type | Description
predictions | tuple | Returned by compute_forward: log-probabilities (log-softmax outputs), optional hypotheses, optional synthesized wavs, optional transcripts
loss | torch.Tensor | Returned by compute_objectives: sequence-level NLL loss on the predicted unit tokens
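A sequence-level NLL loss sums the negative log-probability of each reference unit while ignoring padded positions. The pure-Python sketch below shows the idea on nested lists. It assumes absolute lengths and averaging over the non-padded tokens; the recipe's actual reduction and any label smoothing are configured in its hparams and may differ. `masked_nll` is a hypothetical helper name.

```python
import math


def masked_nll(log_probs, targets, lengths):
    """Length-masked negative log-likelihood.

    log_probs: nested list [batch][time][vocab] of log-probabilities
    targets:   nested list [batch][time] of reference unit indices
    lengths:   absolute number of valid (non-padded) steps per sequence
    """
    total, count = 0.0, 0
    for seq_log_probs, seq_targets, valid_len in zip(log_probs, targets, lengths):
        for t in range(valid_len):
            # Pick the log-probability assigned to the reference unit.
            total -= seq_log_probs[t][seq_targets[t]]
            count += 1
    # Assumption: average over valid tokens.
    return total / count


if __name__ == "__main__":
    uniform = [math.log(0.25)] * 4
    log_probs = [[uniform, uniform, uniform]]   # one sequence, three steps
    targets = [[0, 1, 2]]
    # Only the first two steps are valid; the third is padding.
    print(masked_nll(log_probs, targets, lengths=[2]))
```

With a uniform distribution over four units the per-token loss equals log 4 regardless of the targets, which makes it a convenient sanity check.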

Usage Examples

python train.py hparams/train_fr-en.yaml --src_data_folder=/corpus/CommonVoice/fr --tgt_data_folder=/corpus/CVSS/fr
