Workflow:Speechbrain Speechbrain Text to Speech Training

From Leeroopedia


Knowledge Sources
Domains: TTS, Speech_Synthesis, Deep_Learning
Last Updated: 2026-02-09 19:00 GMT

Overview

End-to-end process for training a multi-speaker text-to-speech system in SpeechBrain, using Tacotron2 conditioned on speaker embeddings and a HiFi-GAN vocoder.

Description

This workflow covers the complete pipeline for building a multi-speaker TTS system on the LibriTTS corpus. It involves two major model components: a spectrogram prediction model (multi-speaker Tacotron2) and a neural vocoder (HiFi-GAN) that converts mel-spectrograms to waveforms. The spectrogram model conditions on precomputed speaker embeddings from an ECAPA-TDNN model, enabling zero-shot multi-speaker synthesis. An optional discrete-unit HiFi-GAN variant uses HuBERT-derived speech units instead of mel-spectrograms for speech-to-speech translation applications. The workflow covers speaker embedding extraction, spectrogram model training, vocoder training, and inference synthesis.

Usage

Execute this workflow when you need to build a multi-speaker speech synthesis system that generates natural-sounding speech from text input. It is appropriate for voice assistants, audiobook generation, accessibility applications, and the synthesis component of speech-to-speech translation systems. The workflow requires a multi-speaker corpus with text transcriptions (LibriTTS in this recipe) and supports both the continuous mel-spectrogram approach and the discrete-unit approach.

Execution Steps

Step 1: LibriTTS Data Preparation

Prepare the LibriTTS corpus by generating JSON manifest files from the dataset structure. The preparation script processes the multi-speaker data, extracting speaker IDs, text transcriptions, and audio paths. Phoneme-level alignment information is optionally included for improved attention convergence during Tacotron2 training.

Key considerations:

  • LibriTTS provides normalized text suitable for TTS training
  • Speaker IDs are extracted from the directory hierarchy
  • JSON manifests include audio path, text, speaker ID, and duration
  • The train-clean-100 and train-clean-360 subsets are typically used
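
The manifest generation described above can be sketched as follows. This is a minimal illustration and not the SpeechBrain preparation script: it assumes the usual LibriTTS layout of `<subset>/<speaker_id>/<chapter_id>/<utterance>.wav` with a sibling `<utterance>.normalized.txt` transcript, and the function names and manifest field names (`wav`, `label`, `spk_id`, `duration`) are hypothetical.

```python
import json
import wave
from pathlib import Path


def build_manifest(subset_dir):
    """Collect one entry per utterance found under <subset>/<speaker>/<chapter>/."""
    manifest = {}
    for wav_path in sorted(Path(subset_dir).glob("*/*/*.wav")):
        # The speaker ID is the first directory level below the subset root.
        spk_id = wav_path.parent.parent.name
        txt_path = wav_path.with_name(wav_path.stem + ".normalized.txt")
        if not txt_path.exists():
            continue  # skip utterances that have no normalized transcript
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        manifest[wav_path.stem] = {
            "wav": str(wav_path),
            "label": txt_path.read_text(encoding="utf-8").strip(),
            "spk_id": spk_id,
            "duration": round(duration, 3),
        }
    return manifest


def save_manifest(manifest, json_path):
    """Write the manifest dictionary to a JSON file."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```

Keying the manifest by utterance ID keeps later lookups (for example, of cached speaker embeddings) simple. The duration field is useful for filtering out very long utterances before training.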

Step 2: Speaker Embedding Precomputation

Extract speaker embeddings for all utterances using a pretrained ECAPA-TDNN model. The embedding extraction script passes each audio file through the speaker recognition model and saves the resulting fixed-dimensional embedding vector. These embeddings condition the Tacotron2 model during training and inference, so the output voice can be changed by swapping the embedding rather than retraining the model.

Key considerations:

  • Speaker embeddings are precomputed once and cached as tensors
  • A pretrained ECAPA-TDNN model is downloaded from HuggingFace
  • Embeddings capture speaker identity characteristics (timbre, pitch, speaking style)
  • For zero-shot synthesis, embeddings from any reference utterance can be used at inference
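
The "precompute once and cache" pattern can be sketched as below. This is an illustration under stated assumptions: `embed_fn` is a hypothetical stand-in for whatever callable wraps the pretrained ECAPA-TDNN encoder, the manifest is assumed to map utterance IDs to entries with a `wav` field, and the cache is a pickle file rather than the tensor files the recipe may actually write.

```python
import pickle
from pathlib import Path


def precompute_embeddings(manifest, embed_fn, cache_path):
    """Return {utt_id: embedding}, calling embed_fn only for uncached utterances."""
    cache_path = Path(cache_path)
    cache = {}
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    missing = [utt_id for utt_id in manifest if utt_id not in cache]
    for utt_id in missing:
        # embed_fn is a hypothetical wrapper: audio path in, embedding vector out.
        cache[utt_id] = embed_fn(manifest[utt_id]["wav"])
    if missing:
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return cache
```

Running the function a second time with the same manifest reads the cache and does not call the encoder again, which is the point of precomputing: the speaker model never has to run inside the Tacotron2 training loop.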

Step 3: Tacotron2 Spectrogram Model Training

Train the multi-speaker Tacotron2 model that predicts mel-spectrograms from text sequences conditioned on speaker embeddings. The Brain subclass implements compute_forward() to encode text, attend to the encoded sequence, and autoregressively generate mel-spectrogram frames. The loss combines mel-spectrogram reconstruction (L1 + MSE), stop token prediction (gate loss), and optional speaker consistency regularization.
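
The loss combination described above can be illustrated with a small pure-Python sketch on flattened values. It is not the SpeechBrain loss module: the function names and the `gate_weight` parameter are assumptions, the optional speaker consistency term is omitted, and a real implementation would operate on padded tensors with masking.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over flattened mel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error over flattened mel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy computed on raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def tacotron2_loss(pred_mel, target_mel, gate_logits, gate_targets, gate_weight=1.0):
    """Mel reconstruction (L1 + MSE) plus weighted stop-token (gate) loss."""
    mel_loss = l1_loss(pred_mel, target_mel) + mse_loss(pred_mel, target_mel)
    gate_loss = bce_with_logits(gate_logits, gate_targets)
    return mel_loss + gate_weight * gate_loss
```

With a perfect mel prediction the mel term vanishes and only the gate term remains, which makes the two contributions easy to inspect separately when debugging training.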

Key considerations:

  • Tacotron2 uses attention to align text encoder outputs with mel-spectrogram frames
  • Teacher forcing is used during training (ground-truth mel frames as decoder input)
  • Gate loss trains a stop token predictor to determine when to stop generation
  • Speaker embeddings are concatenated with the encoder output or decoder input
  • Progress samples are periodically generated and logged for monitoring
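
Teacher forcing and the stop-token targets can be sketched as follows. This assumes the common Tacotron2 convention of an all-zero "go" frame at the first decoder step and a gate target of 1 on the final frame; the function names are hypothetical and padding within a batch is not handled.

```python
def teacher_forcing_inputs(target_frames):
    """Shift ground-truth frames right by one step, starting from a zero 'go' frame."""
    n_mels = len(target_frames[0])
    go_frame = [0.0] * n_mels
    # At step t the decoder sees the true frame t-1, never its own prediction.
    return [go_frame] + [list(frame) for frame in target_frames[:-1]]


def gate_targets(num_frames):
    """Stop-token targets: 0 while speech continues, 1 on the final frame."""
    return [0.0] * (num_frames - 1) + [1.0]
```

For a three-frame target, the decoder inputs are the go frame followed by the first two ground-truth frames, and the gate targets are `[0.0, 0.0, 1.0]`.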

Step 4: HiFi-GAN Vocoder Training

Train the HiFi-GAN vocoder to convert mel-spectrograms into high-quality waveforms. The vocoder is trained separately from the spectrogram model using a GAN framework with two families of discriminators (multi-period and multi-scale). The generator upsamples with transposed convolutions followed by multi-receptive field fusion blocks. Both discriminator families operate on the waveform itself: the multi-period discriminator views it at several fixed periods, and the multi-scale discriminator views it at the original and successively downsampled resolutions.
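
To make the multi-period discriminator concrete, the sketch below shows the reshaping step described in the HiFi-GAN paper: a 1D waveform of length T is padded to a multiple of the period p and viewed as a 2D array of shape (T/p, p), so that a 2D convolution sees samples that are p steps apart in the same column. The function name is hypothetical, plain lists stand in for tensors, and whether the SpeechBrain implementation pads in exactly this way is an assumption.

```python
def reshape_for_period(waveform, period):
    """Reflect-pad a 1D waveform to a multiple of `period` and split it into rows."""
    n_pad = (-len(waveform)) % period
    # Reflect padding mirrors the samples before the last one (the edge is not repeated).
    padded = list(waveform) + list(waveform[-2:-2 - n_pad:-1])
    return [padded[i:i + period] for i in range(0, len(padded), period)]
```

For example, `reshape_for_period([1, 2, 3, 4, 5, 6, 7], 3)` gives `[[1, 2, 3], [4, 5, 6], [7, 6, 5]]`. The paper uses the periods 2, 3, 5, 7 and 11, chosen to be coprime so that the sub-discriminators overlap as little as possible.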

Key considerations:

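  • The vocoder is trained independently of the spectrogram model and can use any audio from which mel-spectrograms are computed, provided the mel-extraction settings match those used for Tacotron2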
  • Training alternates between generator and discriminator updates
  • Loss combines adversarial, feature matching, and mel-spectrogram reconstruction losses
  • A discrete-unit variant accepts HuBERT codes instead of mel-spectrograms for speech-to-speech translation (S2ST) applications
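
The loss combination and the alternating update order can be sketched as follows. The least-squares adversarial objective and the default weights (`lambda_fm=2.0`, `lambda_mel=45.0`) follow the HiFi-GAN paper; the values used in the SpeechBrain recipe are not stated here and may differ. The function names are hypothetical, and `update_discriminator` and `update_generator` stand in for full optimizer steps.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: real scores toward 1, generated scores toward 0."""
    real_term = mean([(r - 1.0) ** 2 for r in real_scores])
    fake_term = mean([f ** 2 for f in fake_scores])
    return real_term + fake_term


def generator_loss(fake_scores, fm_loss, mel_loss, lambda_fm=2.0, lambda_mel=45.0):
    """Adversarial term plus weighted feature matching and mel reconstruction terms."""
    adv = mean([(f - 1.0) ** 2 for f in fake_scores])
    return adv + lambda_fm * fm_loss + lambda_mel * mel_loss


def train_step(batch, update_discriminator, update_generator):
    """One alternating step: discriminators are updated first, then the generator."""
    d_loss = update_discriminator(batch)
    g_loss = update_generator(batch)
    return d_loss, g_loss
```

The large mel weight means the reconstruction term dominates early training, which stabilizes the generator before the adversarial signal becomes informative.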

Step 5: Inference and Speech Synthesis

Generate speech by passing text through the trained pipeline: text encoding through Tacotron2 to produce mel-spectrograms, then mel-spectrogram to waveform conversion through HiFi-GAN. At inference time, the model runs autoregressively without teacher forcing, using its own predictions as input for subsequent frames. Speaker identity is controlled by providing the desired speaker embedding.
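
The autoregressive loop can be sketched as below. Here `decoder_step` is a hypothetical callable that takes the previous frame and returns the next frame together with a stop-token logit; the threshold of 0.5 and the `max_steps` safety limit are assumptions and not values taken from the recipe.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def generate_frames(decoder_step, n_mels, gate_threshold=0.5, max_steps=1000):
    """Generate mel frames one at a time, feeding each prediction back as input."""
    prev_frame = [0.0] * n_mels  # all-zero 'go' frame
    frames = []
    for _ in range(max_steps):
        frame, gate_logit = decoder_step(prev_frame)
        frames.append(frame)
        # Stop as soon as the stop-token probability crosses the threshold.
        if sigmoid(gate_logit) > gate_threshold:
            break
        prev_frame = frame
    return frames
```

The `max_steps` limit matters in practice: if attention fails and the stop token never fires, the loop would otherwise run indefinitely.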

Key considerations:

  • Inference is autoregressive: frames are generated one at a time, so synthesizing an utterance is slower than the teacher-forced forward pass over the same utterance during training
  • The stop token predictor determines when generation terminates
  • Different speaker embeddings produce speech in different voices
  • Post-processing (denoising, volume normalization) can be applied to the output waveform
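
As one example of post-processing, peak volume normalization can be sketched as follows. The function name and the default `target_peak` are assumptions; the recipe may use a different normalization or none at all.

```python
def peak_normalize(waveform, target_peak=0.95):
    """Scale a waveform so that its largest absolute sample equals target_peak."""
    peak = max((abs(sample) for sample in waveform), default=0.0)
    if peak == 0.0:
        return list(waveform)  # silent or empty input is returned unchanged
    scale = target_peak / peak
    return [sample * scale for sample in waveform]
```

Keeping the target peak slightly below 1.0 leaves headroom so that later conversion to 16-bit PCM does not clip.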
