Workflow:Speechbrain Speechbrain Text to Speech Training

Knowledge Sources	SpeechBrain SpeechBrain Docs
Domains	TTS, Speech_Synthesis, Deep_Learning
Last Updated	2026-02-09 19:00 GMT

Overview

End-to-end process for training a multi-speaker text-to-speech system in SpeechBrain, using Tacotron2 conditioned on speaker embeddings and a HiFi-GAN vocoder.

Description

This workflow covers the complete pipeline for building a multi-speaker TTS system on the LibriTTS corpus. It involves two major model components: a spectrogram prediction model (multi-speaker Tacotron2) and a neural vocoder (HiFi-GAN) that converts mel-spectrograms to waveforms. The spectrogram model conditions on precomputed speaker embeddings from an ECAPA-TDNN model, enabling zero-shot multi-speaker synthesis. An optional discrete-unit HiFi-GAN variant uses HuBERT-derived speech units instead of mel-spectrograms for speech-to-speech translation applications. The workflow covers speaker embedding extraction, spectrogram model training, vocoder training, and inference synthesis.

Usage

Execute this workflow when you need to build a multi-speaker speech synthesis system that can generate natural-sounding speech from text input. This is appropriate for building voice assistants, audiobook generation, accessibility applications, or the synthesis component of speech-to-speech translation systems. The workflow requires a multi-speaker corpus (LibriTTS) with text transcriptions and supports both continuous mel-spectrogram and discrete unit approaches.

Execution Steps

Step 1: LibriTTS Data Preparation

Prepare the LibriTTS corpus by generating JSON manifest files from the dataset structure. The preparation script processes the multi-speaker data, extracting speaker IDs, text transcriptions, and audio paths. Phoneme-level alignment information is optionally included for improved attention convergence during Tacotron2 training.

Key considerations:

LibriTTS provides normalized text suitable for TTS training
Speaker IDs are extracted from the directory hierarchy
JSON manifests include audio path, text, speaker ID, and duration
Train-clean-100 and train-clean-360 subsets are typically used
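
The manifest generation described above can be sketched as follows. This is a minimal illustration, not the recipe's preparation script: the function names and the manifest keys ("wav", "label", "spk_id", "duration") are assumptions, and the directory layout is assumed to be <subset>/<speaker_id>/<chapter_id>/<utt_id>.wav with a matching .normalized.txt transcript.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def build_manifest(subset_dir):
    """Collect one manifest entry per utterance under a LibriTTS-style subset.

    Assumed layout: <subset>/<speaker_id>/<chapter_id>/<utt_id>.wav with the
    normalized transcript stored next to it as <utt_id>.normalized.txt.
    """
    manifest = {}
    for wav_path in sorted(Path(subset_dir).glob("*/*/*.wav")):
        text_path = wav_path.with_name(wav_path.stem + ".normalized.txt")
        if not text_path.exists():
            continue  # skip audio that has no transcript
        manifest[wav_path.stem] = {
            "wav": str(wav_path),
            "label": text_path.read_text(encoding="utf-8").strip(),
            # The speaker ID is the first directory level below the subset.
            "spk_id": wav_path.parent.parent.name,
            "duration": round(wav_duration_seconds(wav_path), 3),
        }
    return manifest


def save_manifest(manifest, json_path):
    """Write the manifest dictionary to a JSON file."""
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(manifest, json_file, indent=2)
```

Calling build_manifest once per subset (for example train-clean-100) and saving each result yields one JSON file per split. Storing the duration makes it possible to sort or filter utterances by length before training.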

Step 2: Speaker Embedding Precomputation

Extract speaker embeddings for all utterances using a pretrained ECAPA-TDNN model. The embedding extraction script processes each audio file through the speaker recognition model and saves the resulting fixed-dimensional embedding vectors. These embeddings condition the Tacotron2 model during training and inference, enabling multi-speaker synthesis without retraining.

Key considerations:

Speaker embeddings are precomputed once and cached as tensors
A pretrained ECAPA-TDNN model is downloaded from HuggingFace
Embeddings capture speaker identity characteristics (timbre, pitch, speaking style)
For zero-shot synthesis, embeddings from any reference utterance can be used at inference
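
The compute-once-and-cache pattern can be sketched as below. The function name, the manifest format, and embed_fn are hypothetical: embed_fn stands in for a wrapper around the pretrained ECAPA-TDNN encoder, and pickle is used here only as a dependency-free substitute for saving tensors.

```python
import pickle
from pathlib import Path


def precompute_embeddings(manifest, embed_fn, cache_path):
    """Compute one embedding per utterance once, then reuse the cached file.

    manifest:   dict mapping utterance ID -> {"wav": path, ...} (assumed format)
    embed_fn:   hypothetical callable mapping a wav path to a fixed-size vector,
                e.g. a thin wrapper around a pretrained ECAPA-TDNN encoder
    cache_path: file where the {utterance ID: embedding} dict is stored
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: the speaker encoder is not run again.
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)

    embeddings = {
        utt_id: embed_fn(entry["wav"]) for utt_id, entry in manifest.items()
    }
    with open(cache_path, "wb") as cache_file:
        pickle.dump(embeddings, cache_file)
    return embeddings
```

Because the speaker encoder stays frozen, the embeddings never change between epochs, so Tacotron2 training only has to look them up by utterance ID.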

Step 3: Tacotron2 Spectrogram Model Training

Train the multi-speaker Tacotron2 model that predicts mel-spectrograms from text sequences conditioned on speaker embeddings. The Brain subclass implements compute_forward() to encode text, attend to the encoded sequence, and autoregressively generate mel-spectrogram frames. The loss combines mel-spectrogram reconstruction (L1 + MSE), stop token prediction (gate loss), and optional speaker consistency regularization.
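
As a rough numeric sketch of how the loss terms combine, the following uses plain Python lists in place of tensors. The function names and the gate_weight parameter are illustrative assumptions, and the optional speaker consistency term is left out.

```python
import math


def mel_loss(predicted, target):
    """L1 + MSE between two mel-spectrograms given as lists of frames."""
    errors = [
        p - t
        for p_frame, t_frame in zip(predicted, target)
        for p, t in zip(p_frame, t_frame)
    ]
    l1 = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return l1 + mse


def gate_loss(gate_logits, stop_targets):
    """Binary cross-entropy on stop-token logits (1.0 marks the final frame)."""
    total = 0.0
    for logit, target in zip(gate_logits, stop_targets):
        # Numerically stable form of binary cross-entropy on a logit.
        total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    return total / len(gate_logits)


def tacotron2_loss(predicted_mel, target_mel, gate_logits, stop_targets,
                   gate_weight=1.0):
    """Sum of mel reconstruction and weighted gate loss (gate_weight is assumed)."""
    return mel_loss(predicted_mel, target_mel) + gate_weight * gate_loss(
        gate_logits, stop_targets
    )
```

The stop targets are 0.0 for every frame except the last frame of each utterance, which is what lets the gate output learn where synthesis should end.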

Key considerations:

Tacotron2 uses attention to align text encoder outputs with mel-spectrogram frames
Teacher forcing is used during training (ground-truth mel frames as decoder input)
Gate loss trains a stop token predictor to determine when to stop generation
Speaker embeddings are concatenated with the encoder output or decoder input
Progress samples are periodically generated and logged for monitoring
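
Two of the points above, teacher forcing and speaker conditioning by concatenation, can be illustrated with plain lists standing in for tensors. Both function names are hypothetical, and the sketch shows only the encoder-output variant of the concatenation.

```python
def teacher_forcing_inputs(target_mel):
    """Shift ground-truth frames right by one step behind an all-zero go frame.

    The decoder input at step t is the true frame from step t - 1, so the
    model never consumes its own predictions during training.
    """
    go_frame = [0.0] * len(target_mel[0])
    return [go_frame] + [list(frame) for frame in target_mel[:-1]]


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder time step.

    encoder_outputs:   list of T vectors, each of size D_enc
    speaker_embedding: one vector of size D_spk
    returns:           list of T vectors, each of size D_enc + D_spk
    """
    return [list(step) + list(speaker_embedding) for step in encoder_outputs]
```

The attention module then reads the speaker-aware sequence, so every decoding step sees the speaker identity alongside the text encoding.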

Step 4: HiFi-GAN Vocoder Training

Train the HiFi-GAN vocoder to convert mel-spectrograms into high-quality waveforms. The vocoder is trained separately from the spectrogram model in a GAN framework with two families of discriminators, multi-period and multi-scale. The generator upsamples the mel-spectrogram with transposed convolutions followed by multi-receptive field fusion blocks. Both discriminator families operate on the waveform: the multi-period discriminators on audio reshaped by period, and the multi-scale discriminators on raw and downsampled audio.
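
The generator's upsampling arithmetic can be checked with a short calculation. The rates and kernel sizes below are assumed from the original HiFi-GAN V1 configuration and may differ from the recipe's hyperparameters. The point of the sketch is that the product of the upsampling rates has to equal the hop length used to compute the mel-spectrograms.

```python
from math import prod

# Assumed values from the original HiFi-GAN V1 configuration.
UPSAMPLE_RATES = [8, 8, 2, 2]
UPSAMPLE_KERNEL_SIZES = [16, 16, 4, 4]


def transposed_conv_length(length_in, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length_in - 1) * stride - 2 * padding + kernel_size


def generator_output_length(num_mel_frames):
    """Number of waveform samples produced for a given number of mel frames."""
    length = num_mel_frames
    for rate, kernel in zip(UPSAMPLE_RATES, UPSAMPLE_KERNEL_SIZES):
        # Padding of (kernel - rate) // 2 makes each stage scale the length
        # by exactly `rate`.
        length = transposed_conv_length(length, kernel, rate, (kernel - rate) // 2)
    return length


# Total upsampling factor: 8 * 8 * 2 * 2 = 256 samples per mel frame.
SAMPLES_PER_FRAME = prod(UPSAMPLE_RATES)
```

With these values, 100 mel frames expand to 25,600 samples, which matches a hop length of 256.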

Key considerations:

The vocoder needs only audio, not text or speaker labels, so it can be trained on any corpus whose mel-spectrogram settings match those of the spectrogram model
Training alternates between generator and discriminator updates
Loss combines adversarial, feature matching, and mel-spectrogram reconstruction losses
A discrete-unit variant accepts HuBERT codes instead of mel-spectrograms for S2ST applications
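
The loss terms listed above can be sketched with scalar lists standing in for discriminator outputs and feature maps. This assumes the least-squares GAN objective and the weights from the original HiFi-GAN paper (2 for feature matching, 45 for mel reconstruction). The function names are illustrative, and the recipe's actual weights may differ.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: push real scores toward 1 and fake scores toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adversarial_loss(fake_scores):
    """Least-squares GAN loss for the generator: push fake scores toward 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def feature_matching_loss(real_features, fake_features):
    """Mean L1 distance between discriminator feature maps for real and generated audio."""
    distances = [
        abs(r - f)
        for real_map, fake_map in zip(real_features, fake_features)
        for r, f in zip(real_map, fake_map)
    ]
    return sum(distances) / len(distances)


def generator_loss(fake_scores, real_features, fake_features, mel_l1,
                   fm_weight=2.0, mel_weight=45.0):
    """Adversarial + weighted feature matching + weighted mel reconstruction."""
    return (
        generator_adversarial_loss(fake_scores)
        + fm_weight * feature_matching_loss(real_features, fake_features)
        + mel_weight * mel_l1
    )
```

In an alternating schedule, each training step would first update the discriminators with discriminator_loss on real and generated audio, then update the generator with generator_loss computed against the refreshed discriminators.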

Step 5: Inference and Speech Synthesis

Generate speech by passing text through the trained pipeline: text encoding through Tacotron2 to produce mel-spectrograms, then mel-spectrogram to waveform conversion through HiFi-GAN. At inference time, the model runs autoregressively without teacher forcing, using its own predictions as input for subsequent frames. Speaker identity is controlled by providing the desired speaker embedding.
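
The autoregressive loop can be sketched as follows. Here decoder_step is a hypothetical callable standing in for one Tacotron2 decoder step, and gate_threshold and max_steps are assumed parameters rather than values taken from the recipe.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def synthesize_mel(decoder_step, go_frame, gate_threshold=0.5, max_steps=1000):
    """Generate mel frames one at a time until the stop token fires.

    decoder_step: hypothetical callable (previous_frame) -> (next_frame, gate_logit)
    go_frame:     initial all-zero frame that starts decoding
    max_steps:    safety cap in case the stop token never fires
    """
    frames = []
    previous = go_frame
    for _ in range(max_steps):
        frame, gate_logit = decoder_step(previous)
        frames.append(frame)
        if sigmoid(gate_logit) > gate_threshold:
            break  # the stop token predictor signals the end of the utterance
        previous = frame  # feed the model's own prediction back in
    return frames
```

The max_steps cap matters in practice because a model whose attention fails to reach the end of the text may never emit a stop token.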

Key considerations:

Inference is autoregressive: frames are generated one after another, so decoding cannot be parallelized across time the way a teacher-forced training pass can
The stop token predictor determines when generation terminates
Different speaker embeddings produce speech in different voices
Post-processing (denoising, volume normalization) can be applied to the output waveform
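
As one example of the post-processing mentioned above, volume normalization can be done by peak scaling. This is a minimal sketch on a list of float samples. The function name and the 0.95 target are assumptions, and denoising is not covered.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale a waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]
```

Keeping the target slightly below 1.0 leaves headroom so that later conversion to 16-bit PCM does not clip.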
