Workflow:Speechbrain Speechbrain Text to Speech Training
| Knowledge Sources | |
|---|---|
| Domains | TTS, Speech_Synthesis, Deep_Learning |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for training a multi-speaker text-to-speech system using Tacotron2 with speaker embeddings and HiFi-GAN vocoder within SpeechBrain.
Description
This workflow covers the complete pipeline for building a multi-speaker TTS system on the LibriTTS corpus. It involves two major model components: a spectrogram prediction model (multi-speaker Tacotron2) and a neural vocoder (HiFi-GAN) that converts mel-spectrograms to waveforms. The spectrogram model conditions on precomputed speaker embeddings from an ECAPA-TDNN model, enabling zero-shot multi-speaker synthesis. An optional discrete-unit HiFi-GAN variant uses HuBERT-derived speech units instead of mel-spectrograms for speech-to-speech translation applications. The workflow covers speaker embedding extraction, spectrogram model training, vocoder training, and inference synthesis.
Usage
Use this workflow when you need to build a multi-speaker speech synthesis system that can generate natural-sounding speech from text input. This is appropriate for building voice assistants, audiobook generation, accessibility applications, or the synthesis component of speech-to-speech translation systems. The workflow requires a multi-speaker corpus (LibriTTS) with text transcriptions and supports both continuous mel-spectrogram and discrete-unit approaches.
Execution Steps
Step 1: LibriTTS Data Preparation
Prepare the LibriTTS corpus by generating JSON manifest files from the dataset structure. The preparation script processes the multi-speaker data, extracting speaker IDs, text transcriptions, and audio paths. Phoneme-level alignment information is optionally included for improved attention convergence during Tacotron2 training.
Key considerations:
- LibriTTS provides normalized text suitable for TTS training
- Speaker IDs are extracted from the directory hierarchy
- JSON manifests include audio path, text, speaker ID, and duration
- The train-clean-100 and train-clean-360 subsets are typically used for training
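The manifest generation described above can be sketched with the standard library alone. This is a minimal illustration, not the recipe's actual preparation script: it assumes the usual LibriTTS layout (`<subset>/<speaker_id>/<chapter_id>/<utterance>.wav` with a sibling `.normalized.txt` transcript), and the field names `wav`, `label`, `spk_id`, and `duration` as well as the helper names are chosen for illustration.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Read the duration of a PCM WAV file from its header."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def build_manifest(subset_dir):
    """Map utterance IDs to audio path, text, speaker ID, and duration."""
    manifest = {}
    # Assumed layout: <subset>/<speaker_id>/<chapter_id>/<utterance>.wav
    for wav_path in sorted(Path(subset_dir).glob("*/*/*.wav")):
        text_path = wav_path.with_name(wav_path.stem + ".normalized.txt")
        if not text_path.exists():
            continue  # skip audio that has no normalized transcript
        manifest[wav_path.stem] = {
            "wav": str(wav_path),
            "label": text_path.read_text(encoding="utf-8").strip(),
            # The speaker ID is the directory two levels above the file.
            "spk_id": wav_path.parent.parent.name,
            "duration": round(wav_duration_seconds(wav_path), 3),
        }
    return manifest


def save_manifest(manifest, json_path):
    """Write the manifest as indented JSON."""
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(manifest, json_file, indent=2)
```

Storing the duration in the manifest lets later stages filter or sort utterances by length without reopening the audio files.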
Step 2: Speaker Embedding Precomputation
Extract speaker embeddings for all utterances using a pretrained ECAPA-TDNN model. The embedding extraction script processes each audio file through the speaker recognition model and saves the resulting fixed-dimensional embedding vectors. These embeddings condition the Tacotron2 model during training and inference, so the model can synthesize new voices without being retrained for each speaker.
Key considerations:
- Speaker embeddings are precomputed once and cached as tensors
- A pretrained ECAPA-TDNN model is downloaded from HuggingFace
- Embeddings capture speaker identity characteristics (timbre, pitch, speaking style)
- For zero-shot synthesis, embeddings from any reference utterance can be used at inference
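The compute-once-and-cache pattern can be sketched as follows. This is a framework-free illustration under stated assumptions: `encode_fn` is a hypothetical callable that maps a WAV path to an embedding vector, the manifest uses the illustrative `wav` field, and the cache is a pickle file rather than the tensor format the recipe may use.

```python
import pickle
from pathlib import Path


def precompute_embeddings(manifest, encode_fn, cache_path):
    """Compute one embedding per utterance, reusing a cache file if present."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Embeddings were already extracted: load them instead of recomputing.
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)

    embeddings = {}
    for utt_id, entry in manifest.items():
        # encode_fn is a hypothetical callable: WAV path -> embedding vector.
        embeddings[utt_id] = encode_fn(entry["wav"])

    with open(cache_path, "wb") as cache_file:
        pickle.dump(embeddings, cache_file)
    return embeddings
```

In a SpeechBrain setup, `encode_fn` could plausibly wrap a pretrained `EncoderClassifier` (loaded with `from_hparams`) and call its `encode_batch` method on the loaded waveform; the exact wiring depends on the recipe and SpeechBrain version.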
Step 3: Tacotron2 Spectrogram Model Training
Train the multi-speaker Tacotron2 model that predicts mel-spectrograms from text sequences conditioned on speaker embeddings. The Brain subclass implements compute_forward() to encode text, attend to the encoded sequence, and autoregressively generate mel-spectrogram frames. The loss combines mel-spectrogram reconstruction (L1 + MSE), stop token prediction (gate loss), and optional speaker consistency regularization.
Key considerations:
- Tacotron2 uses attention to align text encoder outputs with mel-spectrogram frames
- Teacher forcing is used during training (ground-truth mel frames as decoder input)
- Gate loss trains a stop token predictor to determine when to stop generation
- Speaker embeddings are concatenated with the encoder output or decoder input
- Progress samples are periodically generated and logged for monitoring
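The loss terms described in this step can be written out in plain Python to make their structure explicit. This is a sketch, not the SpeechBrain loss module: mel-spectrograms are flattened to lists of floats, the helper names and the `gate_weight` parameter are illustrative, and the optional speaker consistency term is omitted.

```python
import math


def mean_abs_error(pred, target):
    """L1 reconstruction term."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mean_squared_error(pred, target):
    """MSE reconstruction term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy, averaged over frames."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def gate_targets(num_frames, padded_length):
    """0 while speech continues, 1 from the last real frame onward."""
    return [0.0 if i < num_frames - 1 else 1.0 for i in range(padded_length)]


def tacotron2_style_loss(mel_pred, mel_target, gate_logits, gate_target,
                         gate_weight=1.0):
    """Mel reconstruction (L1 + MSE) plus weighted stop-token (gate) loss."""
    mel_loss = (mean_abs_error(mel_pred, mel_target)
                + mean_squared_error(mel_pred, mel_target))
    gate_loss = bce_with_logits(gate_logits, gate_target)
    return mel_loss + gate_weight * gate_loss
```

Because the gate target is 1 only from the final real frame onward, the positive class is rare, which is one reason implementations often weight or rescale the gate term.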
Step 4: HiFi-GAN Vocoder Training
Train the HiFi-GAN vocoder to convert mel-spectrograms into high-quality waveforms. The vocoder is trained separately from the spectrogram model using a GAN framework with multiple discriminators (multi-period and multi-scale). The generator uses transposed convolutions with multi-receptive field fusion blocks. Both discriminators operate on the raw waveform: the multi-period discriminator on views of the signal reshaped by different periods, and the multi-scale discriminator on progressively average-pooled versions of the signal.
Key considerations:
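- The vocoder is trained on paired mel-spectrograms and waveforms and does not depend on the Tacotron2 model, but its mel-spectrogram settings (sample rate, hop length, number of mel bins) must match those of the spectrogram model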
- Training alternates between generator and discriminator updates
- Loss combines adversarial, feature matching, and mel-spectrogram reconstruction losses
- A discrete-unit variant accepts HuBERT codes instead of mel-spectrograms for speech-to-speech translation (S2ST) applications
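The way the three generator loss terms combine can be sketched in plain Python. This sketch assumes the least-squares GAN objective and the weights from the original HiFi-GAN paper (2 for feature matching, 45 for mel reconstruction); the SpeechBrain recipe may use different values, and all function names here are illustrative. Discriminator scores, feature maps, and mel-spectrograms are flattened to lists of floats.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: push real scores to 1 and fake scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adversarial_loss(fake_scores):
    """The generator wants the discriminator to score its output as 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def l1_distance(a, b):
    """Mean absolute difference between two equally sized sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_features, fake_features):
    """Sum of per-layer L1 distances between discriminator feature maps."""
    return sum(l1_distance(r, f) for r, f in zip(real_features, fake_features))


def generator_loss(fake_scores, real_features, fake_features,
                   mel_real, mel_fake, fm_weight=2.0, mel_weight=45.0):
    """Adversarial + weighted feature matching + weighted mel L1 loss."""
    return (generator_adversarial_loss(fake_scores)
            + fm_weight * feature_matching_loss(real_features, fake_features)
            + mel_weight * l1_distance(mel_real, mel_fake))
```

In a typical alternating step, the discriminators are first updated with `discriminator_loss` on real audio and detached generator output, then the generator is updated with `generator_loss` using the refreshed discriminators.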
Step 5: Inference and Speech Synthesis
Generate speech by passing text through the trained pipeline: text encoding through Tacotron2 to produce mel-spectrograms, then mel-spectrogram to waveform conversion through HiFi-GAN. At inference time, the model runs autoregressively without teacher forcing, using its own predictions as input for subsequent frames. Speaker identity is controlled by providing the desired speaker embedding.
Key considerations:
- Inference is autoregressive: each frame depends on the previous prediction, so synthesis time grows with utterance length
- The stop token predictor determines when generation terminates
- Different speaker embeddings produce speech in different voices
- Post-processing (denoising, volume normalization) can be applied to the output waveform
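The autoregressive loop with stop-token termination can be sketched as below. This is a schematic, not the Tacotron2 decoder itself: `decoder_step` is a hypothetical callable that takes the previous frame and returns the next frame plus a gate logit, and the `gate_threshold` and `max_steps` defaults are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def synthesize_frames(decoder_step, first_frame, gate_threshold=0.5,
                      max_steps=1000):
    """Generate mel frames one at a time until the stop token fires."""
    frames = []
    previous = first_frame
    for _ in range(max_steps):
        # decoder_step is hypothetical: previous frame -> (frame, gate logit).
        frame, gate_logit = decoder_step(previous)
        frames.append(frame)
        if sigmoid(gate_logit) > gate_threshold:
            break  # the stop token predictor signals the end of speech
        previous = frame  # feed the model's own prediction back in
    return frames
```

The `max_steps` cap guards against utterances where the stop token never fires, which would otherwise make generation run indefinitely.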