Principle:Speechbrain Speechbrain TTS Inference Pipeline
| Property | Value |
|---|---|
| Concept | End-to-end text-to-speech synthesis combining acoustic model and vocoder at inference time |
| Domains | Text_to_Speech, Inference |
| Repository | speechbrain/speechbrain |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Tacotron2_Inference_Pipeline |
Overview
TTS inference is a two-stage pipeline that transforms text into audible speech. The first stage (acoustic model) converts text into a mel-spectrogram representation, and the second stage (vocoder) converts the mel-spectrogram into a waveform. When combined with speaker embeddings, this pipeline supports zero-shot voice cloning, synthesizing speech in the voice of any speaker given a short reference audio sample.
Pipeline Stages
Stage 1: Acoustic Model (Tacotron2)
The Tacotron2 model operates in autoregressive inference mode, generating mel-spectrogram frames one at a time:
- Text encoding: Input text is converted to a sequence of token indices using
text_to_sequencewith English cleaners - Encoder forward pass: The token sequence is processed by the convolutional encoder and bidirectional LSTM
- Speaker conditioning: The speaker embedding (192-dim) is concatenated to each encoder output frame
- Autoregressive decoding:
- The decoder starts with a zero-valued mel frame as input
- At each step, it attends to encoder outputs and generates one mel frame plus a gate (stop) prediction
- Generation continues until the gate output exceeds the threshold (0.5) or the maximum step count (1500) is reached
- Post-net refinement: The complete mel-spectrogram is refined by the post-net convolutions
The inference is triggered via the model.infer() method, which handles the autoregressive loop internally.
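The autoregressive loop that model.infer() encapsulates can be sketched as follows. This is a minimal illustration of the control flow only, not SpeechBrain's actual implementation: decoder_step is a hypothetical stand-in for the real attention decoder, and the step at which its gate fires is arbitrary.

```python
import numpy as np

GATE_THRESHOLD = 0.5      # stop when the predicted stop probability exceeds this
MAX_DECODER_STEPS = 1500  # hard cap on generated mel frames
N_MELS = 80

def decoder_step(prev_frame, step):
    """Hypothetical stand-in for the attention decoder: returns the
    next mel frame and a stop (gate) probability."""
    next_frame = np.zeros(N_MELS)           # placeholder acoustics
    gate_prob = 0.0 if step < 100 else 1.0  # pretend the gate fires at step 100
    return next_frame, gate_prob

def autoregressive_infer():
    frames = []
    prev = np.zeros(N_MELS)  # decoding starts from an all-zero "go" frame
    for step in range(MAX_DECODER_STEPS):
        prev, gate_prob = decoder_step(prev, step)
        frames.append(prev)
        if gate_prob > GATE_THRESHOLD:  # gate fired: utterance is complete
            break
    return np.stack(frames, axis=1)  # (80, T) mel-spectrogram

mel = autoregressive_infer()
print(mel.shape)  # (80, 101): 100 non-firing steps plus the firing one
```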
Stage 2: Vocoder (HiFi-GAN)
The HiFi-GAN vocoder converts the mel-spectrogram to a waveform in a single forward pass:
- The mel-spectrogram from Tacotron2 is passed directly to the HiFi-GAN generator
- The generator upsamples through transposed convolutions (256x factor for 16 kHz)
- Weight normalization is removed before inference for clean synthesis
- The output waveform is saved as a WAV file at the target sample rate
The vocoder is loaded as a pretrained model from HuggingFace Hub using HIFIGAN.from_hparams().
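The 256x upsampling factor comes from the product of the per-stage strides of the transposed convolutions, which must equal the mel hop length. The per-stage rates below are an assumption (a common 16 kHz configuration); the invariant, not the particular factorization, is the point.

```python
# Sketch of the HiFi-GAN upsampling arithmetic. The per-stage rates are
# an assumption; the key invariant is that their product equals the mel
# hop length, here 256 samples per frame.
upsample_rates = [8, 8, 2, 2]  # one transposed convolution per stage
hop_length = 256               # mel frame hop in samples
sample_rate = 16000

factor = 1
for r in upsample_rates:
    factor *= r
assert factor == hop_length    # total upsampling must match the hop

mel_frames = 101               # T frames from the acoustic model
waveform_samples = mel_frames * factor
print(waveform_samples)                # 25856 samples
print(waveform_samples / sample_rate)  # ~1.62 seconds of audio
```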
Zero-Shot Voice Cloning
The pipeline supports synthesizing speech in unseen speaker voices through the following mechanism:
- Reference audio acquisition: Obtain a short audio clip (a few seconds) from the target speaker
- Speaker embedding extraction: Use the pretrained ECAPA-TDNN encoder to extract a 192-dimensional speaker embedding from the reference audio
- Conditioned synthesis: Pass the speaker embedding to Tacotron2 along with the desired text
- Waveform generation: Convert the speaker-conditioned mel-spectrogram to audio via HiFi-GAN
This works because the speaker embedding captures voice characteristics independently of speech content, and the Tacotron2 model has learned to modulate its output based on arbitrary speaker embeddings during training.
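The conditioning step above reduces to broadcasting one utterance-level embedding across the encoder time axis. The sketch below illustrates only the shapes involved: the 512-dim encoder output matches Tacotron2's default, and the random vectors are stand-ins for real encoder and ECAPA-TDNN outputs.

```python
import numpy as np

T_enc = 42                                 # encoder time steps (token count)
encoder_out = np.random.randn(T_enc, 512)  # stand-in for Tacotron2 encoder outputs
spk_embedding = np.random.randn(192)       # stand-in for an ECAPA-TDNN embedding

# Broadcast the single utterance-level embedding to every encoder frame
# and concatenate along the feature axis.
conditioned = np.concatenate(
    [encoder_out, np.tile(spk_embedding, (T_enc, 1))], axis=1
)
print(conditioned.shape)  # (42, 704): 512 encoder dims + 192 speaker dims
```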
Stop Prediction Handling
The autoregressive decoder must decide when to stop generating frames. This is managed by the gate network, which predicts a stop probability at each step. Critical considerations include:
- Premature stopping: If the gate fires too early, the utterance is truncated. The gate_threshold parameter (default: 0.5) controls sensitivity
- Infinite loops: If the gate never fires, the decoder runs until max_decoder_steps (default: 1500 frames). The decoder_no_early_stopping flag can force full-length generation
- Attention failures: Poor attention alignment (e.g., repeating or skipping words) can cause both premature stops and infinite loops. Guided attention training mitigates this
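The three gate behaviours can be condensed into a single stop-decision function. This is a hypothetical sketch of the policy, not SpeechBrain's code; the parameter names mirror the hyperparameters described above.

```python
def should_stop(gate_prob, step, gate_threshold=0.5,
                max_decoder_steps=1500, no_early_stopping=False):
    """Decide whether the autoregressive decoder halts at this step.
    A hypothetical condensation of the gate logic described above."""
    if step + 1 >= max_decoder_steps:  # hard cap: always stop
        return True
    if no_early_stopping:              # force full-length generation
        return False
    return gate_prob > gate_threshold  # normal gated stopping

# Premature stop risk: a noisy gate spike ends generation immediately.
assert should_stop(0.6, step=10) is True
# Raising the threshold makes the gate less trigger-happy.
assert should_stop(0.6, step=10, gate_threshold=0.9) is False
# With no_early_stopping, only max_decoder_steps terminates decoding.
assert should_stop(0.99, step=10, no_early_stopping=True) is False
assert should_stop(0.0, step=1499) is True
```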
Integration Points
The inference pipeline is integrated into the training recipe at two points:
Validation Monitoring
During training, inference samples are generated at regular intervals (every progress_samples_interval epochs) to monitor:
- Mel-spectrogram quality (visual comparison with ground truth)
- Attention alignment (should be roughly diagonal and monotonic)
- Audio quality (when log_audio_samples is enabled)
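The "roughly diagonal and monotonic" check on attention alignments can also be done numerically rather than visually. The metric below is a hypothetical rough health check, not a SpeechBrain API: sharp attention gives high per-step peak weights, and the attended encoder position should move forward over decoder steps.

```python
import numpy as np

def diagonality_score(alignment):
    """Rough health check for an attention alignment matrix of shape
    (decoder_steps, encoder_steps). A hypothetical metric, not a
    SpeechBrain API."""
    peaks = alignment.max(axis=1)         # attention sharpness per decoder step
    positions = alignment.argmax(axis=1)  # attended encoder position per step
    monotonic = bool(np.all(np.diff(positions) >= 0))
    return peaks.mean(), monotonic

# A perfectly diagonal alignment: each decoder step attends one token.
good = np.eye(5)
sharpness, monotonic = diagonality_score(good)
print(sharpness, monotonic)  # 1.0 True
```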
Test Evaluation
At the end of training, inference samples are generated from the test set for final quality assessment.
Inference Flow Diagram
Input Text: "Hello, how are you today?"
|
v
[Text-to-Sequence Encoding]
|
v
[Tacotron2 Encoder] + [Speaker Embedding from ECAPA-TDNN]
| |
+-----------+--------------+
|
v
[Autoregressive Decoder with Attention]
|
v
[Post-Net Refinement]
|
v
[Mel-Spectrogram: (80, T)]
|
v
[HiFi-GAN Generator]
|
v
[Waveform: (1, T*256)]
|
v
[Save as .wav file]
Performance Considerations
- GPU inference: Both models should run on GPU for acceptable speed. The vocoder is loaded with freeze_params=True to avoid unnecessary gradient computation
- Batch inference: The current implementation processes one utterance at a time during validation/test monitoring
- Memory: Long utterances may require significant GPU memory; max_decoder_steps provides an upper bound
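The bound that max_decoder_steps places on utterance length and output-buffer size is easy to compute; the arithmetic below assumes the 256-sample hop and 16 kHz rate described above (intermediate activations, not these output buffers, typically dominate actual GPU memory use).

```python
# Upper bounds implied by max_decoder_steps, assuming a 256-sample mel hop
# at 16 kHz (the configuration described above).
max_decoder_steps = 1500
n_mels = 80
hop_length = 256
sample_rate = 16000
bytes_per_float = 4  # float32

max_duration_s = max_decoder_steps * hop_length / sample_rate
mel_bytes = max_decoder_steps * n_mels * bytes_per_float
wav_bytes = max_decoder_steps * hop_length * bytes_per_float

print(max_duration_s)  # 24.0 seconds of audio at most
print(mel_bytes)       # 480000 bytes (~0.5 MB) for the mel-spectrogram
print(wav_bytes)       # 1536000 bytes (~1.5 MB) for the waveform
```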
See Also
- Implementation:Speechbrain_Speechbrain_Tacotron2_Inference_Pipeline - Code implementing the inference pipeline
- Principle:Speechbrain_Speechbrain_Tacotron2_Acoustic_Model_Training - Training the acoustic model used in Stage 1
- Principle:Speechbrain_Speechbrain_HiFi_GAN_Vocoder_Training - Training the vocoder used in Stage 2
- Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation - Speaker embeddings enabling voice cloning