
Principle:Speechbrain Speechbrain TTS Inference Pipeline

From Leeroopedia


Property Value
Concept End-to-end text-to-speech synthesis combining acoustic model and vocoder at inference time
Domains Text_to_Speech, Inference
Repository speechbrain/speechbrain
Related Implementation Implementation:Speechbrain_Speechbrain_Tacotron2_Inference_Pipeline

Overview

TTS inference is a two-stage pipeline that transforms text into audible speech. The first stage (acoustic model) converts text into a mel-spectrogram representation, and the second stage (vocoder) converts the mel-spectrogram into a waveform. When combined with speaker embeddings, this pipeline supports zero-shot voice cloning, synthesizing speech in the voice of any speaker given a short reference audio sample.

Pipeline Stages

Stage 1: Acoustic Model (Tacotron2)

The Tacotron2 model operates in autoregressive inference mode, generating mel-spectrogram frames one at a time:

  1. Text encoding: Input text is converted to a sequence of token indices using text_to_sequence with English cleaners
  2. Encoder forward pass: The token sequence is processed by the convolutional encoder and bidirectional LSTM
  3. Speaker conditioning: The speaker embedding (192-dim) is concatenated to each encoder output frame
  4. Autoregressive decoding:
    • The decoder starts with a zero-valued mel frame as input
    • At each step, it attends to encoder outputs and generates one mel frame plus a gate (stop) prediction
    • Generation continues until the gate output exceeds the threshold (0.5) or the maximum step count (1500) is reached
  5. Post-net refinement: The complete mel-spectrogram is refined by the post-net convolutions

Inference is triggered via the model's infer() method, which runs the autoregressive loop internally.

Stage 2: Vocoder (HiFi-GAN)

The HiFi-GAN vocoder converts the mel-spectrogram to a waveform in a single forward pass:

  1. The mel-spectrogram from Tacotron2 is passed directly to the HiFi-GAN generator
  2. The generator upsamples through transposed convolutions (256x factor for 16 kHz)
  3. Weight normalization is removed before inference for clean synthesis
  4. The output waveform is saved as a WAV file at the target sample rate

The vocoder is loaded as a pretrained model from HuggingFace Hub using HIFIGAN.from_hparams().

Zero-Shot Voice Cloning

The pipeline supports synthesizing speech in unseen speaker voices through the following mechanism:

  1. Reference audio acquisition: Obtain a short audio clip (a few seconds) from the target speaker
  2. Speaker embedding extraction: Use the pretrained ECAPA-TDNN encoder to extract a 192-dimensional speaker embedding from the reference audio
  3. Conditioned synthesis: Pass the speaker embedding to Tacotron2 along with the desired text
  4. Waveform generation: Convert the speaker-conditioned mel-spectrogram to audio via HiFi-GAN

This works because the speaker embedding captures voice characteristics independently of speech content, and the Tacotron2 model has learned to modulate its output based on arbitrary speaker embeddings during training.

Stop Prediction Handling

The autoregressive decoder must decide when to stop generating frames. This is managed by the gate network, which predicts a stop probability at each step. Critical considerations include:

  • Premature stopping: If the gate fires too early, the utterance is truncated. The gate_threshold parameter (default: 0.5) controls sensitivity
  • Infinite loops: If the gate never fires, the decoder runs until max_decoder_steps (default: 1500 frames). The decoder_no_early_stopping flag can force full-length generation
  • Attention failures: Poor attention alignment (e.g., repeating or skipping words) can cause both premature stops and infinite loops. Guided attention training mitigates this
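The stopping logic above can be sketched independently of the model (a toy loop with a stand-in decoder step; gate_threshold and max_decoder_steps mirror the defaults quoted above):

```python
import torch

def autoregressive_decode(step_fn, gate_threshold=0.5, max_decoder_steps=1500,
                          no_early_stopping=False):
    """Run a decoder step function until the gate fires or the cap is hit.

    step_fn(prev_frame) -> (mel_frame, gate_logit)
    """
    frames = []
    prev = torch.zeros(80)  # decoding starts from an all-zero mel frame
    for _ in range(max_decoder_steps):
        mel_frame, gate_logit = step_fn(prev)
        frames.append(mel_frame)
        prev = mel_frame
        # Stop once the predicted stop probability exceeds the threshold,
        # unless full-length generation is forced
        if not no_early_stopping and torch.sigmoid(gate_logit) > gate_threshold:
            break
    return torch.stack(frames)  # (T, 80)

# Toy step function: random frames, gate fires at step 10
state = {"t": 0}
def toy_step(prev):
    state["t"] += 1
    gate = torch.tensor(5.0 if state["t"] >= 10 else -5.0)
    return torch.rand(80), gate

mel = autoregressive_decode(toy_step)
print(mel.shape)  # torch.Size([10, 80])
```

With no_early_stopping=True the same call would run to the full 1500-step cap, which is exactly the premature-stop / infinite-loop trade-off described above.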

Integration Points

The inference pipeline is integrated into the training recipe at two points:

Validation Monitoring

During training, inference samples are generated at regular intervals (every progress_samples_interval epochs) to monitor:

  • Mel-spectrogram quality (visual comparison with ground truth)
  • Attention alignment (should be roughly diagonal and monotonic)
  • Audio quality (when log_audio_samples is enabled)

Test Evaluation

At the end of training, inference samples are generated from the test set for final quality assessment.

Inference Flow Diagram

Input Text: "Hello, how are you today?"
           |
           v
   [Text-to-Sequence Encoding]
           |
           v
   [Tacotron2 Encoder]  +  [Speaker Embedding from ECAPA-TDNN]
           |                          |
           +-----------+--------------+
                       |
                       v
          [Autoregressive Decoder with Attention]
                       |
                       v
              [Post-Net Refinement]
                       |
                       v
              [Mel-Spectrogram: (80, T)]
                       |
                       v
              [HiFi-GAN Generator]
                       |
                       v
              [Waveform: (1, T*256)]
                       |
                       v
              [Save as .wav file]

Performance Considerations

  • GPU inference: Both models should run on GPU for acceptable speed. The vocoder is loaded with freeze_params=True to avoid unnecessary gradient computation
  • Batch inference: The current implementation processes one utterance at a time during validation/test monitoring
  • Memory: Long utterances may require significant GPU memory; max_decoder_steps provides an upper bound
