
Principle:Speechbrain Speechbrain TTS Inference Pipeline

From Leeroopedia


Property Value
Concept End-to-end text-to-speech synthesis combining acoustic model and vocoder at inference time
Domains Text_to_Speech, Inference
Repository speechbrain/speechbrain
Related Implementation Implementation:Speechbrain_Speechbrain_Tacotron2_Inference_Pipeline

Overview

TTS inference is a two-stage pipeline that transforms text into audible speech. The first stage (acoustic model) converts text into a mel-spectrogram representation, and the second stage (vocoder) converts the mel-spectrogram into a waveform. When combined with speaker embeddings, this pipeline supports zero-shot voice cloning, synthesizing speech in the voice of any speaker given a short reference audio sample.

Pipeline Stages

Stage 1: Acoustic Model (Tacotron2)

The Tacotron2 model operates in autoregressive inference mode, generating mel-spectrogram frames one at a time:

  1. Text encoding: Input text is converted to a sequence of token indices using text_to_sequence with English cleaners
  2. Encoder forward pass: The token sequence is processed by the convolutional encoder and bidirectional LSTM
  3. Speaker conditioning: The speaker embedding (192-dim) is concatenated to each encoder output frame
  4. Autoregressive decoding:
    • The decoder starts with a zero-valued mel frame as input
    • At each step, it attends to encoder outputs and generates one mel frame plus a gate (stop) prediction
    • Generation continues until the gate output exceeds the threshold (0.5) or the maximum step count (1500) is reached
  5. Post-net refinement: The complete mel-spectrogram is refined by the post-net convolutions

Inference is triggered via the model's infer() method, which runs the autoregressive loop internally.

Stage 2: Vocoder (HiFi-GAN)

The HiFi-GAN vocoder converts the mel-spectrogram to a waveform in a single forward pass:

  1. The mel-spectrogram from Tacotron2 is passed directly to the HiFi-GAN generator
  2. The generator upsamples through transposed convolutions (256x factor for 16 kHz)
  3. Weight normalization is removed before inference for clean synthesis
  4. The output waveform is saved as a WAV file at the target sample rate

The vocoder is loaded as a pretrained model from HuggingFace Hub using HIFIGAN.from_hparams().

Zero-Shot Voice Cloning

The pipeline supports synthesizing speech in unseen speaker voices through the following mechanism:

  1. Reference audio acquisition: Obtain a short audio clip (a few seconds) from the target speaker
  2. Speaker embedding extraction: Use the pretrained ECAPA-TDNN encoder to extract a 192-dimensional speaker embedding from the reference audio
  3. Conditioned synthesis: Pass the speaker embedding to Tacotron2 along with the desired text
  4. Waveform generation: Convert the speaker-conditioned mel-spectrogram to audio via HiFi-GAN

This works because the speaker embedding captures voice characteristics independently of speech content, and the Tacotron2 model has learned to modulate its output based on arbitrary speaker embeddings during training.

Stop Prediction Handling

The autoregressive decoder must decide when to stop generating frames. This is managed by the gate network, which predicts a stop probability at each step. Critical considerations include:

  • Premature stopping: If the gate fires too early, the utterance is truncated. The gate_threshold parameter (default: 0.5) controls sensitivity
  • Infinite loops: If the gate never fires, the decoder runs until max_decoder_steps (default: 1500 frames). The decoder_no_early_stopping flag can force full-length generation
  • Attention failures: Poor attention alignment (e.g., repeating or skipping words) can cause both premature stops and infinite loops. Guided attention training mitigates this
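The stopping logic above can be sketched independently of the model (a toy loop with a stand-in decoder step; gate_threshold and max_decoder_steps mirror the defaults quoted above):

```python
import torch

def autoregressive_decode(step_fn, gate_threshold=0.5, max_decoder_steps=1500,
                          no_early_stopping=False):
    """Run a decoder step function until the gate fires or the cap is hit.

    step_fn(prev_frame) -> (mel_frame, gate_logit)
    """
    frames = []
    prev = torch.zeros(80)  # decoding starts from an all-zero mel frame
    for _ in range(max_decoder_steps):
        mel_frame, gate_logit = step_fn(prev)
        frames.append(mel_frame)
        prev = mel_frame
        # Stop once the predicted stop probability exceeds the threshold,
        # unless full-length generation is forced
        if not no_early_stopping and torch.sigmoid(gate_logit) > gate_threshold:
            break
    return torch.stack(frames)  # (T, 80)

# Toy step function: random frames, gate fires at step 10
state = {"t": 0}
def toy_step(prev):
    state["t"] += 1
    gate = torch.tensor(5.0 if state["t"] >= 10 else -5.0)
    return torch.rand(80), gate

mel = autoregressive_decode(toy_step)
print(mel.shape)  # torch.Size([10, 80])
```

With no_early_stopping=True the same call would run to the full 1500-step cap, which is exactly the premature-stop / infinite-loop trade-off described above.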

Integration Points

The inference pipeline is integrated into the training recipe at two points:

Validation Monitoring

During training, inference samples are generated at regular intervals (every progress_samples_interval epochs) to monitor:

  • Mel-spectrogram quality (visual comparison with ground truth)
  • Attention alignment (should be roughly diagonal and monotonic)
  • Audio quality (when log_audio_samples is enabled)

Test Evaluation

At the end of training, inference samples are generated from the test set for final quality assessment.

Inference Flow Diagram

Input Text: "Hello, how are you today?"
           |
           v
   [Text-to-Sequence Encoding]
           |
           v
   [Tacotron2 Encoder]  +  [Speaker Embedding from ECAPA-TDNN]
           |                          |
           +-----------+--------------+
                       |
                       v
          [Autoregressive Decoder with Attention]
                       |
                       v
              [Post-Net Refinement]
                       |
                       v
              [Mel-Spectrogram: (80, T)]
                       |
                       v
              [HiFi-GAN Generator]
                       |
                       v
              [Waveform: (1, T*256)]
                       |
                       v
              [Save as .wav file]

Performance Considerations

  • GPU inference: Both models should run on GPU for acceptable speed. The vocoder is loaded with freeze_params=True to avoid unnecessary gradient computation
  • Batch inference: The current implementation processes one utterance at a time during validation/test monitoring
  • Memory: Long utterances may require significant GPU memory; max_decoder_steps provides an upper bound
