Principle:Speechbrain Speechbrain Tacotron2 Acoustic Model Training

From Leeroopedia


Property | Value
Concept | Training autoregressive attention-based acoustic models that convert text to mel-spectrograms
Domains | Text_to_Speech, Sequence_Modeling
Repository | speechbrain/speechbrain
Knowledge Sources | Shen et al. 2018, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions"
Related Implementation | Implementation:Speechbrain_Speechbrain_Tacotron2Brain_Compute_Forward

Overview

Tacotron2 is a sequence-to-sequence model with attention that converts character or phoneme sequences into mel-spectrograms. The SpeechBrain implementation extends the original architecture for multi-speaker synthesis by conditioning on precomputed speaker embeddings, enabling a single model to generate speech in many different voices.

Architecture

The Tacotron2 architecture consists of four major components:

Encoder

The encoder transforms a sequence of text tokens (characters or phonemes) into a hidden representation:

  1. Embedding layer: Maps discrete token indices to 1024-dimensional dense vectors
  2. Convolutional stack: 6 layers of 1D convolutions (kernel size 5) with batch normalization and ReLU activation, extracting local context
  3. Bidirectional LSTM: Captures long-range dependencies in the text, producing encoder outputs of dimension 1024
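
The three encoder stages above can be sketched in PyTorch. This is a minimal illustration, not SpeechBrain's actual module: dimensions follow the list (1024-dim embeddings, 6 convolutions with kernel size 5, 1024-dim bidirectional LSTM outputs), while the class name, vocabulary size, and argument names are illustrative.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of the encoder: embedding -> conv stack -> BiLSTM."""

    def __init__(self, n_symbols=148, emb_dim=1024, n_convs=6, kernel=5):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(n_convs)
        )
        # Bidirectional: 512 units per direction -> 1024-dim outputs
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer token indices
        x = self.embedding(tokens).transpose(1, 2)  # (B, emb_dim, T)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                       # (B, T, emb_dim)
        outputs, _ = self.lstm(x)
        return outputs                              # (B, T, 1024)
```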

Attention Mechanism

The location-sensitive attention mechanism aligns encoder outputs with decoder steps:

  • Combines content-based attention (an additive score over the decoder query and encoder keys) with location-based attention (a convolution over the previous alignment weights)
  • Uses 32 location filters with kernel size 31
  • Produces a context vector at each decoder step by computing a weighted sum over encoder outputs
  • Enables monotonic left-to-right alignment between text and speech
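
A sketch of location-sensitive attention in its standard additive form, with the 32 location filters and kernel size 31 from the list above. The projection dimension (128) and all names are illustrative assumptions, not SpeechBrain's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAttentionSketch(nn.Module):
    """Content-based + location-based attention over encoder outputs."""

    def __init__(self, enc_dim=1024, query_dim=2048, attn_dim=128,
                 n_filters=32, kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        # Location term: convolve the previous alignment weights
        self.loc_conv = nn.Conv1d(1, n_filters, kernel,
                                  padding=kernel // 2, bias=False)
        self.loc_proj = nn.Linear(n_filters, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_weights):
        # query: (B, query_dim); keys: (B, T, enc_dim); prev_weights: (B, T)
        loc = self.loc_conv(prev_weights.unsqueeze(1)).transpose(1, 2)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1)
            + self.key_proj(keys)
            + self.loc_proj(loc)
        )).squeeze(-1)                                       # (B, T)
        weights = F.softmax(energies, dim=-1)
        # Context vector: weighted sum over encoder outputs
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
        return context, weights
```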

Decoder

The autoregressive decoder generates mel-spectrogram frames one at a time:

  1. Pre-net: Two fully connected layers (512 units each) with dropout, processing the previous mel frame as input
  2. Attention RNN: A single-layer LSTM (2048 units) that combines pre-net output with the previous context vector
  3. Decoder RNN: A single-layer LSTM (2048 units) that combines the attention RNN output with the current context vector
  4. Linear projection: Projects decoder RNN output to 80 mel-frequency channels
  5. Gate network: A linear layer with sigmoid activation that predicts when to stop generation
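
One autoregressive step through the five stages above can be sketched as follows. Sizes follow the text (512-unit pre-net, 2048-unit LSTMs, 80 mel channels, 1024-dim context); the attention computation is assumed to happen outside this module, and all names and the dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    """One decoder step: pre-net -> attention RNN -> decoder RNN ->
    mel projection + gate (stop-token) prediction."""

    def __init__(self, n_mels=80, prenet_dim=512, rnn_dim=2048,
                 ctx_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.attn_rnn = nn.LSTMCell(prenet_dim + ctx_dim, rnn_dim)
        self.dec_rnn = nn.LSTMCell(rnn_dim + ctx_dim, rnn_dim)
        self.mel_proj = nn.Linear(rnn_dim + ctx_dim, n_mels)
        self.gate = nn.Linear(rnn_dim + ctx_dim, 1)

    def forward(self, prev_mel, prev_ctx, ctx, attn_state, dec_state):
        # prev_mel: (B, n_mels); prev_ctx / ctx: (B, ctx_dim)
        x = self.prenet(prev_mel)
        attn_state = self.attn_rnn(torch.cat([x, prev_ctx], -1), attn_state)
        dec_state = self.dec_rnn(torch.cat([attn_state[0], ctx], -1),
                                 dec_state)
        h = torch.cat([dec_state[0], ctx], -1)
        mel = self.mel_proj(h)               # next mel frame
        stop = torch.sigmoid(self.gate(h))   # stop probability
        return mel, stop, attn_state, dec_state
```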

Post-Net

A post-processing network refines the decoder's mel-spectrogram output:

  • 10 convolutional layers (kernel size 5, embedding dim 1024) with batch normalization and tanh activation
  • Predicts a residual that is added to the decoder output
  • Improves spectral detail and reduces artifacts

Multi-Speaker Conditioning

Speaker identity is injected by concatenating the 192-dimensional speaker embedding to the encoder output at every time step. This allows the decoder's attention and generation process to be influenced by the target speaker's vocal characteristics.
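
The conditioning scheme reduces to broadcasting the per-utterance embedding across time and concatenating along the feature axis, as in this sketch (function name is illustrative):

```python
import torch

def condition_on_speaker(enc_out, spk_emb):
    """Concatenate a speaker embedding to every encoder time step.

    enc_out: (B, T, 1024) encoder outputs
    spk_emb: (B, 192) precomputed speaker embedding
    returns: (B, T, 1024 + 192) conditioned encoder outputs
    """
    spk = spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
    return torch.cat([enc_out, spk], dim=-1)
```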

Training Loss

The training objective combines multiple loss terms:

Mel-Spectrogram Loss

Mean Squared Error (MSE) is computed between predicted and target mel-spectrograms at two stages:

  • Pre-net MSE: Loss on the decoder's direct output (before post-net)
  • Post-net MSE: Loss on the refined output (after post-net)

Both losses are summed, encouraging the decoder to produce reasonable output even without the post-net.
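
The two-stage mel loss is simply the sum of the two MSE terms (a sketch; the function name and argument order are illustrative):

```python
import torch
import torch.nn.functional as F

def mel_loss(mel_before, mel_after, mel_target):
    """Sum of MSE on the decoder output (before the post-net) and on
    the post-net-refined output, both against the same target."""
    return (F.mse_loss(mel_before, mel_target)
            + F.mse_loss(mel_after, mel_target))
```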

Gate Loss

Binary Cross-Entropy (BCE) loss on the stop token predictions. The target is 0 for all frames except the last, which is 1. Weighted by gate_loss_weight (default: 1.0).
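
A sketch of the gate loss, building the 0/1 targets from utterance lengths as described above (names are illustrative; SpeechBrain's actual masking of padded frames is omitted):

```python
import torch
import torch.nn.functional as F

def gate_loss(gate_logits, lengths, gate_loss_weight=1.0):
    """BCE on stop-token logits; target is 1 only at the final frame.

    gate_logits: (B, T_max) raw gate outputs before the sigmoid
    lengths: (B,) number of valid frames per utterance
    """
    targets = torch.zeros_like(gate_logits)
    targets[torch.arange(gate_logits.size(0)), lengths - 1] = 1.0
    return gate_loss_weight * F.binary_cross_entropy_with_logits(
        gate_logits, targets)
```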

Guided Attention Loss

An auxiliary loss that encourages roughly diagonal (monotonic) attention alignments during early training:

  • Penalizes attention weights that deviate from the diagonal
  • Weight is controlled by guided_attention_weight (default: 25.0) with exponential decay (half-life of 25 epochs)
  • Hard stop after epoch 50 (attention loss weight drops to 0)
  • guided_attention_sigma (default: 0.2) controls the width of the allowed diagonal band
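
The decay schedule and the soft diagonal penalty can be sketched as follows. The schedule implements the half-life and hard stop from the list above; the mask uses the standard guided-attention formulation, with sigma controlling the band width (function names are illustrative, not SpeechBrain's exact code):

```python
import math

def guided_attention_weight(epoch, initial=25.0, half_life=25,
                            hard_stop=50):
    """Exponentially decayed loss weight, zeroed after the hard stop."""
    if epoch > hard_stop:
        return 0.0
    return initial * 0.5 ** (epoch / half_life)

def guided_attention_mask(n_text, n_mel, sigma=0.2):
    """Penalty mask W[s][t] = 1 - exp(-(t/N - s/M)^2 / (2*sigma^2)):
    near zero on the diagonal, approaching 1 far from it."""
    return [[1.0 - math.exp(-((t / n_text - s / n_mel) ** 2)
                            / (2.0 * sigma ** 2))
             for t in range(n_text)] for s in range(n_mel)]
```

The guided-attention loss is then the mean of the attention weights multiplied elementwise by this mask, so off-diagonal attention mass is penalized while diagonal alignments are nearly free.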

Training Configuration

Key hyperparameters from the SpeechBrain recipe:

Parameter | Value | Description
Epochs | 700 | Total training epochs
Batch size | 32 | Must be at least 2 for batch normalization
Learning rate | 0.001 | With Noam scheduler (4000 warmup steps)
Optimizer | Adam | Weight decay 0.000006
Sample rate | 16000 Hz | Audio sample rate
Mel channels | 80 | Number of mel-frequency bins
Max decoder steps | 1500 | Upper bound on generation length at inference
Gate threshold | 0.5 | Stop generation when the gate output exceeds this value
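
The Noam schedule referenced in the table combines linear warmup with inverse-square-root decay, peaking at the base learning rate when the warmup ends. This is a sketch of that shape; SpeechBrain's NoamScheduler may differ in normalization details:

```python
def noam_lr(step, base_lr=0.001, warmup=4000):
    """Noam-style learning rate: linear warmup for `warmup` steps,
    then decay proportional to 1/sqrt(step), peak = base_lr."""
    step = max(step, 1)
    return base_lr * min(step / warmup, (warmup / step) ** 0.5)
```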

Training Loop Specifics

The Tacotron2Brain class extends SpeechBrain's Brain with custom behavior:

  • fit_batch: Calls the parent fit_batch and then applies learning rate annealing via the Noam scheduler
  • batch_to_device: Custom method that unpacks the complex batch tuple (text, mel, gate, speaker embeddings, speaker IDs) and transfers each tensor to the target device
  • Progress logging: Saves spectrogram images and audio samples every 10 epochs for monitoring training quality
  • Checkpoint management: Saves checkpoints based on validation loss with optional interval-based retention
