Principle:Speechbrain Speechbrain Tacotron2 Acoustic Model Training

From Leeroopedia


Property | Value
Concept | Training autoregressive attention-based acoustic models that convert text to mel-spectrograms
Domains | Text_to_Speech, Sequence_Modeling
Repository | speechbrain/speechbrain
Knowledge Sources | Shen et al. 2018, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions"
Related Implementation | Implementation:Speechbrain_Speechbrain_Tacotron2Brain_Compute_Forward

Overview

Tacotron2 is a sequence-to-sequence model with attention that converts character or phoneme sequences into mel-spectrograms. The SpeechBrain implementation extends the original architecture for multi-speaker synthesis by conditioning on precomputed speaker embeddings, enabling a single model to generate speech in many different voices.

Architecture

The Tacotron2 architecture consists of four major components:

Encoder

The encoder transforms a sequence of text tokens (characters or phonemes) into a hidden representation:

  1. Embedding layer: Maps discrete token indices to 1024-dimensional dense vectors
  2. Convolutional stack: 6 layers of 1D convolutions (kernel size 5) with batch normalization and ReLU activation, extracting local context
  3. Bidirectional LSTM: Captures long-range dependencies in the text, producing encoder outputs of dimension 1024
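
The three encoder stages above can be sketched in PyTorch. This is a minimal illustration, not SpeechBrain's actual module: dimensions follow the list (1024-dim embeddings, 6 convolutions with kernel size 5, 1024-dim bidirectional LSTM outputs), while the class name, vocabulary size, and argument names are illustrative.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of the encoder: embedding -> conv stack -> BiLSTM."""

    def __init__(self, n_symbols=148, emb_dim=1024, n_convs=6, kernel=5):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(n_convs)
        )
        # Bidirectional: 512 units per direction -> 1024-dim outputs
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer token indices
        x = self.embedding(tokens).transpose(1, 2)  # (B, emb_dim, T)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                       # (B, T, emb_dim)
        outputs, _ = self.lstm(x)
        return outputs                              # (B, T, 1024)
```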

Attention Mechanism

The location-sensitive attention mechanism aligns encoder outputs with decoder steps:

  • Combines content-based attention (an additive score over the decoder query and encoder keys) with location-based attention (a convolution over the previous alignment weights)
  • Uses 32 location filters with kernel size 31
  • Produces a context vector at each decoder step by computing a weighted sum over encoder outputs
  • Enables monotonic left-to-right alignment between text and speech
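
A sketch of location-sensitive attention in its standard additive form, with the 32 location filters and kernel size 31 from the list above. The projection dimension (128) and all names are illustrative assumptions, not SpeechBrain's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAttentionSketch(nn.Module):
    """Content-based + location-based attention over encoder outputs."""

    def __init__(self, enc_dim=1024, query_dim=2048, attn_dim=128,
                 n_filters=32, kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        # Location term: convolve the previous alignment weights
        self.loc_conv = nn.Conv1d(1, n_filters, kernel,
                                  padding=kernel // 2, bias=False)
        self.loc_proj = nn.Linear(n_filters, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_weights):
        # query: (B, query_dim); keys: (B, T, enc_dim); prev_weights: (B, T)
        loc = self.loc_conv(prev_weights.unsqueeze(1)).transpose(1, 2)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1)
            + self.key_proj(keys)
            + self.loc_proj(loc)
        )).squeeze(-1)                                       # (B, T)
        weights = F.softmax(energies, dim=-1)
        # Context vector: weighted sum over encoder outputs
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
        return context, weights
```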

Decoder

The autoregressive decoder generates mel-spectrogram frames one at a time:

  1. Pre-net: Two fully connected layers (512 units each) with dropout, processing the previous mel frame as input
  2. Attention RNN: A single-layer LSTM (2048 units) that combines pre-net output with the previous context vector
  3. Decoder RNN: A single-layer LSTM (2048 units) that combines the attention RNN output with the current context vector
  4. Linear projection: Projects decoder RNN output to 80 mel-frequency channels
  5. Gate network: A linear layer with sigmoid activation that predicts when to stop generation
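
One autoregressive step through the five stages above can be sketched as follows. Sizes follow the text (512-unit pre-net, 2048-unit LSTMs, 80 mel channels, 1024-dim context); the attention computation is assumed to happen outside this module, and all names and the dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    """One decoder step: pre-net -> attention RNN -> decoder RNN ->
    mel projection + gate (stop-token) prediction."""

    def __init__(self, n_mels=80, prenet_dim=512, rnn_dim=2048,
                 ctx_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.attn_rnn = nn.LSTMCell(prenet_dim + ctx_dim, rnn_dim)
        self.dec_rnn = nn.LSTMCell(rnn_dim + ctx_dim, rnn_dim)
        self.mel_proj = nn.Linear(rnn_dim + ctx_dim, n_mels)
        self.gate = nn.Linear(rnn_dim + ctx_dim, 1)

    def forward(self, prev_mel, prev_ctx, ctx, attn_state, dec_state):
        # prev_mel: (B, n_mels); prev_ctx / ctx: (B, ctx_dim)
        x = self.prenet(prev_mel)
        attn_state = self.attn_rnn(torch.cat([x, prev_ctx], -1), attn_state)
        dec_state = self.dec_rnn(torch.cat([attn_state[0], ctx], -1),
                                 dec_state)
        h = torch.cat([dec_state[0], ctx], -1)
        mel = self.mel_proj(h)               # next mel frame
        stop = torch.sigmoid(self.gate(h))   # stop probability
        return mel, stop, attn_state, dec_state
```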

Post-Net

A post-processing network refines the decoder's mel-spectrogram output:

  • 10 convolutional layers (kernel size 5, embedding dim 1024) with batch normalization and tanh activation
  • Predicts a residual that is added to the decoder output
  • Improves spectral detail and reduces artifacts

Multi-Speaker Conditioning

Speaker identity is injected by concatenating the 192-dimensional speaker embedding to the encoder output at every time step. This allows the decoder's attention and generation process to be influenced by the target speaker's vocal characteristics.
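
The conditioning scheme reduces to broadcasting the per-utterance embedding across time and concatenating along the feature axis, as in this sketch (function name is illustrative):

```python
import torch

def condition_on_speaker(enc_out, spk_emb):
    """Concatenate a speaker embedding to every encoder time step.

    enc_out: (B, T, 1024) encoder outputs
    spk_emb: (B, 192) precomputed speaker embedding
    returns: (B, T, 1024 + 192) conditioned encoder outputs
    """
    spk = spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
    return torch.cat([enc_out, spk], dim=-1)
```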

Training Loss

The training objective combines multiple loss terms:

Mel-Spectrogram Loss

Mean Squared Error (MSE) is computed between predicted and target mel-spectrograms at two stages:

  • Pre-net MSE: Loss on the decoder's direct output (before post-net)
  • Post-net MSE: Loss on the refined output (after post-net)

Both losses are summed, encouraging the decoder to produce reasonable output even without the post-net.
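
The two-stage mel loss is simply the sum of the two MSE terms (a sketch; the function name and argument order are illustrative):

```python
import torch
import torch.nn.functional as F

def mel_loss(mel_before, mel_after, mel_target):
    """Sum of MSE on the decoder output (before the post-net) and on
    the post-net-refined output, both against the same target."""
    return (F.mse_loss(mel_before, mel_target)
            + F.mse_loss(mel_after, mel_target))
```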

Gate Loss

Binary Cross-Entropy (BCE) loss on the stop token predictions. The target is 0 for all frames except the last, which is 1. Weighted by gate_loss_weight (default: 1.0).
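
A sketch of the gate loss, building the 0/1 targets from utterance lengths as described above (names are illustrative; SpeechBrain's actual masking of padded frames is omitted):

```python
import torch
import torch.nn.functional as F

def gate_loss(gate_logits, lengths, gate_loss_weight=1.0):
    """BCE on stop-token logits; target is 1 only at the final frame.

    gate_logits: (B, T_max) raw gate outputs before the sigmoid
    lengths: (B,) number of valid frames per utterance
    """
    targets = torch.zeros_like(gate_logits)
    targets[torch.arange(gate_logits.size(0)), lengths - 1] = 1.0
    return gate_loss_weight * F.binary_cross_entropy_with_logits(
        gate_logits, targets)
```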

Guided Attention Loss

An auxiliary loss that encourages roughly diagonal (monotonic) attention alignments during early training:

  • Penalizes attention weights that deviate from the diagonal
  • Weight is controlled by guided_attention_weight (default: 25.0) with exponential decay (half-life of 25 epochs)
  • Hard stop after epoch 50 (attention loss weight drops to 0)
  • guided_attention_sigma (default: 0.2) controls the width of the allowed diagonal band
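
The decay schedule and the soft diagonal penalty can be sketched as follows. The schedule implements the half-life and hard stop from the list above; the mask uses the standard guided-attention formulation, with sigma controlling the band width (function names are illustrative, not SpeechBrain's exact code):

```python
import math

def guided_attention_weight(epoch, initial=25.0, half_life=25,
                            hard_stop=50):
    """Exponentially decayed loss weight, zeroed after the hard stop."""
    if epoch > hard_stop:
        return 0.0
    return initial * 0.5 ** (epoch / half_life)

def guided_attention_mask(n_text, n_mel, sigma=0.2):
    """Penalty mask W[s][t] = 1 - exp(-(t/N - s/M)^2 / (2*sigma^2)):
    near zero on the diagonal, approaching 1 far from it."""
    return [[1.0 - math.exp(-((t / n_text - s / n_mel) ** 2)
                            / (2.0 * sigma ** 2))
             for t in range(n_text)] for s in range(n_mel)]
```

The guided-attention loss is then the mean of the attention weights multiplied elementwise by this mask, so off-diagonal attention mass is penalized while diagonal alignments are nearly free.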

Training Configuration

Key hyperparameters from the SpeechBrain recipe:

Parameter | Value | Description
Epochs | 700 | Total training epochs
Batch size | 32 | Must be at least 2 for batch normalization
Learning rate | 0.001 | With Noam scheduler (4000 warmup steps)
Optimizer | Adam | Weight decay 0.000006
Sample rate | 16000 Hz | Audio sample rate
Mel channels | 80 | Number of mel-frequency bins
Max decoder steps | 1500 | Upper bound on generation length at inference
Gate threshold | 0.5 | Stop generation when the gate output exceeds this value
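
The Noam schedule referenced in the table combines linear warmup with inverse-square-root decay, peaking at the base learning rate when the warmup ends. This is a sketch of that shape; SpeechBrain's NoamScheduler may differ in normalization details:

```python
def noam_lr(step, base_lr=0.001, warmup=4000):
    """Noam-style learning rate: linear warmup for `warmup` steps,
    then decay proportional to 1/sqrt(step), peak = base_lr."""
    step = max(step, 1)
    return base_lr * min(step / warmup, (warmup / step) ** 0.5)
```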

Training Loop Specifics

The Tacotron2Brain class extends SpeechBrain's Brain with custom behavior:

  • fit_batch: Calls the parent fit_batch and then applies learning rate annealing via the Noam scheduler
  • batch_to_device: Custom method that unpacks the complex batch tuple (text, mel, gate, speaker embeddings, speaker IDs) and transfers each tensor to the target device
  • Progress logging: Saves spectrogram images and audio samples every 10 epochs for monitoring training quality
  • Checkpoint management: Saves checkpoints based on validation loss with optional interval-based retention
