Principle:Speechbrain_Speechbrain_HiFi_GAN_Vocoder_Training
| Property | Value |
|---|---|
| Concept | Training GAN-based neural vocoders that convert mel-spectrograms to high-fidelity waveforms |
| Domains | Text_to_Speech, GAN_Training |
| Repository | speechbrain/speechbrain |
| Knowledge Sources | Kong et al. 2020 "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch |
Overview
HiFi-GAN is a generative adversarial network (GAN) that synthesizes high-fidelity speech waveforms from mel-spectrogram inputs. It serves as the vocoder stage in a two-stage TTS pipeline, converting the mel-spectrograms produced by an acoustic model (such as Tacotron2) into audible waveforms. HiFi-GAN achieves synthesis quality comparable to human speech while maintaining fast inference speed.
Architecture
Generator
The generator converts a mel-spectrogram (80 channels) into a raw waveform (1 channel) through progressive upsampling:
Transposed Convolution Upsampling
A sequence of transposed 1D convolutions increases the temporal resolution from mel-spectrogram frame rate to waveform sample rate. With the default configuration:
- Upsample factors: [8, 8, 2, 2] (total: 256x, matching the hop length)
- Upsample kernel sizes: [16, 16, 4, 4]
- Initial channel count: 512, halving at each stage
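The upsampling arithmetic can be sketched in PyTorch. This is an illustrative reconstruction from the configuration above, not the SpeechBrain implementation: the input projection convolution and padding choices are assumptions made so the factors multiply out exactly to 256x.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the generator's upsampling stack.
# Factors [8, 8, 2, 2] multiply to 256 (the hop length);
# channels halve at each stage starting from 512.
factors = [8, 8, 2, 2]
kernels = [16, 16, 4, 4]

layers = [nn.Conv1d(80, 512, kernel_size=7, padding=3)]  # mel input projection (assumed)
ch = 512
for f, k in zip(factors, kernels):
    # padding = (k - f) // 2 makes each stage an exact f-times upsampler
    layers.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=k,
                                     stride=f, padding=(k - f) // 2))
    ch //= 2
upsampler = nn.Sequential(*layers)

mel = torch.randn(1, 80, 32)   # (batch, mel_channels, frames)
out = upsampler(mel)           # (batch, 32, frames * 256)
```

With 32 mel frames in, the output has 32 × 256 = 8192 samples, matching the training segment size.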
Multi-Receptive Field Fusion (MRF)
After each upsampling stage, a Multi-Receptive Field Fusion module applies multiple parallel residual blocks with different kernel sizes and dilation rates:
- Kernel sizes: [3, 7, 11]
- Dilation sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
Each residual block captures patterns at a different temporal scale. The outputs are summed to produce a rich multi-scale representation. This is the key innovation of HiFi-GAN: modeling periodic patterns at multiple scales simultaneously.
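A minimal sketch of the MRF idea follows. The kernel sizes and dilations match the V1 configuration above, but the residual block structure is simplified relative to the real implementation (it omits weight normalization and the second convolution per dilation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified residual block: stacked dilated convolutions with skip connections.
class ResBlock(nn.Module):
    def __init__(self, channels, kernel, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel, dilation=d,
                      padding=(kernel - 1) * d // 2)  # keeps length unchanged
            for d in dilations
        )

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))
        return x

# Multi-Receptive Field Fusion: parallel blocks with different kernel sizes,
# whose outputs are summed (and averaged over the branch count).
class MRF(nn.Module):
    def __init__(self, channels, kernels=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlock(channels, k) for k in kernels)

    def forward(self, x):
        return sum(b(x) for b in self.blocks) / len(self.blocks)

x = torch.randn(2, 64, 100)
y = MRF(64)(x)   # temporal length and channel count are preserved
```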
Output Convolution
A final 1D convolution with tanh activation produces the single-channel waveform output.
Discriminator
HiFi-GAN uses a composite discriminator consisting of two types of sub-discriminators:
Multi-Period Discriminator (MPD)
The MPD reshapes the 1D waveform into 2D representations at different periods (2, 3, 5, 7, 11) and applies 2D convolutions. This captures periodic structures in speech at different fundamental frequencies.
Multi-Scale Discriminator (MSD)
The MSD operates on the raw waveform and its downsampled versions (2x, 4x), using 1D convolutions to evaluate audio quality at different temporal scales. This captures both fine-grained sample-level details and coarse spectral patterns.
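The 2x and 4x inputs are typically produced by strided average pooling, as in the original HiFi-GAN paper; the exact pooling parameters below are that paper's choice and are assumed here, not taken from the SpeechBrain recipe.

```python
import torch
import torch.nn as nn

# Downsampled MSD inputs via strided average pooling (kernel 4, stride 2),
# following the original HiFi-GAN setup.
pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

wav = torch.randn(1, 1, 8192)
wav_2x = pool(wav)       # roughly half the sample rate
wav_4x = pool(wav_2x)    # roughly a quarter of the sample rate
```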
Both discriminators produce multiple real/fake scores and intermediate features used in the training losses.
Training Losses
HiFi-GAN training combines three categories of losses:
Generator Losses
- L1 Mel-Spectrogram Reconstruction Loss (`l1_spec_loss`): Computes the L1 distance between the mel-spectrogram of the generated waveform and the target mel-spectrogram. This provides a strong supervision signal for spectral accuracy. Weight: 45.
- Feature Matching Loss (`feat_match_loss`): Computes the L1 distance between intermediate features of the discriminator for real and generated audio. This encourages the generator to match the discriminator's internal representations. Weight: 10.
- Adversarial Loss (`mseg_loss`): Mean Squared Error between discriminator scores for generated audio and the target value of 1 (real). Encourages the generator to fool the discriminator. Weight: 1.
Discriminator Loss
- MSE Discriminator Loss (`msed_loss`): Mean Squared Error between discriminator scores for real audio (target: 1) and generated audio (target: 0). Trains the discriminator to distinguish real from generated speech.
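The loss arithmetic can be written out as a hedged sketch. The weights (45, 10, 1) and the least-squares GAN targets come from the description above; the function signatures are illustrative, not the SpeechBrain API.

```python
import torch
import torch.nn.functional as F

# Generator loss: weighted sum of mel reconstruction, feature matching,
# and the adversarial (LSGAN) term. Discriminator outputs are given as
# lists of score tensors / feature tensors from the sub-discriminators.
def generator_loss(mel_fake, mel_real, feats_fake, feats_real, scores_fake):
    l1_spec = F.l1_loss(mel_fake, mel_real)
    feat_match = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))
    adv = sum(torch.mean((s - 1.0) ** 2) for s in scores_fake)  # target 1 (real)
    return 45.0 * l1_spec + 10.0 * feat_match + 1.0 * adv

# Discriminator loss: real scores pushed toward 1, fake scores toward 0.
def discriminator_loss(scores_real, scores_fake):
    return sum(
        torch.mean((sr - 1.0) ** 2) + torch.mean(sf ** 2)
        for sr, sf in zip(scores_real, scores_fake)
    )

mel = torch.randn(2, 80, 32)
feats = [torch.randn(2, 16)]
g = generator_loss(mel, mel, feats, feats, [torch.ones(2)])   # 0 at the optimum
d = discriminator_loss([torch.ones(2)], [torch.zeros(2)])     # 0 at the optimum
```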
Adversarial Training Strategy
The generator and discriminator are trained alternately within each batch:
- Step 1 - Train discriminator: Forward pass through generator (detached), compute discriminator loss on real vs. generated audio, update discriminator weights
- Step 2 - Train generator: Re-score generated audio with updated discriminator, compute generator losses (adversarial + feature matching + mel reconstruction), update generator weights
This alternating approach prevents mode collapse and ensures the discriminator provides useful gradients to the generator throughout training.
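The two-step ordering can be sketched as a single training step. This is an illustrative skeleton of the pattern (detach in step 1, re-score in step 2); the models, optimizers, and loss callables are assumed to be supplied by the caller.

```python
import torch

def train_step(gen, disc, opt_g, opt_d, mel, wav_real, d_loss_fn, g_loss_fn):
    wav_fake = gen(mel)

    # Step 1: update the discriminator on real vs. detached generated audio,
    # so no gradients flow back into the generator.
    opt_d.zero_grad()
    d_loss = d_loss_fn(disc(wav_real), disc(wav_fake.detach()))
    d_loss.backward()
    opt_d.step()

    # Step 2: re-score the generated audio with the freshly updated
    # discriminator, then update only the generator.
    opt_g.zero_grad()
    g_loss = g_loss_fn(disc(wav_fake), wav_fake, wav_real)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Tiny stand-in models just to exercise the step ordering.
gen = torch.nn.Conv1d(80, 1, 1)
disc = torch.nn.Conv1d(1, 1, 1)
opt_g = torch.optim.AdamW(gen.parameters(), lr=2e-4, betas=(0.8, 0.99))
opt_d = torch.optim.AdamW(disc.parameters(), lr=2e-4, betas=(0.8, 0.99))
d_loss_fn = lambda real, fake: ((real - 1) ** 2).mean() + (fake ** 2).mean()
g_loss_fn = lambda score, fake, real: ((score - 1) ** 2).mean() + (fake - real).abs().mean()

dl, gl = train_step(gen, disc, opt_g, opt_d,
                    torch.randn(2, 80, 16), torch.randn(2, 1, 16),
                    d_loss_fn, g_loss_fn)
```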
Training Configuration
Key hyperparameters from the SpeechBrain recipe:
| Parameter | Value | Description |
|---|---|---|
| Epochs | 100 | Total training epochs |
| Batch size | 64 | Training batch size |
| Segment size | 8192 | Waveform segment length in samples (~0.5s at 16kHz) |
| Learning rate | 0.0002 | For both generator and discriminator |
| Optimizer | AdamW | Betas: (0.8, 0.99) |
| LR schedule | ExponentialLR | Gamma: 0.9999 |
| Sample rate | 16000 Hz | Audio sample rate |
| Generator version | V1 | 512 initial channels, resblock type "1" |
Segment-Based Training
Unlike the acoustic model, which processes full utterances, the vocoder trains on fixed-length audio segments (8192 samples at 16 kHz, approximately 0.5 seconds). This is done because:
- Full-utterance training would require excessive memory for waveform generation
- Random segment extraction provides data augmentation
- Short segments are sufficient for learning local waveform patterns
The `segment` field in the data manifest controls whether random cropping is applied.
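Random segment cropping must keep the mel frames and waveform samples aligned through the hop length. The helper below is an assumed sketch of that bookkeeping, not the SpeechBrain data pipeline.

```python
import torch

# Crop an aligned (waveform, mel) pair: a random mel-frame window is chosen,
# and the waveform window is derived from it via the hop length (256).
def crop_segment(wav, mel, segment_size=8192, hop_length=256):
    frames = segment_size // hop_length              # 8192 / 256 = 32 mel frames
    max_start = mel.shape[-1] - frames
    start = torch.randint(0, max_start + 1, (1,)).item()
    mel_seg = mel[..., start:start + frames]
    wav_seg = wav[..., start * hop_length:start * hop_length + segment_size]
    return wav_seg, mel_seg

wav = torch.randn(1, 100 * 256)   # utterance of 100 mel frames' worth of audio
mel = torch.randn(80, 100)
wav_seg, mel_seg = crop_segment(wav, mel)
```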
Inference
At inference time, the generator processes the complete mel-spectrogram in a single forward pass (no segmentation). Weight normalization, used during training for stability, is removed before inference to avoid artifacts.
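With PyTorch's built-in utility, stripping weight normalization is a walk over the module tree, as in the sketch below (SpeechBrain exposes its own helper for this on the HiFi-GAN generator; the recursive function here is an assumed stand-in).

```python
import torch
import torch.nn as nn

# Remove weight normalization from every submodule that has it applied.
# remove_weight_norm raises ValueError on modules without it, which we skip.
def remove_all_weight_norm(model):
    for module in model.modules():
        try:
            nn.utils.remove_weight_norm(module)
        except ValueError:
            pass  # module had no weight norm

net = nn.Sequential(nn.utils.weight_norm(nn.Conv1d(80, 32, 3)))
remove_all_weight_norm(net)   # folds weight_g / weight_v back into weight
```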
See Also
- Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch - The `HifiGanBrain` class implementing the training loop
- Heuristic:Speechbrain_Speechbrain_GAN_Dual_Optimizer_Pattern
- Principle:Speechbrain_Speechbrain_Tacotron2_Acoustic_Model_Training - Acoustic model that produces mel-spectrograms for vocoding
- Principle:Speechbrain_Speechbrain_TTS_Inference_Pipeline - End-to-end pipeline combining acoustic model and vocoder