Principle:Speechbrain_Speechbrain_HiFi_GAN_Vocoder_Training
| Property | Value |
|---|---|
| Concept | Training GAN-based neural vocoders that convert mel-spectrograms to high-fidelity waveforms |
| Domains | Text_to_Speech, GAN_Training |
| Repository | speechbrain/speechbrain |
| Knowledge Sources | Kong et al. 2020 "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch |
Overview
HiFi-GAN is a generative adversarial network (GAN) that synthesizes high-fidelity speech waveforms from mel-spectrogram inputs. It serves as the vocoder stage in a two-stage TTS pipeline, converting the mel-spectrograms produced by an acoustic model (such as Tacotron2) into audible waveforms. HiFi-GAN achieves synthesis quality comparable to human speech while maintaining fast inference speed.
Architecture
Generator
The generator converts a mel-spectrogram (80 channels) into a raw waveform (1 channel) through progressive upsampling:
Transposed Convolution Upsampling
A sequence of transposed 1D convolutions increases the temporal resolution from mel-spectrogram frame rate to waveform sample rate. With the default configuration:
- Upsample factors: [8, 8, 2, 2] (total: 256x, matching the hop length)
- Upsample kernel sizes: [16, 16, 4, 4]
- Initial channel count: 512, halving at each stage
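The upsampling arithmetic can be sketched in PyTorch. This is an illustrative reconstruction from the configuration above, not the SpeechBrain implementation: the input projection convolution and padding choices are assumptions made so the factors multiply out exactly to 256x.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the generator's upsampling stack.
# Factors [8, 8, 2, 2] multiply to 256 (the hop length);
# channels halve at each stage starting from 512.
factors = [8, 8, 2, 2]
kernels = [16, 16, 4, 4]

layers = [nn.Conv1d(80, 512, kernel_size=7, padding=3)]  # mel input projection (assumed)
ch = 512
for f, k in zip(factors, kernels):
    # padding = (k - f) // 2 makes each stage an exact f-times upsampler
    layers.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=k,
                                     stride=f, padding=(k - f) // 2))
    ch //= 2
upsampler = nn.Sequential(*layers)

mel = torch.randn(1, 80, 32)   # (batch, mel_channels, frames)
out = upsampler(mel)           # (batch, 32, frames * 256)
```

With 32 mel frames in, the output has 32 × 256 = 8192 samples, matching the training segment size.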
Multi-Receptive Field Fusion (MRF)
After each upsampling stage, a Multi-Receptive Field Fusion module applies multiple parallel residual blocks with different kernel sizes and dilation rates:
- Kernel sizes: [3, 7, 11]
- Dilation sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
Each residual block captures patterns at a different temporal scale. The outputs are summed to produce a rich multi-scale representation. This is the key innovation of HiFi-GAN: modeling periodic patterns at multiple scales simultaneously.
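A minimal sketch of the MRF idea follows. The kernel sizes and dilations match the V1 configuration above, but the residual block structure is simplified relative to the real implementation (it omits weight normalization and the second convolution per dilation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified residual block: stacked dilated convolutions with skip connections.
class ResBlock(nn.Module):
    def __init__(self, channels, kernel, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel, dilation=d,
                      padding=(kernel - 1) * d // 2)  # keeps length unchanged
            for d in dilations
        )

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))
        return x

# Multi-Receptive Field Fusion: parallel blocks with different kernel sizes,
# whose outputs are summed (and averaged over the branch count).
class MRF(nn.Module):
    def __init__(self, channels, kernels=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlock(channels, k) for k in kernels)

    def forward(self, x):
        return sum(b(x) for b in self.blocks) / len(self.blocks)

x = torch.randn(2, 64, 100)
y = MRF(64)(x)   # temporal length and channel count are preserved
```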
Output Convolution
A final 1D convolution with tanh activation produces the single-channel waveform output.
Discriminator
HiFi-GAN uses a composite discriminator consisting of two types of sub-discriminators:
Multi-Period Discriminator (MPD)
The MPD reshapes the 1D waveform into 2D representations at different periods (2, 3, 5, 7, 11) and applies 2D convolutions. This captures periodic structures in speech at different fundamental frequencies.
Multi-Scale Discriminator (MSD)
The MSD operates on the raw waveform and its downsampled versions (2x, 4x), using 1D convolutions to evaluate audio quality at different temporal scales. This captures both fine-grained sample-level details and coarse spectral patterns.
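The 2x and 4x inputs are typically produced by strided average pooling, as in the original HiFi-GAN paper; the exact pooling parameters below are that paper's choice and are assumed here, not taken from the SpeechBrain recipe.

```python
import torch
import torch.nn as nn

# Downsampled MSD inputs via strided average pooling (kernel 4, stride 2),
# following the original HiFi-GAN setup.
pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

wav = torch.randn(1, 1, 8192)
wav_2x = pool(wav)       # roughly half the sample rate
wav_4x = pool(wav_2x)    # roughly a quarter of the sample rate
```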
Both discriminators produce multiple real/fake scores and intermediate features used in the training losses.
Training Losses
HiFi-GAN training combines three categories of losses:
Generator Losses
- L1 Mel-Spectrogram Reconstruction Loss (`l1_spec_loss`): Computes the L1 distance between the mel-spectrogram of the generated waveform and the target mel-spectrogram. This provides a strong supervision signal for spectral accuracy. Weight: 45.
- Feature Matching Loss (`feat_match_loss`): Computes the L1 distance between intermediate features of the discriminator for real and generated audio. This encourages the generator to match the discriminator's internal representations. Weight: 10.
- Adversarial Loss (`mseg_loss`): Mean Squared Error between discriminator scores for generated audio and the target value of 1 (real). Encourages the generator to fool the discriminator. Weight: 1.
Discriminator Loss
- MSE Discriminator Loss (`msed_loss`): Mean Squared Error between discriminator scores for real audio (target: 1) and generated audio (target: 0). Trains the discriminator to distinguish real from generated speech.
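The loss arithmetic can be written out as a hedged sketch. The weights (45, 10, 1) and the least-squares GAN targets come from the description above; the function signatures are illustrative, not the SpeechBrain API.

```python
import torch
import torch.nn.functional as F

# Generator loss: weighted sum of mel reconstruction, feature matching,
# and the adversarial (LSGAN) term. Discriminator outputs are given as
# lists of score tensors / feature tensors from the sub-discriminators.
def generator_loss(mel_fake, mel_real, feats_fake, feats_real, scores_fake):
    l1_spec = F.l1_loss(mel_fake, mel_real)
    feat_match = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))
    adv = sum(torch.mean((s - 1.0) ** 2) for s in scores_fake)  # target 1 (real)
    return 45.0 * l1_spec + 10.0 * feat_match + 1.0 * adv

# Discriminator loss: real scores pushed toward 1, fake scores toward 0.
def discriminator_loss(scores_real, scores_fake):
    return sum(
        torch.mean((sr - 1.0) ** 2) + torch.mean(sf ** 2)
        for sr, sf in zip(scores_real, scores_fake)
    )

mel = torch.randn(2, 80, 32)
feats = [torch.randn(2, 16)]
g = generator_loss(mel, mel, feats, feats, [torch.ones(2)])   # 0 at the optimum
d = discriminator_loss([torch.ones(2)], [torch.zeros(2)])     # 0 at the optimum
```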
Adversarial Training Strategy
The generator and discriminator are trained alternately within each batch:
- Step 1 - Train discriminator: Forward pass through generator (detached), compute discriminator loss on real vs. generated audio, update discriminator weights
- Step 2 - Train generator: Re-score generated audio with updated discriminator, compute generator losses (adversarial + feature matching + mel reconstruction), update generator weights
This alternating approach prevents mode collapse and ensures the discriminator provides useful gradients to the generator throughout training.
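The two-step ordering can be sketched as a single training step. This is an illustrative skeleton of the pattern (detach in step 1, re-score in step 2); the models, optimizers, and loss callables are assumed to be supplied by the caller.

```python
import torch

def train_step(gen, disc, opt_g, opt_d, mel, wav_real, d_loss_fn, g_loss_fn):
    wav_fake = gen(mel)

    # Step 1: update the discriminator on real vs. detached generated audio,
    # so no gradients flow back into the generator.
    opt_d.zero_grad()
    d_loss = d_loss_fn(disc(wav_real), disc(wav_fake.detach()))
    d_loss.backward()
    opt_d.step()

    # Step 2: re-score the generated audio with the freshly updated
    # discriminator, then update only the generator.
    opt_g.zero_grad()
    g_loss = g_loss_fn(disc(wav_fake), wav_fake, wav_real)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Tiny stand-in models just to exercise the step ordering.
gen = torch.nn.Conv1d(80, 1, 1)
disc = torch.nn.Conv1d(1, 1, 1)
opt_g = torch.optim.AdamW(gen.parameters(), lr=2e-4, betas=(0.8, 0.99))
opt_d = torch.optim.AdamW(disc.parameters(), lr=2e-4, betas=(0.8, 0.99))
d_loss_fn = lambda real, fake: ((real - 1) ** 2).mean() + (fake ** 2).mean()
g_loss_fn = lambda score, fake, real: ((score - 1) ** 2).mean() + (fake - real).abs().mean()

dl, gl = train_step(gen, disc, opt_g, opt_d,
                    torch.randn(2, 80, 16), torch.randn(2, 1, 16),
                    d_loss_fn, g_loss_fn)
```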
Training Configuration
Key hyperparameters from the SpeechBrain recipe:
| Parameter | Value | Description |
|---|---|---|
| Epochs | 100 | Total training epochs |
| Batch size | 64 | Training batch size |
| Segment size | 8192 | Waveform segment length in samples (~0.5s at 16kHz) |
| Learning rate | 0.0002 | For both generator and discriminator |
| Optimizer | AdamW | Betas: (0.8, 0.99) |
| LR schedule | ExponentialLR | Gamma: 0.9999 |
| Sample rate | 16000 Hz | Audio sample rate |
| Generator version | V1 | 512 initial channels, resblock type "1" |
Segment-Based Training
Unlike the acoustic model, which processes full utterances, the vocoder trains on fixed-length audio segments (8192 samples at 16 kHz, approximately 0.5 seconds). This is done because:
- Full-utterance training would require excessive memory for waveform generation
- Random segment extraction provides data augmentation
- Short segments are sufficient for learning local waveform patterns
The `segment` field in the data manifest controls whether random cropping is applied.
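Random segment cropping must keep the mel frames and waveform samples aligned through the hop length. The helper below is an assumed sketch of that bookkeeping, not the SpeechBrain data pipeline.

```python
import torch

# Crop an aligned (waveform, mel) pair: a random mel-frame window is chosen,
# and the waveform window is derived from it via the hop length (256).
def crop_segment(wav, mel, segment_size=8192, hop_length=256):
    frames = segment_size // hop_length              # 8192 / 256 = 32 mel frames
    max_start = mel.shape[-1] - frames
    start = torch.randint(0, max_start + 1, (1,)).item()
    mel_seg = mel[..., start:start + frames]
    wav_seg = wav[..., start * hop_length:start * hop_length + segment_size]
    return wav_seg, mel_seg

wav = torch.randn(1, 100 * 256)   # utterance of 100 mel frames' worth of audio
mel = torch.randn(80, 100)
wav_seg, mel_seg = crop_segment(wav, mel)
```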
Inference
At inference time, the generator processes the complete mel-spectrogram in a single forward pass (no segmentation). Weight normalization, used during training for stability, is removed before inference to avoid artifacts.
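With PyTorch's built-in utility, stripping weight normalization is a walk over the module tree, as in the sketch below (SpeechBrain exposes its own helper for this on the HiFi-GAN generator; the recursive function here is an assumed stand-in).

```python
import torch
import torch.nn as nn

# Remove weight normalization from every submodule that has it applied.
# remove_weight_norm raises ValueError on modules without it, which we skip.
def remove_all_weight_norm(model):
    for module in model.modules():
        try:
            nn.utils.remove_weight_norm(module)
        except ValueError:
            pass  # module had no weight norm

net = nn.Sequential(nn.utils.weight_norm(nn.Conv1d(80, 32, 3)))
remove_all_weight_norm(net)   # folds weight_g / weight_v back into weight
```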
See Also
- Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch - The `HifiGanBrain` class implementing the training loop
- Heuristic:Speechbrain_Speechbrain_GAN_Dual_Optimizer_Pattern
- Principle:Speechbrain_Speechbrain_Tacotron2_Acoustic_Model_Training - Acoustic model that produces mel-spectrograms for vocoding
- Principle:Speechbrain_Speechbrain_TTS_Inference_Pipeline - End-to-end pipeline combining acoustic model and vocoder