
Principle:Speechbrain Speechbrain HiFi GAN Vocoder Training



Property | Value
Concept | Training GAN-based neural vocoders that convert mel-spectrograms to high-fidelity waveforms
Domains | Text_to_Speech, GAN_Training
Repository | speechbrain/speechbrain
Knowledge Sources | Kong et al. 2020, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis"
Related Implementation | Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch

Overview

HiFi-GAN is a generative adversarial network (GAN) that synthesizes high-fidelity speech waveforms from mel-spectrogram inputs. It serves as the vocoder stage in a two-stage TTS pipeline, converting the mel-spectrograms produced by an acoustic model (such as Tacotron2) into audible waveforms. HiFi-GAN achieves synthesis quality comparable to human speech while maintaining fast inference speed.

Architecture

Generator

The generator converts a mel-spectrogram (80 channels) into a raw waveform (1 channel) through progressive upsampling:

Transposed Convolution Upsampling

A sequence of transposed 1D convolutions increases the temporal resolution from mel-spectrogram frame rate to waveform sample rate. With the default configuration:

  • Upsample factors: [8, 8, 2, 2] (total: 256x, matching the hop length)
  • Upsample kernel sizes: [16, 16, 4, 4]
  • Initial channel count: 512, halving at each stage
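
As a concrete illustration, here is a minimal PyTorch sketch of such an upsampling stack. The class name UpsamplerStack is hypothetical, and the real SpeechBrain generator also interleaves MRF modules between stages (described next):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the transposed-convolution upsampling stack.
# Recipe defaults: factors [8, 8, 2, 2] (8*8*2*2 = 256 = hop length),
# kernel sizes [16, 16, 4, 4], 512 initial channels halving per stage.
class UpsamplerStack(nn.Module):
    def __init__(self, factors=(8, 8, 2, 2), kernels=(16, 16, 4, 4), channels=512):
        super().__init__()
        ch = channels
        self.stages = nn.ModuleList()
        for f, k in zip(factors, kernels):
            # padding=(k - f)//2 makes the output length exactly f * input length
            self.stages.append(nn.ConvTranspose1d(ch, ch // 2, k, stride=f,
                                                  padding=(k - f) // 2))
            ch //= 2

    def forward(self, x):                    # x: (batch, 512, n_frames)
        for stage in self.stages:
            x = stage(F.leaky_relu(x, 0.1))
        return x                             # (batch, 32, n_frames * 256)

features = torch.randn(1, 512, 40)           # 40 mel frames after the input conv
print(UpsamplerStack()(features).shape)      # torch.Size([1, 32, 10240])
```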

Multi-Receptive Field Fusion (MRF)

After each upsampling stage, a Multi-Receptive Field Fusion module applies multiple parallel residual blocks with different kernel sizes and dilation rates:

  • Kernel sizes: [3, 7, 11]
  • Dilation sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]

Each residual block captures patterns at a different temporal scale. The outputs are summed to produce a rich multi-scale representation. This is the key innovation of HiFi-GAN: modeling periodic patterns at multiple scales simultaneously.
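
A sketch of one MRF module follows, assuming a simplified residual block with one dilated convolution per dilation rate (the actual HiFi-GAN ResBlock pairs each dilated convolution with a non-dilated one):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Simplified dilated residual block: one conv per dilation rate."""
    def __init__(self, channels, kernel_size, dilations):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=(kernel_size - 1) * d // 2)  # keeps length fixed
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual connection
        return x

class MRF(nn.Module):
    """Parallel residual blocks with different receptive fields, summed."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11),
                 dilations=((1, 3, 5), (1, 3, 5), (1, 3, 5))):
        super().__init__()
        self.blocks = nn.ModuleList(
            ResBlock(channels, k, d) for k, d in zip(kernel_sizes, dilations)
        )

    def forward(self, x):
        # Sum over scales; the reference implementation divides by the number
        # of blocks, which only rescales the summed representation.
        return sum(block(x) for block in self.blocks) / len(self.blocks)
```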

Output Convolution

A final 1D convolution with tanh activation produces the single-channel waveform output.
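
Reusing the MRF sketch above, the full generator forward pass can be summarized as follows (HiFiGANGeneratorSketch is an illustrative name, not the SpeechBrain class):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiFiGANGeneratorSketch(nn.Module):
    """Illustrative pipeline: input conv -> [upsample -> MRF] x 4 -> tanh conv."""
    def __init__(self, mel_channels=80, channels=512,
                 factors=(8, 8, 2, 2), kernels=(16, 16, 4, 4)):
        super().__init__()
        self.pre = nn.Conv1d(mel_channels, channels, 7, padding=3)
        self.ups, self.mrfs = nn.ModuleList(), nn.ModuleList()
        ch = channels
        for f, k in zip(factors, kernels):
            self.ups.append(nn.ConvTranspose1d(ch, ch // 2, k, stride=f,
                                               padding=(k - f) // 2))
            ch //= 2
            self.mrfs.append(MRF(ch))        # MRF from the sketch above

        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):                  # mel: (batch, 80, n_frames)
        x = self.pre(mel)
        for up, mrf in zip(self.ups, self.mrfs):
            x = mrf(up(F.leaky_relu(x, 0.1)))
        # Single-channel waveform in [-1, 1]: (batch, 1, n_frames * 256)
        return torch.tanh(self.post(F.leaky_relu(x, 0.1)))
```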

Discriminator

HiFi-GAN uses a composite discriminator consisting of two types of sub-discriminators:

Multi-Period Discriminator (MPD)

The MPD reshapes the 1D waveform into 2D representations at different periods (2, 3, 5, 7, 11) and applies 2D convolutions. This captures periodic structures in speech at different fundamental frequencies.
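
The core operation is a reshape from 1D to 2D; a minimal sketch (to_period_2d is a hypothetical helper):

```python
import torch.nn.functional as F

def to_period_2d(wav, period):
    """Reshape a (batch, 1, T) waveform into (batch, 1, T // period, period)."""
    b, c, t = wav.shape
    if t % period != 0:                         # pad so T divides evenly
        pad = period - (t % period)
        wav = F.pad(wav, (0, pad), mode="reflect")
        t += pad
    return wav.reshape(b, c, t // period, period)

# Each period sub-discriminator applies 2D convolutions to this view, so a
# column of the 2D grid contains samples spaced `period` apart, which exposes
# periodic structure at that period.
```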

Multi-Scale Discriminator (MSD)

The MSD operates on the raw waveform and its downsampled versions (2x, 4x), using 1D convolutions to evaluate audio quality at different temporal scales. This captures both fine-grained sample-level details and coarse spectral patterns.
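
A sketch of the scale views, using the strided average pooling from the original MelGAN/HiFi-GAN discriminators (msd_inputs is a hypothetical helper):

```python
import torch.nn as nn

def msd_inputs(wav):
    """Return the raw, ~2x-, and ~4x-downsampled views fed to the MSD."""
    pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)
    half = pool(wav)             # ~2x downsampled
    quarter = pool(half)         # ~4x downsampled
    return [wav, half, quarter]  # one 1D conv discriminator per view
```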

Both discriminators produce multiple real/fake scores and intermediate features used in the training losses.

Training Losses

HiFi-GAN training combines three categories of losses:

Generator Losses

  1. L1 Mel-Spectrogram Reconstruction Loss (l1_spec_loss): Computes the L1 distance between the mel-spectrogram of the generated waveform and the target mel-spectrogram. This provides a strong supervision signal for spectral accuracy. Weight: 45.
  2. Feature Matching Loss (feat_match_loss): Computes the L1 distance between intermediate features of the discriminator for real and generated audio. This encourages the generator to match the discriminator's internal representations. Weight: 10.
  3. Adversarial Loss (mseg_loss): Mean Squared Error between discriminator scores for generated audio and the target value of 1 (real). Encourages the generator to fool the discriminator. Weight: 1.
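
A hedged sketch of how the three weighted terms might combine; the function and argument names are illustrative, and the actual SpeechBrain loss modules are structured differently:

```python
import torch
import torch.nn.functional as F

def generator_loss(mel_fake, mel_real, scores_fake, feats_fake, feats_real,
                   w_spec=45.0, w_feat=10.0, w_adv=1.0):
    # 1. L1 mel reconstruction between generated- and target-audio spectrograms
    l1_spec = F.l1_loss(mel_fake, mel_real)
    # 2. Feature matching over intermediate discriminator feature maps
    feat_match = sum(F.l1_loss(f, r.detach())
                     for f, r in zip(feats_fake, feats_real)) / len(feats_fake)
    # 3. Least-squares adversarial term: push scores for generated audio to 1
    adv = sum(F.mse_loss(s, torch.ones_like(s)) for s in scores_fake)
    return w_spec * l1_spec + w_feat * feat_match + w_adv * adv
```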

Discriminator Loss

  • MSE Discriminator Loss (msed_loss): Mean Squared Error between discriminator scores for real audio (target: 1) and generated audio (target: 0). Trains the discriminator to distinguish real from generated speech.
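
Continuing the sketch above, the least-squares discriminator objective:

```python
def discriminator_loss(scores_real, scores_fake):
    # Least-squares GAN objective: real scores -> 1, generated scores -> 0
    loss = 0.0
    for s_real, s_fake in zip(scores_real, scores_fake):
        loss = loss + F.mse_loss(s_real, torch.ones_like(s_real)) \
                    + F.mse_loss(s_fake, torch.zeros_like(s_fake))
    return loss
```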

Adversarial Training Strategy

The generator and discriminator are trained alternately within each batch:

  1. Step 1 - Train discriminator: Forward pass through generator (detached), compute discriminator loss on real vs. generated audio, update discriminator weights
  2. Step 2 - Train generator: Re-score generated audio with updated discriminator, compute generator losses (adversarial + feature matching + mel reconstruction), update generator weights

This alternating schedule helps prevent mode collapse and keeps the discriminator supplying useful gradients to the generator throughout training.
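
A sketch of one such alternating step, reusing the loss sketches above (mel_of stands in for a mel-spectrogram transform, e.g. torchaudio's MelSpectrogram, and is not a SpeechBrain function):

```python
def train_step(batch, generator, discriminator, opt_g, opt_d, mel_of):
    mel, wav_real = batch                    # (B, 80, frames), (B, 1, samples)
    wav_fake = generator(mel)

    # Step 1: discriminator update on real vs. detached generated audio
    opt_d.zero_grad()
    scores_real, _ = discriminator(wav_real)
    scores_fake, _ = discriminator(wav_fake.detach())
    d_loss = discriminator_loss(scores_real, scores_fake)
    d_loss.backward()
    opt_d.step()

    # Step 2: generator update, re-scoring with the just-updated discriminator
    opt_g.zero_grad()
    scores_fake, feats_fake = discriminator(wav_fake)
    _, feats_real = discriminator(wav_real)
    g_loss = generator_loss(mel_of(wav_fake), mel_of(wav_real),
                            scores_fake, feats_fake, feats_real)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```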

Training Configuration

Key hyperparameters from the SpeechBrain recipe:

Parameter | Value | Description
Epochs | 100 | Total training epochs
Batch size | 64 | Training batch size
Segment size | 8192 | Waveform segment length in samples (~0.5 s at 16 kHz)
Learning rate | 0.0002 | For both generator and discriminator
Optimizer | AdamW | Betas: (0.8, 0.99)
LR schedule | ExponentialLR | Gamma: 0.9999
Sample rate | 16000 Hz | Audio sample rate
Generator version | V1 | 512 initial channels, resblock type "1"
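
The optimizer and scheduler rows translate directly into PyTorch (assuming generator and discriminator modules are already constructed):

```python
import torch

opt_g = torch.optim.AdamW(generator.parameters(), lr=2e-4, betas=(0.8, 0.99))
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=2e-4, betas=(0.8, 0.99))
sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.9999)
sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=0.9999)
```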

Segment-Based Training

Unlike the acoustic model, which processes full utterances, the vocoder trains on fixed-length audio segments (8192 samples at 16 kHz, approximately 0.5 seconds). This is done because:

  • Full-utterance training would require excessive memory for waveform generation
  • Random segment extraction provides data augmentation
  • Short segments are sufficient for learning local waveform patterns

The segment field in the data manifest controls whether random cropping is applied.
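
A sketch of aligned random cropping (random_segment is a hypothetical helper; it assumes the utterance is at least one segment long):

```python
import torch

def random_segment(wav, mel, segment_size=8192, hop_length=256):
    """Crop aligned waveform/mel segments: 8192 samples <-> 32 mel frames."""
    frames_per_seg = segment_size // hop_length       # 8192 // 256 = 32
    max_start = mel.shape[-1] - frames_per_seg        # assumes mel >= 32 frames
    start = torch.randint(0, max_start + 1, (1,)).item()
    mel_seg = mel[..., start:start + frames_per_seg]
    wav_seg = wav[..., start * hop_length:start * hop_length + segment_size]
    return wav_seg, mel_seg
```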

Inference

At inference time, the generator processes the complete mel-spectrogram in a single forward pass (no segmentation). Weight normalization, used during training for stability, is removed before inference to avoid artifacts.
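
For instance, with PyTorch's built-in utility (a generic sketch; the SpeechBrain generator ships its own removal step, and exact method names may differ):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def vocode(generator, mel):
    # Strip weight normalization before inference: it is only needed for
    # training stability and can introduce artifacts if left in place.
    for module in generator.modules():
        if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d)):
            try:
                nn.utils.remove_weight_norm(module)
            except ValueError:        # module was never weight-normalized
                pass
    generator.eval()
    return generator(mel)             # full mel-spectrogram, one forward pass
```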

See Also

  • Implementation:Speechbrain_Speechbrain_HifiGanBrain_Fit_Batch