Heuristic:Facebookresearch Audiocraft Audio Normalization Strategies
| Knowledge Sources | |
|---|---|
| Domains | Audio_Processing, Audio_Generation |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Audio normalization strategies with specific headroom values (peak=1dB, RMS=18dB, loudness=14dB) and a silence energy floor of 2e-3 to prevent amplification artifacts.
Description
AudioCraft provides three normalization strategies for output audio: peak normalization, RMS normalization, and loudness normalization measured per ITU-R BS.1770-4. Each strategy has its own default headroom, chosen to prevent clipping while keeping the output perceptually loud. A critical detail is the energy floor of 2e-3 in `normalize_loudness`: audio whose RMS energy falls below this threshold is treated as silence and returned unscaled, which prevents a massive gain from being applied to near-silent signals.
Additionally, when seeking within MP3 files, AudioCraft applies a 0.1-second negative offset to avoid edge artifacts from the MP3 decoder, which is not frame-accurate for seeking.
Usage
Apply these strategies when writing generated audio to disk via audio_write() or when processing audio inputs for conditioning. Incorrect normalization can cause clipping (too loud), inaudible output (too quiet), or numerical instability (amplifying silence).
The Insight (Rule of Thumb)
- Action: Choose the normalization strategy based on the use case: `peak` for maximum loudness without clipping, `rms` for consistent perceived loudness, `loudness` for a target level measured per ITU-R BS.1770-4.
- Value: Peak headroom = 1 dB (`peak_clip_headroom_db`, scale factor ~0.89), RMS headroom = 18 dB, loudness target = -14 LUFS. Energy floor = 2e-3 RMS.
- Trade-off: RMS headroom is intentionally much larger than peak headroom (18 dB vs 1 dB) because RMS normalization does not account for peak transients. A small RMS headroom can push those transients past full scale and cause clipping.
- MP3 seeking: Always seek 0.1 seconds before the target position to avoid decoder edge artifacts.
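The headroom figures above are easier to compare once converted from decibels to linear amplitude. The following is a minimal sketch of that arithmetic using only the standard library; the helper name `db_to_amplitude` is illustrative and not part of AudioCraft.

```python
import math


def db_to_amplitude(db: float) -> float:
    """Convert a level in decibels to a linear amplitude ratio."""
    return 10.0 ** (db / 20.0)


# Headroom values quoted above, expressed as negative levels relative to full scale.
peak_scale = db_to_amplitude(-1.0)    # target peak for 1 dB of headroom
rms_scale = db_to_amplitude(-18.0)    # target RMS for 18 dB of headroom
energy_floor_db = 20.0 * math.log10(2e-3)  # the 2e-3 floor expressed in dB re full scale

print(f"peak scale  ~ {peak_scale:.3f}")
print(f"rms scale   ~ {rms_scale:.3f}")
print(f"floor level ~ {energy_floor_db:.1f} dB")
```

Seen this way, the energy floor sits roughly 54 dB below full scale, well under the RMS target of about 0.126.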
Reasoning
The headroom values are tuned for audio generation quality:
- Peak headroom (1 dB): Leaves minimal headroom since peak normalization directly limits the maximum sample value. The scale factor is 10^(-1/20) ≈ 0.89.
- RMS headroom (18 dB): Must be much larger because audio with a high peak-to-RMS ratio (crest factor) can clip even after RMS normalization. Music typically has a crest factor of 10-20 dB.
- Loudness (-14 LUFS): Loudness is measured per ITU-R BS.1770-4, which defines the measurement method rather than a mandatory target. With the default `loudness_headroom_db` of 14 the output is scaled to -14 LUFS, a level commonly used by streaming services. Signals with energy below the floor (2e-3) are left untouched to avoid amplifying quantization noise or silence into full-scale artifacts.
- MP3 offset: MP3 is a frame-based codec, so a seek lands on a frame boundary rather than on an exact sample, and the first frames decoded after a seek depend on data that precedes them. The 0.1 s negative offset gives the decoder enough context to produce clean audio at the target position.
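The crest-factor argument for the large RMS headroom can be checked numerically. The sketch below is a simplified illustration, not AudioCraft's implementation: it uses plain Python lists, synthetic test signals, and hypothetical helper names, and it scales a signal so that its RMS sits a given number of dB below full scale.

```python
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def crest_factor_db(samples):
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak / rms(samples))


def peak_after_rms_normalization(samples, headroom_db):
    """Peak amplitude after scaling so that the RMS sits headroom_db below full scale."""
    target_rms = 10.0 ** (-headroom_db / 20.0)
    gain = target_rms / rms(samples)
    return gain * max(abs(s) for s in samples)


# Synthetic, illustrative signals (not taken from AudioCraft).
sine = [math.sin(2 * math.pi * n / 100) for n in range(1000)]
# Sparse clicks: one full-scale sample every 50 samples, crest factor about 17 dB.
clicks = [1.0 if n % 50 == 0 else 0.0 for n in range(1000)]

for name, signal in [("sine", sine), ("clicks", clicks)]:
    cf = crest_factor_db(signal)
    for headroom in (6.0, 18.0):
        peak = peak_after_rms_normalization(signal, headroom)
        status = "clips" if peak > 1.0 else "ok"
        print(f"{name}: crest={cf:.1f} dB, headroom={headroom:.0f} dB, peak={peak:.2f} ({status})")
```

A signal clips after RMS normalization whenever its crest factor exceeds the headroom, which is why the transient-heavy signal survives 18 dB of headroom but not 6 dB.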
Code Evidence
Energy floor for silence detection from audiocraft/data/audio_utils.py:62-88:

    def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db: float = 14,
                           loudness_compressor: bool = False, energy_floor: float = 2e-3):
        """Normalize audio loudness using ITU-R BS.1770-4."""
        energy = wav.pow(2).mean().sqrt().item()
        if energy < energy_floor:
            return wav
        # ... loudness normalization continues
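To see why the early return matters, the following sketch compares the gain that would be applied with and without an energy floor. It is an illustration under stated assumptions: a plain RMS target stands in for the BS.1770-4 loudness measurement, the signals are synthetic, and `gain_with_floor` is a hypothetical helper rather than an AudioCraft function.

```python
import math

ENERGY_FLOOR = 2e-3  # default energy_floor from the signature above


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gain_with_floor(samples, target_rms, energy_floor=ENERGY_FLOOR):
    """Return the gain to apply; 1.0 (no change) when the signal counts as silence."""
    energy = rms(samples)
    if energy < energy_floor:
        return 1.0
    return target_rms / energy


# Stand-in target level (assumption): an RMS 14 dB below full scale, about 0.2.
target = 10.0 ** (-14.0 / 20.0)

near_silence = [1e-4 * math.sin(2 * math.pi * n / 100) for n in range(1000)]
quiet_signal = [0.05 * math.sin(2 * math.pi * n / 100) for n in range(1000)]

unguarded_gain = target / rms(near_silence)
print(f"gain without floor on near-silence: {unguarded_gain:.0f}x")
print(f"gain with floor on near-silence:    {gain_with_floor(near_silence, target):.1f}x")
print(f"gain with floor on quiet signal:    {gain_with_floor(quiet_signal, target):.2f}x")
```

Without the guard, the near-silent input would be multiplied by a factor in the thousands, turning residual noise into clearly audible output; the quiet but non-silent input is still rescaled normally.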
Headroom defaults from audiocraft/data/audio_utils.py:131-152:

    def normalize_audio(wav: torch.Tensor, normalize: bool = True,
                        strategy: str = 'peak', peak_clip_headroom_db: float = 1,
                        rms_headroom_db: float = 18, loudness_headroom_db: float = 14,
                        ...):
MP3 seeking offset from audiocraft/data/audio.py:89-91:

    # we need a small negative offset otherwise we get some edge artifact
    # from the mp3 decoder.
    af.seek(int(max(0, (seek_time - 0.1)) / stream.time_base), stream=stream)
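The timestamp arithmetic in that call can be isolated in a small sketch. It assumes the stream's `time_base` behaves like a `fractions.Fraction`; the function names and the example `time_base` value are hypothetical. The second helper shows one plausible way a caller could drop the extra samples decoded because of the early seek, and the excerpt above does not confirm that AudioCraft does exactly this.

```python
from fractions import Fraction

MP3_SEEK_OFFSET = 0.1  # seconds, the offset used in the snippet above


def seek_timestamp(seek_time: float, time_base: Fraction, offset: float = MP3_SEEK_OFFSET) -> int:
    """Timestamp (in stream time_base units) to seek to, clamped at the start of the file."""
    return int(max(0.0, seek_time - offset) / time_base)


def samples_to_discard(seek_time: float, landed_time: float, sample_rate: int) -> int:
    """Decoded samples to drop so that the output starts at seek_time (assumed post-seek step)."""
    return max(0, int(round((seek_time - landed_time) * sample_rate)))


time_base = Fraction(1, 1000)  # hypothetical value; real streams report their own

print(seek_timestamp(2.0, time_base))        # timestamp for roughly 1.9 s
print(seek_timestamp(0.05, time_base))       # clamped to the start of the file
print(samples_to_discard(2.0, 1.9, 44100))   # samples decoded before the requested time
```

The `max(0, ...)` clamp matters for seek times below 0.1 s, where the offset would otherwise produce a negative timestamp.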
Format-specific reading strategy from audiocraft/data/audio.py:129:

    if fp.suffix in ['.flac', '.ogg']:  # TODO: check if we can safely use av_read for .ogg
        return soundfile_read(filepath, seek_time, duration, pad=pad)