Heuristic:Facebookresearch Audiocraft Audio Normalization Strategies

Domains: Audio_Processing, Audio_Generation
Last Updated: 2026-02-13 23:00 GMT

Overview

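Audio normalization strategies in AudioCraft, with their default headroom values (peak: 1 dB, RMS: 18 dB, loudness: 14 dB) and an RMS energy floor of 2e-3 below which audio is treated as silence and left unscaled, so that near-silent signals are not amplified into artifacts.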

Description

AudioCraft provides three normalization strategies for output audio: peak normalization, RMS normalization, and loudness normalization measured per ITU-R BS.1770-4. Each has its own default headroom (1 dB, 18 dB, and 14 dB respectively), chosen to avoid clipping while keeping the output reasonably loud. A critical detail in the loudness path is the energy floor of 2e-3: any audio whose RMS energy falls below this threshold is considered silence and is returned unchanged, which prevents a very large gain from being applied to near-silent signals.

Additionally, when seeking within MP3 files, AudioCraft seeks 0.1 seconds before the requested position (clamped at 0) to avoid edge artifacts from the MP3 decoder, whose output is not clean when decoding starts exactly at the seek point.

Usage

Apply these strategies when writing generated audio to disk via audio_write() or when processing audio inputs for conditioning. Incorrect normalization can cause clipping (output too loud), inaudible output (too quiet), or amplified noise (gain applied to near-silent audio).

The Insight (Rule of Thumb)

  • Action: Choose the normalization strategy based on use case: peak (the default in the signature shown under Code Evidence) for maximum level without clipping, rms for consistent average level across clips, loudness for a fixed perceptual loudness target.
  • Value: Peak headroom = 1 dB (scale factor ~0.89), RMS headroom = 18 dB (scale factor ~0.126), loudness headroom = 14 dB (target of -14 LUFS). Energy floor = 2e-3 RMS.
  • Trade-off: RMS headroom is intentionally much larger than peak headroom (18 dB vs 1 dB) because RMS normalization sets the average level and does not bound peak transients. With a small RMS headroom, any signal whose peaks sit further above its RMS than the headroom will clip.
  • MP3 seeking: Always seek 0.1 seconds before the target position (clamped at 0) to avoid decoder edge artifacts.
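
The headroom values in this list are given in decibels; converting them to linear amplitude makes the gap between the peak and RMS settings concrete. The following is a minimal sketch, assuming the usual amplitude convention scale = 10^(-dB/20). The helper name `headroom_to_scale` is hypothetical and is not part of AudioCraft.

```python
def headroom_to_scale(headroom_db: float) -> float:
    """Convert a headroom in dB to a linear amplitude factor (hypothetical helper)."""
    return 10 ** (-headroom_db / 20)


peak_scale = headroom_to_scale(1)    # target peak amplitude for 1 dB of headroom
rms_scale = headroom_to_scale(18)    # target RMS amplitude for 18 dB of headroom

print(f"peak target: {peak_scale:.3f}")  # about 0.891
print(f"rms target:  {rms_scale:.3f}")   # about 0.126
```

With these defaults, peak normalization places the largest sample at roughly 89% of full scale, while RMS normalization places the average level at roughly 13% of full scale.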

Reasoning

The headroom defaults follow from what each strategy measures:

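  • Peak headroom (1 dB): Leaves minimal headroom, since peak normalization directly limits the maximum sample value. The scale factor is 10^(-1/20) ≈ 0.89.
  • RMS headroom (18 dB): Must be much larger, because audio with a high peak-to-RMS ratio (crest factor) can clip even after RMS normalization. Music typically has a crest factor of 10-20 dB, so the peaks of an RMS-normalized signal land at roughly (crest factor - 18) dB relative to full scale.
  • Loudness (14 dB headroom, i.e. a target of -14 LUFS): ITU-R BS.1770-4 defines how loudness is measured, not which level to aim for. -14 LUFS is a level commonly used by streaming platforms and is louder than typical broadcast targets such as -23 LUFS (EBU R 128). Signals with energy below the floor (2e-3) are left untouched to avoid amplifying quantization noise or silence into full-scale artifacts.
  • MP3 offset: MP3 is a frame-based codec in which decoding a frame depends on the frames before it (overlapping transform windows and the bit reservoir), so output is not clean when decoding starts exactly at the seek point. The 0.1s negative offset gives the decoder enough preceding context to produce clean audio at the target position.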
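
The crest-factor argument for the large RMS headroom can be checked numerically. This is an illustrative sketch using a toy impulse-train signal and plain Python lists rather than AudioCraft's tensors. The function names are hypothetical, and the 6 dB comparison value is chosen only for contrast.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def peak_after_rms_normalization(samples, rms_headroom_db):
    """Peak amplitude after scaling so the RMS sits rms_headroom_db below full scale."""
    target_rms = 10 ** (-rms_headroom_db / 20)
    gain = target_rms / rms(samples)
    return gain * max(abs(s) for s in samples)


# Toy signal: one full-scale impulse every 16 samples (crest factor about 12 dB).
signal = ([1.0] + [0.0] * 15) * 100
crest_db = 20 * math.log10(max(abs(s) for s in signal) / rms(signal))

peak_18 = peak_after_rms_normalization(signal, 18)  # stays below 1.0: no clipping
peak_6 = peak_after_rms_normalization(signal, 6)    # exceeds 1.0: would clip

print(f"crest: {crest_db:.1f} dB, peak at 18 dB: {peak_18:.2f}, peak at 6 dB: {peak_6:.2f}")
```

The signal clips whenever its crest factor exceeds the RMS headroom, which is why a headroom near the upper end of the typical crest-factor range for music is the safer default.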

Code Evidence

Energy floor for silence detection from audiocraft/data/audio_utils.py:62-88:

def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db: float = 14,
                       loudness_compressor: bool = False, energy_floor: float = 2e-3):
    """Normalize audio loudness using ITU-R BS.1770-4."""
    energy = wav.pow(2).mean().sqrt().item()
    if energy < energy_floor:
        return wav
    # ... loudness normalization continues
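
To see why the early return matters, consider the gain a normalizer would otherwise apply to a near-silent input. The sketch below is a simplified stand-in: it uses a plain RMS target in place of the BS.1770 loudness measurement and Python lists in place of tensors. The function name `gated_rms_normalize` and the `target_rms` value are assumptions made for illustration; only the 2e-3 floor comes from the excerpt above.

```python
import math

ENERGY_FLOOR = 2e-3  # same default as in the excerpt above


def gated_rms_normalize(samples, target_rms=0.1, energy_floor=ENERGY_FLOOR):
    """Scale samples to target_rms, but leave near-silent input untouched."""
    energy = math.sqrt(sum(s * s for s in samples) / len(samples))
    if energy < energy_floor:
        return list(samples)  # treated as silence: no gain applied
    gain = target_rms / energy
    return [s * gain for s in samples]


noise_floor = [1e-4, -1e-4] * 500   # RMS of 1e-4, below the floor
quiet_music = [0.01, -0.01] * 500   # RMS of 0.01, above the floor

unchanged = gated_rms_normalize(noise_floor)
boosted = gated_rms_normalize(quiet_music)

# Without the gate, the noise floor would need a gain of 0.1 / 1e-4 = 1000x (60 dB).
print(max(unchanged), max(boosted))
```

The gate trades a small discontinuity (very quiet clips are not normalized at all) for protection against turning background noise into a full-level signal.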

Headroom defaults from audiocraft/data/audio_utils.py:131-152:

def normalize_audio(wav: torch.Tensor, normalize: bool = True,
                    strategy: str = 'peak', peak_clip_headroom_db: float = 1,
                    rms_headroom_db: float = 18, loudness_headroom_db: float = 14,
                    ...):

MP3 seeking offset from audiocraft/data/audio.py:89-91:

# we need a small negative offset otherwise we get some edge artifact
# from the mp3 decoder.
af.seek(int(max(0, (seek_time - 0.1)) / stream.time_base), stream=stream)
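
The arithmetic in that call can be isolated into a small function. This is a sketch under the assumption that the stream time base is a rational number of seconds per tick (represented here with `fractions.Fraction`). The function name `seek_timestamp` and the 1 ms time base are illustrative and are not taken from AudioCraft.

```python
from fractions import Fraction


def seek_timestamp(seek_time: float, time_base: Fraction, offset: float = 0.1) -> int:
    """Convert a target time in seconds to a stream timestamp, backed off by `offset`.

    Targets within `offset` seconds of the start are clamped to timestamp 0.
    """
    return int(max(0.0, seek_time - offset) / time_base)


time_base = Fraction(1, 1000)            # illustrative: one tick per millisecond
print(seek_timestamp(0.05, time_base))   # inside the first 0.1 s, so clamped to 0
print(seek_timestamp(2.5, time_base))    # roughly 2400 ticks, i.e. about 2.4 s
```

Because the seek lands before the requested position, the caller still has to discard decoded audio that precedes the target time.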

Format-specific reading strategy from audiocraft/data/audio.py:129:

if fp.suffix in ['.flac', '.ogg']:  # TODO: check if we can safely use av_read for .ogg
    return soundfile_read(filepath, seek_time, duration, pad=pad)
