Heuristic:Speechbrain Speechbrain Dynamic Mixing Parameters

Knowledge Sources	SpeechBrain WSJ0Mix Dataset Statistics
Domains	Speech_Separation, Data_Augmentation
Last Updated	2026-02-09 20:00 GMT

Overview

Empirically derived loudness parameters for dynamic mixing augmentation: per-speaker gain distributions matched to the WSJ0Mix dataset, plus a 0.9 peak headroom factor applied to the final mixture.

Description

Dynamic mixing augmentation creates speech mixtures on the fly from individual source utterances. For training to be effective, the loudness levels must match the distribution of the original training dataset (WSJ0Mix). The first speaker's gain is drawn from a Gaussian with mean -27.43 dB and standard deviation 2.57 dB. Each subsequent speaker's gain is the first speaker's gain plus an offset drawn from a Gaussian with mean -2.51 dB and standard deviation 2.66 dB. Every gain is clipped to [-45, 0] dB. After mixing, the mixture and its sources are scaled by one common factor so that the largest absolute sample value is 0.9, which leaves headroom against digital clipping.
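
The sampling scheme described above can be sketched with the standard library alone. This is a minimal illustration, not the recipe code: the names `clip` and `sample_gains_db` are hypothetical, and only the numeric constants come from the text.

```python
import random

# Constants quoted in the text above; all names here are illustrative.
FIRST_MEAN_DB, FIRST_STD_DB = -27.43, 2.57
OFFSET_MEAN_DB, OFFSET_STD_DB = -2.51, 2.66
MIN_DB, MAX_DB = -45.0, 0.0


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_gains_db(n_speakers, rng=random):
    """Draw one gain in dB per speaker (n_speakers >= 1)."""
    first = clip(rng.normalvariate(FIRST_MEAN_DB, FIRST_STD_DB), MIN_DB, MAX_DB)
    gains = [first]
    for _ in range(n_speakers - 1):
        offset = rng.normalvariate(OFFSET_MEAN_DB, OFFSET_STD_DB)
        # Later speakers are offset from the first speaker, then clipped.
        gains.append(clip(first + offset, MIN_DB, MAX_DB))
    return gains


rng = random.Random(0)  # seeded only so the sketch is repeatable
gains = sample_gains_db(3, rng)
print([round(g, 2) for g in gains])
```

Note that every additional speaker is offset from the first speaker's gain, not from the previous speaker's.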

Usage

Apply when using dynamic mixing augmentation in separation recipes (WSJ0Mix, LibriMix, Aishell1Mix, BinauralWSJ0Mix). These parameters are hardcoded in the dynamic_mixing.py scripts.

The Insight (Rule of Thumb)

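Action: Draw the first speaker's gain from N(mean = -27.43, std = 2.57) dB and clip it to [-45, 0] dB. For each additional speaker, add an offset drawn from N(mean = -2.51, std = 2.66) dB to the first speaker's gain and clip the result to [-45, 0] dB. After mixing, apply peak normalization with factor 0.9.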
Value: First speaker: mean=-27.43 dB, std=2.57 dB. Relative offset: mean=-2.51 dB, std=2.66 dB. Peak headroom: 0.9.
Trade-off: Incorrect loudness statistics create unrealistic mixtures that degrade separation performance. The 0.9 headroom factor prevents clipping at the cost of about 0.9 dB of peak level (20·log10(0.9) ≈ -0.92 dB).
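
To make the decibel figures above more concrete, the following sketch converts them to linear amplitude ratios. It assumes the usual amplitude convention of 20·log10; the helper names `db_to_amplitude` and `amplitude_to_db` are defined here for illustration.

```python
import math


def db_to_amplitude(level_db):
    """Convert decibels to a linear amplitude ratio (20*log10 convention)."""
    return 10.0 ** (level_db / 20.0)


def amplitude_to_db(ratio):
    """Convert a linear amplitude ratio to decibels."""
    return 20.0 * math.log10(ratio)


first_mean_linear = db_to_amplitude(-27.43)  # about 0.0425
offset_mean_linear = db_to_amplitude(-2.51)  # about 0.75
headroom_db = amplitude_to_db(0.9)           # about -0.92 dB

print(first_mean_linear, offset_mean_linear, headroom_db)
```

Under this convention, the mean offset of -2.51 dB corresponds to an additional speaker at roughly three quarters of the first speaker's amplitude.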

Reasoning

These values (mean -27.43 dB and standard deviation 2.57 dB for the first speaker; mean offset -2.51 dB and standard deviation 2.66 dB for additional speakers) were derived from the loudness statistics of the WSJ0Mix dataset. The mean offset of -2.51 dB means that each additional speaker is, on average, about 2.5 dB quieter than the first. Because the standard deviation of the offset is 2.66 dB, an individual draw can still make a later speaker the louder one. The 0.9 peak factor (rather than 1.0) keeps every sample below full scale, avoiding digital clipping artifacts that would confuse the separation model.
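
As a rough check on what the offset distribution implies, the sketch below estimates how often an additional speaker ends up louder than the first. It treats the offset as an unclipped Gaussian, so the [-45, 0] dB clipping is ignored; the variable names are illustrative.

```python
from statistics import NormalDist

# Relative offset distribution quoted in the text (dB).
offset = NormalDist(mu=-2.51, sigma=2.66)

# Probability that the offset is positive, i.e. the later speaker is louder.
p_later_louder = 1.0 - offset.cdf(0.0)

# Interval covering about 95% of the offsets (mean +/- 2 standard deviations).
low_db = offset.mean - 2 * offset.stdev
high_db = offset.mean + 2 * offset.stdev

print(f"P(later speaker louder) = {p_later_louder:.3f}")
print(f"about 95% of offsets fall in [{low_db:.2f}, {high_db:.2f}] dB")
```

Under these assumptions the later speaker is louder in roughly 17% of draws, so the mixtures cover both level orderings rather than always placing the first speaker on top.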

Code from `recipes/WSJ0Mix/separation/dynamic_mixing.py:151-185`:

if i == 0:
    gain = np.clip(random.normalvariate(-27.43, 2.57), -45, 0)
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")
    first_lvl = gain
else:
    gain = np.clip(
        first_lvl + random.normalvariate(-2.51, 2.66), -45, 0
    )
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")

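# ... (lines omitted here: the rescaled sources are stacked into `sources` and summed into `mixture`)

# Peak normalization with headroom: the largest absolute sample is scaled to 0.9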
max_amp = max(
    torch.abs(mixture).max().item(),
    *[x.item() for x in torch.abs(sources).max(dim=-1)[0]],
)
mix_scaling = 1 / max_amp * 0.9
sources = mix_scaling * sources
mixture = mix_scaling * mixture
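
The peak normalization step in the excerpt above can be reproduced on plain Python lists, without torch. This is a sketch for illustration only: `peak_normalize` is a hypothetical helper, and the zero-amplitude guard is an addition that does not appear in the excerpt.

```python
def peak_normalize(sources, headroom=0.9):
    """Scale sources and their sum so the largest absolute sample equals headroom.

    sources: non-empty list of equal-length lists of float samples,
    one list per speaker. Returns (scaled_sources, scaled_mixture).
    """
    mixture = [sum(samples) for samples in zip(*sources)]
    # Largest absolute sample across the mixture and every individual source.
    max_amp = max(
        max(abs(x) for x in mixture),
        *(max(abs(x) for x in src) for src in sources),
    )
    if max_amp == 0.0:
        # Guard added for this sketch; the excerpt has no such check.
        return [list(src) for src in sources], mixture
    scale = headroom / max_amp
    scaled_sources = [[scale * x for x in src] for src in sources]
    scaled_mixture = [scale * x for x in mixture]
    return scaled_sources, scaled_mixture


sources = [[0.5, -0.8, 0.1], [0.4, -0.6, 0.3]]
scaled_sources, scaled_mixture = peak_normalize(sources)
print(max(abs(x) for x in scaled_mixture))
```

Because a single factor is applied to everything, the level differences between speakers set by the sampled gains are preserved, and the scaled mixture remains the sum of the scaled sources.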
