Heuristic:Speechbrain Speechbrain Dynamic Mixing Parameters

From Leeroopedia





Knowledge Sources
Domains: Speech_Separation, Data_Augmentation
Last Updated: 2026-02-09 20:00 GMT

Overview

Empirically derived loudness normalization parameters for dynamic mixing augmentation, chosen to match the WSJ0Mix dataset distribution, together with a 0.9 peak headroom factor.

Description

Dynamic mixing augmentation creates speech mixtures on the fly from individual source utterances. For training to be effective, the loudness levels should match the distribution of the original training dataset (WSJ0Mix). The first speaker's gain is drawn from a Gaussian distribution with mean -27.43 dB and standard deviation 2.57 dB, written N(-27.43, 2.57). Each subsequent speaker's gain is the first speaker's gain plus an offset drawn from N(-2.51, 2.66) dB. In the recipe code, both gains are clipped to [-45, 0] dB. After mixing, the mixture and its sources are scaled by a common factor so that the largest absolute sample equals 0.9, which prevents digital clipping.
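The sampling scheme described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the recipe code: the helper names `clip` and `sample_speaker_levels_db` are hypothetical, and the [-45, 0] dB clipping range is taken from the recipe excerpt quoted later on this page.

```python
import random

# Constants quoted on this page (all in dB).
FIRST_MEAN_DB, FIRST_STD_DB = -27.43, 2.57
OFFSET_MEAN_DB, OFFSET_STD_DB = -2.51, 2.66
MIN_DB, MAX_DB = -45.0, 0.0


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_speaker_levels_db(n_speakers, rng=random):
    """Hypothetical helper: draw one level in dB per speaker.

    The first speaker's level is drawn directly. Every later speaker is
    the first level plus a random offset. Each result is clipped.
    """
    if n_speakers < 1:
        raise ValueError("n_speakers must be at least 1")
    first = clip(rng.normalvariate(FIRST_MEAN_DB, FIRST_STD_DB), MIN_DB, MAX_DB)
    levels = [first]
    for _ in range(n_speakers - 1):
        offset = rng.normalvariate(OFFSET_MEAN_DB, OFFSET_STD_DB)
        levels.append(clip(first + offset, MIN_DB, MAX_DB))
    return levels


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(sample_speaker_levels_db(3, rng))
```

Note that every additional speaker is offset from the first speaker's level, not from the previous speaker's, so the offsets do not accumulate as more speakers are added.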

Usage

Apply when using dynamic mixing augmentation in separation recipes (WSJ0Mix, LibriMix, Aishell1Mix, BinauralWSJ0Mix). These parameters are hardcoded in the dynamic_mixing.py scripts.

The Insight (Rule of Thumb)

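  • Action: Draw the first speaker's gain from N(-27.43, 2.57) dB and clip it to [-45, 0] dB. For each additional speaker, add an offset drawn from N(-2.51, 2.66) dB to the first speaker's gain and clip the sum to the same range. After mixing, apply peak normalization with factor 0.9.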
  • Value: First speaker: mean=-27.43 dB, std=2.57 dB. Relative offset: mean=-2.51 dB, std=2.66 dB. Peak headroom: 0.9.
  • Trade-off: Incorrect loudness statistics produce unrealistic mixtures that degrade separation performance. The 0.9 headroom prevents clipping at the cost of roughly 0.9 dB of peak level.
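The peak normalization step can be illustrated with a small dependency-free sketch. It assumes the mixture is the plain sample-wise sum of the sources and uses Python lists instead of the tensors the recipe operates on. The helper name `peak_normalize` and its `headroom` parameter are hypothetical.

```python
def peak_normalize(sources, headroom=0.9):
    """Hypothetical helper: scale sources and their sum to a shared peak.

    sources: list of equal-length lists of float samples.
    Returns (scaled_sources, scaled_mixture).
    """
    # Assumption: the mixture is the sample-wise sum of all sources.
    mixture = [sum(samples) for samples in zip(*sources)]
    # Largest absolute sample across the mixture and every source.
    max_amp = max(abs(x) for signal in [mixture, *sources] for x in signal)
    if max_amp == 0:
        raise ValueError("cannot peak-normalize an all-zero signal")
    scale = headroom / max_amp
    # One common factor keeps the mixture equal to the sum of its sources.
    scaled_sources = [[scale * x for x in src] for src in sources]
    scaled_mixture = [scale * x for x in mixture]
    return scaled_sources, scaled_mixture


sources = [[0.5, -1.0, 0.25], [0.5, -0.5, 0.25]]
scaled_sources, scaled_mixture = peak_normalize(sources)
print(max(abs(x) for x in scaled_mixture))  # close to 0.9, up to float rounding
```

Scaling the mixture and the sources by the same factor matters for separation training: the targets stay consistent with the input, since the scaled sources still sum to the scaled mixture.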

Reasoning

These values (mean -27.43 dB and std 2.57 dB for the first speaker; mean offset -2.51 dB and std 2.66 dB for additional speakers) were derived from the loudness statistics of the WSJ0 dataset. The mean offset of -2.51 dB means that each additional speaker is, on average, slightly quieter than the first, which is realistic for conversational scenarios. The 0.9 peak headroom factor (rather than 1.0) keeps samples below full scale, preventing digital clipping artifacts that would confuse the separation model.
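To put these numbers in linear terms, the following sketch converts between dB and amplitude ratios. It assumes the usual amplitude convention (20·log10), which this page does not state explicitly. The function names are hypothetical.

```python
import math


def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)


def amplitude_ratio_to_db(ratio):
    """Convert a linear amplitude ratio to a level difference in dB."""
    return 20 * math.log10(ratio)


# Mean offset of additional speakers relative to the first speaker.
print(round(db_to_amplitude_ratio(-2.51), 3))  # 0.749
# Peak headroom factor expressed in dB.
print(round(amplitude_ratio_to_db(0.9), 3))    # -0.915
```

Under that assumption, an additional speaker has on average about three quarters of the first speaker's amplitude, and the 0.9 headroom factor leaves a little under 1 dB below full scale.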

Code from `recipes/WSJ0Mix/separation/dynamic_mixing.py:151-185`:

if i == 0:
    gain = np.clip(random.normalvariate(-27.43, 2.57), -45, 0)
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")
    first_lvl = gain
else:
    gain = np.clip(
        first_lvl + random.normalvariate(-2.51, 2.66), -45, 0
    )
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")

# Peak normalization with headroom
max_amp = max(
    torch.abs(mixture).max().item(),
    *[x.item() for x in torch.abs(sources).max(dim=-1)[0]],
)
mix_scaling = 1 / max_amp * 0.9
sources = mix_scaling * sources
mixture = mix_scaling * mixture
