Heuristic:Speechbrain Speechbrain Dynamic Mixing Parameters
| Knowledge Sources | |
|---|---|
| Domains | Speech_Separation, Data_Augmentation |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Empirically derived loudness normalization parameters for dynamic mixing augmentation, chosen to match the WSJ0Mix dataset distribution, together with a 0.9 peak headroom factor.
Description
Dynamic mixing augmentation creates speech mixtures on the fly from individual source utterances. For training to be effective, the loudness levels must match the distribution of the original training dataset (WSJ0Mix). The first speaker's level is drawn from a Gaussian with mean -27.43 dB and standard deviation 2.57 dB, written N(-27.43, 2.57), and clipped to [-45, 0] dB. Each subsequent speaker's level is the first speaker's level plus an offset drawn from N(-2.51, 2.66) dB, clipped to the same range. After mixing, the mixture and all sources are scaled by one common factor so that the largest absolute sample value is 0.9, which prevents digital clipping.
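The sampling scheme described above can be sketched with the standard library alone. This is an illustrative sketch rather than the recipe's code: the function name `sample_levels_db` and the constant names are hypothetical, and the [-45, 0] dB clipping range is taken from the recipe excerpt quoted at the end of this page.

```python
import random

# Parameters from this page (all values in dB).
FIRST_MEAN_DB, FIRST_STD_DB = -27.43, 2.57   # first speaker
REL_MEAN_DB, REL_STD_DB = -2.51, 2.66        # offset relative to first speaker
MIN_DB, MAX_DB = -45.0, 0.0                  # clipping range


def sample_levels_db(num_speakers, rng=None):
    """Draw one clipped loudness level (dB) per speaker (hypothetical helper)."""
    rng = rng or random.Random()
    levels = []
    for i in range(num_speakers):
        if i == 0:
            level = rng.normalvariate(FIRST_MEAN_DB, FIRST_STD_DB)
        else:
            # Later speakers are offset from the (already clipped) first level.
            level = levels[0] + rng.normalvariate(REL_MEAN_DB, REL_STD_DB)
        levels.append(min(max(level, MIN_DB), MAX_DB))
    return levels


if __name__ == "__main__":
    print(sample_levels_db(3, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which can help when debugging a data pipeline.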
Usage
Apply when using dynamic mixing augmentation in separation recipes (WSJ0Mix, LibriMix, Aishell1Mix, BinauralWSJ0Mix). These parameters are hardcoded in the dynamic_mixing.py scripts.
The Insight (Rule of Thumb)
- Action: For the first speaker, use gain = N(-27.43, 2.57) dB clipped to [-45, 0] dB. For each additional speaker, use the first speaker's gain plus N(-2.51, 2.66) dB, clipped to the same [-45, 0] dB range. Then apply peak normalization with factor 0.9 to the mixture and sources.
- Value: First speaker: mean=-27.43 dB, std=2.57 dB. Relative offset: mean=-2.51 dB, std=2.66 dB. Peak headroom: 0.9.
- Trade-off: Using incorrect loudness statistics creates unrealistic mixtures that degrade separation performance. The 0.9 headroom prevents clipping at the cost of a peak level about 0.9 dB below full scale.
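To get a feel for these numbers, the decibel values can be converted to linear amplitude factors with the standard relation amplitude = 10^(dB / 20). The helper names `db_to_amplitude` and `amplitude_to_db` are hypothetical and used only for this illustration; how the recipe's `rescale` function applies the level internally is not shown here.

```python
import math


def db_to_amplitude(level_db):
    """Convert a level in dB to a linear amplitude factor (hypothetical helper)."""
    return 10.0 ** (level_db / 20.0)


def amplitude_to_db(amplitude):
    """Convert a positive linear amplitude factor to dB (hypothetical helper)."""
    return 20.0 * math.log10(amplitude)


if __name__ == "__main__":
    # Mean first-speaker level as a linear amplitude: about 0.0425.
    print(f"{db_to_amplitude(-27.43):.4f}")
    # Mean relative offset: additional speakers at about 0.75x the amplitude.
    print(f"{db_to_amplitude(-2.51):.4f}")
    # The 0.9 headroom factor expressed in dB: about -0.92 dB.
    print(f"{amplitude_to_db(0.9):.2f}")
```

In other words, the mean offset of -2.51 dB corresponds to additional speakers having roughly three quarters of the first speaker's amplitude.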
Reasoning
These values (mean -27.43 dB and std 2.57 dB for the first speaker; mean -2.51 dB and std 2.66 dB for the relative offset) were derived from the loudness statistics of the WSJ0Mix dataset, so that dynamically generated mixtures follow the same level distribution as the original data. The negative mean offset means that additional speakers are, on average, about 2.5 dB quieter than the first. The 0.9 peak headroom factor (rather than 1.0) keeps every sample strictly below full scale, preventing digital clipping artifacts that would confuse the separation model.
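The peak normalization step can be illustrated on plain Python lists. This is a minimal sketch under stated assumptions, not the recipe's implementation: the function name `peak_normalize` is hypothetical, signals are lists of floats rather than tensors, and the guard for an all-zero input is an addition that the original excerpt does not contain.

```python
def peak_normalize(mixture, sources, headroom=0.9):
    """Scale mixture and sources by one common factor so the largest
    absolute sample across all of them equals `headroom` (hypothetical helper)."""
    max_amp = max(
        max(abs(x) for x in mixture),
        *(max(abs(x) for x in src) for src in sources),
    )
    if max_amp == 0.0:
        # Assumption: leave silent input unchanged instead of dividing by zero.
        return mixture, sources
    scale = headroom / max_amp
    scaled_mixture = [scale * x for x in mixture]
    scaled_sources = [[scale * x for x in src] for src in sources]
    return scaled_mixture, scaled_sources


if __name__ == "__main__":
    src_a = [0.25, -1.0, 0.5]
    src_b = [0.25, -1.0, 0.5]
    mix = [a + b for a, b in zip(src_a, src_b)]  # peaks at -2.0, would clip
    mix_n, srcs_n = peak_normalize(mix, [src_a, src_b])
    print(max(abs(x) for x in mix_n))
```

Using one common factor for the mixture and the sources preserves both the relative levels between speakers and the property that the scaled sources still sum to the scaled mixture.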
Code from `recipes/WSJ0Mix/separation/dynamic_mixing.py:151-185`:
if i == 0:
    gain = np.clip(random.normalvariate(-27.43, 2.57), -45, 0)
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")
    first_lvl = gain
else:
    gain = np.clip(
        first_lvl + random.normalvariate(-2.51, 2.66), -45, 0
    )
    tmp = rescale(tmp, torch.tensor(len(tmp)), gain, scale="dB")
# Peak normalization with headroom
max_amp = max(
    torch.abs(mixture).max().item(),
    *[x.item() for x in torch.abs(sources).max(dim=-1)[0]],
)
mix_scaling = 1 / max_amp * 0.9
sources = mix_scaling * sources
mixture = mix_scaling * mixture