Implementation:Speechbrain Speechbrain Dynamic Mix Data Prep
| Field | Value |
|---|---|
| Implementation Name | Dynamic_Mix_Data_Prep |
| API | `dynamic_mix_data_prep_librimix(hparams)` |
| Source | `recipes/LibriMix/separation/dynamic_mixing.py:L83-234` |
| Import | `from dynamic_mixing import dynamic_mix_data_prep_librimix` |
| Type | API Doc |
| Related Principle | Principle:Speechbrain_Speechbrain_Dynamic_Mixing_Augmentation |
Purpose
The dynamic_mix_data_prep_librimix function creates a PyTorch DataLoader that generates speech mixtures on-the-fly during training. Instead of loading pre-mixed audio files, it randomly selects individual speaker utterances, normalizes their loudness, and sums them to create novel mixtures at each iteration.
Function Signature
```python
def dynamic_mix_data_prep_librimix(hparams):
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `hparams` | dict | Configuration dictionary containing all required keys (see below) |
Required Keys in hparams
| Key | Type | Description |
|---|---|---|
| `train_data` | str | Path to the training CSV manifest file (used to define epoch length) |
| `data_folder` | str | Path to the LibriMix data folder (used for WHAM! noise paths) |
| `base_folder_dm` | str | Path to the clean single-speaker audio directory (e.g., LibriSpeech train-clean-360, pre-processed to the target sample rate) |
| `sample_rate` | int | Target sample rate (e.g., 8000 or 16000) |
| `num_spks` | int | Number of speakers per mixture (2 or 3) |
| `training_signal_len` | int | Maximum length of training signals, in samples |
| `dataloader_opts` | dict | Dictionary with keys `batch_size` and `num_workers` |
| `use_wham_noise` | bool | Whether to add WHAM! noise to the mixtures |
Inputs
- Single-speaker audio files: WAV files organized in the LibriSpeech directory structure under `base_folder_dm` (`{speaker_id}/{chapter_id}/{utterance}.wav`)
- WHAM! noise files (optional): WAV files in the LibriMix noise directory
- Training CSV manifest: used to define the number of examples per epoch (the audio paths in the CSV are not used for mixing)
Outputs
Returns a torch.utils.data.DataLoader that yields PaddedBatch objects with the following keys:
| Key | Type | Description |
|---|---|---|
| `id` | list[str] | Utterance identifiers (from the original CSV) |
| `mix_sig` | (Tensor, Tensor) | Dynamically mixed signal and lengths |
| `s1_sig` | (Tensor, Tensor) | First speaker source signal and lengths |
| `s2_sig` | (Tensor, Tensor) | Second speaker source signal and lengths |
| `s3_sig` | (Tensor, NoneType) | Third speaker source (or None in 2-speaker mode) |
| `noise_sig` | (Tensor, NoneType) | Noise signal (or None when WHAM! noise is not used) |
Internal Architecture
Speaker Hashtable Construction
The function first builds a speaker-to-utterance lookup table via build_spk_hashtable_librimix(hparams):
```python
spk_hashtable, spk_weights = build_spk_hashtable_librimix(hparams)
```
This scans all WAV files under base_folder_dm, extracts speaker IDs from the directory path, and creates a dictionary mapping each speaker ID to a list of their utterance file paths. Speaker weights are proportional to the number of utterances per speaker.
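The lookup logic can be sketched as follows. This is a simplified, hypothetical re-implementation (the real helper lives in `dynamic_mixing.py`), assuming only the LibriSpeech layout `{speaker_id}/{chapter_id}/{utterance}.wav` described above:

```python
import glob
import os

def build_spk_hashtable_sketch(base_folder_dm):
    """Map speaker IDs to their utterance paths and compute sampling weights.

    Hypothetical sketch: the speaker ID is taken as the first directory
    component below base_folder_dm, per the LibriSpeech layout.
    """
    spk_hashtable = {}
    # Scan every WAV file under the clean-speech root.
    pattern = os.path.join(base_folder_dm, "**", "*.wav")
    for wav in glob.glob(pattern, recursive=True):
        spk_id = os.path.relpath(wav, base_folder_dm).split(os.sep)[0]
        spk_hashtable.setdefault(spk_id, []).append(wav)
    # Weight each speaker by its share of the total utterance count.
    total = sum(len(utts) for utts in spk_hashtable.values())
    spk_weights = [len(spk_hashtable[s]) / total for s in spk_hashtable]
    return spk_hashtable, spk_weights
```

Weighting by utterance count means speakers with more recorded material are sampled proportionally more often, which keeps utterance-level coverage roughly uniform.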
Dynamic Audio Pipeline
The core mixing logic is implemented as a SpeechBrain dynamic audio pipeline:
```python
@sb.utils.data_pipeline.takes("mix_wav")
@sb.utils.data_pipeline.provides(
    "mix_sig", "s1_sig", "s2_sig", "s3_sig", "noise_sig"
)
def audio_pipeline(mix_wav):
    # 1. Randomly select speakers (weighted by utterance count)
    speakers = np.random.choice(
        spk_list, hparams["num_spks"], replace=False, p=spk_weights
    )
    # 2. Select one random utterance per speaker
    spk_files = [
        np.random.choice(spk_hashtable[spk], 1, False)[0]
        for spk in speakers
    ]
    # 3. Determine minimum length
    minlen = min(
        *[torchaudio.info(x).num_frames for x in spk_files],
        hparams["training_signal_len"],
    )
    # 4. Normalize loudness and create mixture
    # ... (loudness normalization via pyloudnorm)
```
Loudness Normalization
Each source is normalized using the pyloudnorm library (ITU-R BS.1770):
```python
meter = pyloudnorm.Meter(hparams["sample_rate"])

MAX_AMP = 0.9
MIN_LOUDNESS = -33
MAX_LOUDNESS = -25

def normalize(signal, is_noise=False):
    c_loudness = meter.integrated_loudness(signal)
    if is_noise:
        target_loudness = random.uniform(MIN_LOUDNESS - 5, MAX_LOUDNESS - 5)
    else:
        target_loudness = random.uniform(MIN_LOUDNESS, MAX_LOUDNESS)
    signal = pyloudnorm.normalize.loudness(signal, c_loudness, target_loudness)
    if np.max(np.abs(signal)) >= 1:
        signal = signal * MAX_AMP / np.max(np.abs(signal))
    return torch.from_numpy(signal)
```
DataLoader Construction
The function wraps the dataset in a standard PyTorch DataLoader with PaddedBatch collation and independent worker seeding:
```python
train_data = torch.utils.data.DataLoader(
    train_data,
    batch_size=hparams["dataloader_opts"]["batch_size"],
    num_workers=hparams["dataloader_opts"]["num_workers"],
    collate_fn=PaddedBatch,
    worker_init_fn=lambda x: np.random.seed(
        int.from_bytes(os.urandom(4), "little") + x
    ),
)
```
The worker_init_fn ensures each worker process uses a different random seed derived from system entropy, preventing duplicate mixtures across workers.
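The same seeding scheme can be written as a standalone function for clarity. This is a hypothetical sketch, not the recipe's code; the `% 2**32` wrap is an addition here, since `np.random.seed` only accepts values below 2**32 and the raw 4-byte draw plus a worker offset could in principle exceed that:

```python
import os

import numpy as np

def seed_worker(worker_id):
    """Seed a DataLoader worker from fresh OS entropy plus its worker id.

    Hypothetical helper mirroring the worker_init_fn lambda: each worker
    draws 4 new bytes of system entropy, so workers do not share a NumPy
    RNG state (which would make them emit identical mixtures).
    """
    seed = (int.from_bytes(os.urandom(4), "little") + worker_id) % 2**32
    np.random.seed(seed)
    return seed
```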
Usage Example
```python
from dynamic_mixing import dynamic_mix_data_prep_librimix

dm_hparams = {
    "train_data": "/output/save/libri2mix_train-360.csv",
    "data_folder": "/data/Libri2Mix",
    "base_folder_dm": "/data/LibriSpeech/train-clean-360_processed",
    "sample_rate": 8000,
    "num_spks": 2,
    "training_signal_len": 32000000,
    "dataloader_opts": {"batch_size": 1, "num_workers": 4},
    "use_wham_noise": False,
}

train_loader = dynamic_mix_data_prep_librimix(dm_hparams)

for batch in train_loader:
    mix = batch.mix_sig  # Dynamically created mixture
    s1 = batch.s1_sig    # First source
    s2 = batch.s2_sig    # Second source
    break
```
Key Implementation Details
- The `mix_wav` input to the pipeline is a dummy: the actual mixing is performed dynamically, and the CSV row only serves to define the epoch length
- Sources are randomly cropped to `minlen` (the minimum of all selected utterance lengths and `training_signal_len`), each with a random start offset
- After summing the sources, if the mixture peak exceeds `MAX_AMP` (0.9), both the mixture and all sources are scaled down by the same factor to maintain signal consistency
- In 2-speaker mode, `s3_sig` yields `None`; when WHAM! noise is not used, `noise_sig` yields `None`
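The crop/sum/rescale steps can be sketched as follows. This is a hedged NumPy illustration (the recipe works on Torch tensors), assuming the sources are 1-D arrays that have already been loudness-normalized; `mix_sources` is a hypothetical name:

```python
import numpy as np

MAX_AMP = 0.9

def mix_sources(sources, training_signal_len, rng=np.random):
    """Crop, sum, and jointly rescale sources, as described above (sketch)."""
    # Crop every source to the shortest utterance, capped by
    # training_signal_len, each with its own random start offset.
    minlen = min(min(len(s) for s in sources), training_signal_len)
    cropped = []
    for s in sources:
        start = rng.randint(0, len(s) - minlen + 1)
        cropped.append(s[start : start + minlen])
    mixture = np.sum(cropped, axis=0)
    # If the sum clips, rescale the mixture AND every source by the same
    # factor, so the mixture stays the exact sum of the returned sources.
    peak = np.max(np.abs(mixture))
    if peak > MAX_AMP:
        gain = MAX_AMP / peak
        mixture = mixture * gain
        cropped = [c * gain for c in cropped]
    return mixture, cropped
```

Rescaling all signals by one shared factor is what keeps the supervision consistent: the separation targets remain an exact additive decomposition of the mixture.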
Dependencies
- `pyloudnorm`: ITU-R BS.1770 loudness normalization
- `torchaudio`: audio loading and metadata inspection
- `speechbrain.dataio.batch.PaddedBatch`: batch collation with padding
Source File
recipes/LibriMix/separation/dynamic_mixing.py
See Also
- Principle:Speechbrain_Speechbrain_Dynamic_Mixing_Augmentation
- Implementation:Speechbrain_Speechbrain_Prepare_Librimix