
Implementation:Speechbrain Speechbrain Dynamic Mix Data Prep

From Leeroopedia


Field Value
Implementation Name Dynamic_Mix_Data_Prep
API dynamic_mix_data_prep_librimix(hparams)
Source recipes/LibriMix/separation/dynamic_mixing.py:L83-234
Import from dynamic_mixing import dynamic_mix_data_prep_librimix
Type API Doc
Related Principle Principle:Speechbrain_Speechbrain_Dynamic_Mixing_Augmentation

Purpose

The dynamic_mix_data_prep_librimix function creates a PyTorch DataLoader that generates speech mixtures on-the-fly during training. Instead of loading pre-mixed audio files, it randomly selects individual speaker utterances, normalizes their loudness, and sums them to create novel mixtures at each iteration.

Function Signature

def dynamic_mix_data_prep_librimix(hparams):

Parameters

Parameter Type Description
hparams dict Configuration dictionary containing all required keys (see below)

Required Keys in hparams

Key Type Description
train_data str Path to the training CSV manifest file (used to define epoch length)
data_folder str Path to the LibriMix data folder (used for WHAM! noise paths)
base_folder_dm str Path to the clean single-speaker audio directory (e.g., LibriSpeech train-clean-360, pre-processed to target sample rate)
sample_rate int Target sample rate (e.g., 8000 or 16000)
num_spks int Number of speakers per mixture (2 or 3)
training_signal_len int Maximum length of training signals in samples
dataloader_opts dict Dictionary with keys batch_size and num_workers
use_wham_noise bool Whether to add WHAM! noise to the mixtures

Inputs

  • Single-speaker audio files: WAV files organized in LibriSpeech directory structure under base_folder_dm ({speaker_id}/{chapter_id}/{utterance}.wav)
  • WHAM! noise files (optional): WAV files in the LibriMix noise directory
  • Training CSV manifest: Used to define the number of examples per epoch (the actual audio paths in the CSV are not used for mixing)

Outputs

Returns a torch.utils.data.DataLoader that yields PaddedBatch objects with the following keys:

Key Type Description
id list[str] Utterance identifiers (from original CSV)
mix_sig (Tensor, Tensor) Dynamically mixed signal and lengths
s1_sig (Tensor, Tensor) First speaker source signal and lengths
s2_sig (Tensor, Tensor) Second speaker source signal and lengths
s3_sig (Tensor, NoneType) Third speaker source (or None for 2-speaker)
noise_sig (Tensor, NoneType) Noise signal (or None if not using WHAM!)

Internal Architecture

Speaker Hashtable Construction

The function first builds a speaker-to-utterance lookup table via build_spk_hashtable_librimix(hparams):

spk_hashtable, spk_weights = build_spk_hashtable_librimix(hparams)

This scans all WAV files under base_folder_dm, extracts speaker IDs from the directory path, and creates a dictionary mapping each speaker ID to a list of their utterance file paths. Speaker weights are proportional to the number of utterances per speaker.
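The scan-and-weight step can be sketched in plain Python. This is an illustrative reconstruction, not the actual `build_spk_hashtable_librimix` source; the function name `build_spk_hashtable_sketch` is hypothetical, and it assumes the LibriSpeech layout `{speaker_id}/{chapter_id}/{utterance}.wav` stated above:

```python
import os
from collections import defaultdict

def build_spk_hashtable_sketch(base_folder_dm):
    """Illustrative sketch: map speaker IDs to utterance paths and
    compute utterance-count-proportional sampling weights."""
    spk_hashtable = defaultdict(list)
    for root, _, files in os.walk(base_folder_dm):
        for f in files:
            if f.endswith(".wav"):
                path = os.path.join(root, f)
                # Speaker ID is the grandparent directory:
                # {speaker_id}/{chapter_id}/{utterance}.wav
                spk_id = os.path.normpath(path).split(os.sep)[-3]
                spk_hashtable[spk_id].append(path)
    total = sum(len(utts) for utts in spk_hashtable.values())
    spk_weights = {
        spk: len(utts) / total for spk, utts in spk_hashtable.items()
    }
    return dict(spk_hashtable), spk_weights
```

Weighting speakers by utterance count means speakers with more recorded material are sampled proportionally more often.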

Dynamic Audio Pipeline

The core mixing logic is implemented as a SpeechBrain dynamic audio pipeline:

@sb.utils.data_pipeline.takes("mix_wav")
@sb.utils.data_pipeline.provides(
    "mix_sig", "s1_sig", "s2_sig", "s3_sig", "noise_sig"
)
def audio_pipeline(mix_wav):
    # 1. Randomly select speakers (weighted by utterance count)
    speakers = np.random.choice(
        spk_list, hparams["num_spks"], replace=False, p=spk_weights
    )

    # 2. Select one random utterance per speaker
    spk_files = [
        np.random.choice(spk_hashtable[spk], 1, False)[0]
        for spk in speakers
    ]

    # 3. Determine minimum length
    minlen = min(
        *[torchaudio.info(x).num_frames for x in spk_files],
        hparams["training_signal_len"],
    )

    # 4. Normalize loudness and create mixture
    # ... (loudness normalization via pyloudnorm)
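The elided steps crop each selected utterance to minlen with a random start offset and sum the results into a mixture. A minimal NumPy sketch of that crop-and-sum stage, with the pyloudnorm normalization deliberately omitted (the function name `crop_and_mix` is hypothetical):

```python
import numpy as np

def crop_and_mix(sources, minlen, rng=None):
    """Randomly crop each source to minlen samples, then sum the
    crops into a mixture. Loudness normalization is omitted here."""
    rng = rng or np.random.default_rng()
    cropped = []
    for s in sources:
        # Random start offset so different crops are seen each epoch
        start = rng.integers(0, len(s) - minlen + 1)
        cropped.append(s[start:start + minlen])
    mixture = np.sum(cropped, axis=0)
    return mixture, cropped
```

Because minlen is the minimum over all selected utterances (capped at training_signal_len), every crop is guaranteed to fit inside its source.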

Loudness Normalization

Each source is normalized using the pyloudnorm library (ITU-R BS.1770):

meter = pyloudnorm.Meter(hparams["sample_rate"])
MAX_AMP = 0.9
MIN_LOUDNESS = -33
MAX_LOUDNESS = -25

def normalize(signal, is_noise=False):
    c_loudness = meter.integrated_loudness(signal)
    if is_noise:
        target_loudness = random.uniform(MIN_LOUDNESS - 5, MAX_LOUDNESS - 5)
    else:
        target_loudness = random.uniform(MIN_LOUDNESS, MAX_LOUDNESS)
    signal = pyloudnorm.normalize.loudness(signal, c_loudness, target_loudness)
    if np.max(np.abs(signal)) >= 1:
        signal = signal * MAX_AMP / np.max(np.abs(signal))
    return torch.from_numpy(signal)

DataLoader Construction

The function wraps the dataset in a standard PyTorch DataLoader with PaddedBatch collation and independent worker seeding:

train_data = torch.utils.data.DataLoader(
    train_data,
    batch_size=hparams["dataloader_opts"]["batch_size"],
    num_workers=hparams["dataloader_opts"]["num_workers"],
    collate_fn=PaddedBatch,
    worker_init_fn=lambda x: np.random.seed(
        int.from_bytes(os.urandom(4), "little") + x
    ),
)

The worker_init_fn ensures each worker process uses a different random seed derived from system entropy, preventing duplicate mixtures across workers.
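The seeding scheme can be simulated outside a DataLoader. This hypothetical helper reproduces the same entropy-plus-worker-id construction and then draws a few random indices, as each worker process would; the modulo guard against seed overflow is an addition of this sketch, not part of the recipe:

```python
import os
import numpy as np

def simulated_worker_choice(worker_id, n=5):
    """Mimic the worker_init_fn: seed from system entropy plus the
    worker id, then draw n random indices from that worker's stream."""
    # The % (2**32) guard is added here; NumPy seeds must fit in 32 bits
    seed = (int.from_bytes(os.urandom(4), "little") + worker_id) % (2 ** 32)
    rng = np.random.RandomState(seed)
    return rng.randint(0, 1000, size=n)
```

Without per-worker seeding, forked workers would inherit identical RNG state and emit the same "random" mixtures in lockstep.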

Usage Example

from dynamic_mixing import dynamic_mix_data_prep_librimix

dm_hparams = {
    "train_data": "/output/save/libri2mix_train-360.csv",
    "data_folder": "/data/Libri2Mix",
    "base_folder_dm": "/data/LibriSpeech/train-clean-360_processed",
    "sample_rate": 8000,
    "num_spks": 2,
    "training_signal_len": 32000000,
    "dataloader_opts": {"batch_size": 1, "num_workers": 4},
    "use_wham_noise": False,
}

train_loader = dynamic_mix_data_prep_librimix(dm_hparams)

for batch in train_loader:
    mix = batch.mix_sig       # Dynamically created mixture
    s1 = batch.s1_sig         # First source
    s2 = batch.s2_sig         # Second source
    break

Key Implementation Details

  • The mix_wav input to the pipeline is a placeholder: the actual mixing is performed dynamically, and the CSV is used only to define the epoch length
  • Sources are randomly cropped to minlen (the minimum of all selected utterance lengths and training_signal_len) with a random start offset
  • After summing sources, if the mixture exceeds MAX_AMP (0.9), both the mixture and all sources are scaled down by the same factor to maintain signal consistency
  • For 2-speaker mode, s3_sig yields None; for non-WHAM mode, noise_sig yields None
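The joint rescaling in the third bullet can be sketched as follows; `rescale_mixture_and_sources` is a hypothetical name for the logic, which scales everything by one common factor so the sources still sum exactly to the mixture:

```python
import numpy as np

def rescale_mixture_and_sources(mixture, sources, max_amp=0.9):
    """If the summed mixture exceeds max_amp, scale the mixture and
    every source by the same factor to preserve mix = sum(sources)."""
    peak = np.max(np.abs(mixture))
    if peak > max_amp:
        factor = max_amp / peak
        mixture = mixture * factor
        sources = [s * factor for s in sources]
    return mixture, sources
```

Scaling sources and mixture together is what keeps the separation targets consistent: a model trained to reconstruct the sources from the mixture never sees a mixture that is not the exact sum of its targets.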

Dependencies

  • pyloudnorm: ITU-R BS.1770 loudness normalization
  • torchaudio: Audio loading and metadata inspection
  • speechbrain.dataio.batch.PaddedBatch: Batch collation with padding

Source File

recipes/LibriMix/separation/dynamic_mixing.py
