
Implementation:Speechbrain Speechbrain Dynamic Mix Data Prep

From Leeroopedia


Field Value
Implementation Name Dynamic_Mix_Data_Prep
API dynamic_mix_data_prep_librimix(hparams)
Source recipes/LibriMix/separation/dynamic_mixing.py:L83-234
Import from dynamic_mixing import dynamic_mix_data_prep_librimix
Type API Doc
Related Principle Principle:Speechbrain_Speechbrain_Dynamic_Mixing_Augmentation

Purpose

The dynamic_mix_data_prep_librimix function creates a PyTorch DataLoader that generates speech mixtures on-the-fly during training. Instead of loading pre-mixed audio files, it randomly selects individual speaker utterances, normalizes their loudness, and sums them to create novel mixtures at each iteration.

Function Signature

def dynamic_mix_data_prep_librimix(hparams):

Parameters

Parameter Type Description
hparams dict Configuration dictionary containing all required keys (see below)

Required Keys in hparams

Key Type Description
train_data str Path to the training CSV manifest file (used to define epoch length)
data_folder str Path to the LibriMix data folder (used for WHAM! noise paths)
base_folder_dm str Path to the clean single-speaker audio directory (e.g., LibriSpeech train-clean-360, pre-processed to target sample rate)
sample_rate int Target sample rate (e.g., 8000 or 16000)
num_spks int Number of speakers per mixture (2 or 3)
training_signal_len int Maximum length of training signals in samples
dataloader_opts dict Dictionary with keys batch_size and num_workers
use_wham_noise bool Whether to add WHAM! noise to the mixtures

Inputs

  • Single-speaker audio files: WAV files organized in LibriSpeech directory structure under base_folder_dm ({speaker_id}/{chapter_id}/{utterance}.wav)
  • WHAM! noise files (optional): WAV files in the LibriMix noise directory
  • Training CSV manifest: Used to define the number of examples per epoch (the actual audio paths in the CSV are not used for mixing)

Outputs

Returns a torch.utils.data.DataLoader that yields PaddedBatch objects with the following keys:

Key Type Description
id list[str] Utterance identifiers (from original CSV)
mix_sig (Tensor, Tensor) Dynamically mixed signal and lengths
s1_sig (Tensor, Tensor) First speaker source signal and lengths
s2_sig (Tensor, Tensor) Second speaker source signal and lengths
s3_sig (Tensor, NoneType) Third speaker source (or None for 2-speaker)
noise_sig (Tensor, NoneType) Noise signal (or None if not using WHAM!)

Internal Architecture

Speaker Hashtable Construction

The function first builds a speaker-to-utterance lookup table via build_spk_hashtable_librimix(hparams):

spk_hashtable, spk_weights = build_spk_hashtable_librimix(hparams)

This scans all WAV files under base_folder_dm, extracts speaker IDs from the directory path, and creates a dictionary mapping each speaker ID to a list of their utterance file paths. Speaker weights are proportional to the number of utterances per speaker.
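The scan-and-weight step can be sketched in plain Python. This is an illustrative reconstruction, not the actual `build_spk_hashtable_librimix` source; the function name `build_spk_hashtable_sketch` is hypothetical, and it assumes the LibriSpeech layout `{speaker_id}/{chapter_id}/{utterance}.wav` stated above:

```python
import os
from collections import defaultdict

def build_spk_hashtable_sketch(base_folder_dm):
    """Illustrative sketch: map speaker IDs to utterance paths and
    compute utterance-count-proportional sampling weights."""
    spk_hashtable = defaultdict(list)
    for root, _, files in os.walk(base_folder_dm):
        for f in files:
            if f.endswith(".wav"):
                path = os.path.join(root, f)
                # Speaker ID is the grandparent directory:
                # {speaker_id}/{chapter_id}/{utterance}.wav
                spk_id = os.path.normpath(path).split(os.sep)[-3]
                spk_hashtable[spk_id].append(path)
    total = sum(len(utts) for utts in spk_hashtable.values())
    spk_weights = {
        spk: len(utts) / total for spk, utts in spk_hashtable.items()
    }
    return dict(spk_hashtable), spk_weights
```

Weighting speakers by utterance count means speakers with more recorded material are sampled proportionally more often.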

Dynamic Audio Pipeline

The core mixing logic is implemented as a SpeechBrain dynamic audio pipeline:

@sb.utils.data_pipeline.takes("mix_wav")
@sb.utils.data_pipeline.provides(
    "mix_sig", "s1_sig", "s2_sig", "s3_sig", "noise_sig"
)
def audio_pipeline(mix_wav):
    # 1. Randomly select speakers (weighted by utterance count)
    speakers = np.random.choice(
        spk_list, hparams["num_spks"], replace=False, p=spk_weights
    )

    # 2. Select one random utterance per speaker
    spk_files = [
        np.random.choice(spk_hashtable[spk], 1, False)[0]
        for spk in speakers
    ]

    # 3. Determine minimum length
    minlen = min(
        *[torchaudio.info(x).num_frames for x in spk_files],
        hparams["training_signal_len"],
    )

    # 4. Normalize loudness and create mixture
    # ... (loudness normalization via pyloudnorm)
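The elided steps crop each selected utterance to minlen with a random start offset and sum the results into a mixture. A minimal NumPy sketch of that crop-and-sum stage, with the pyloudnorm normalization deliberately omitted (the function name `crop_and_mix` is hypothetical):

```python
import numpy as np

def crop_and_mix(sources, minlen, rng=None):
    """Randomly crop each source to minlen samples, then sum the
    crops into a mixture. Loudness normalization is omitted here."""
    rng = rng or np.random.default_rng()
    cropped = []
    for s in sources:
        # Random start offset so different crops are seen each epoch
        start = rng.integers(0, len(s) - minlen + 1)
        cropped.append(s[start:start + minlen])
    mixture = np.sum(cropped, axis=0)
    return mixture, cropped
```

Because minlen is the minimum over all selected utterances (capped at training_signal_len), every crop is guaranteed to fit inside its source.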

Loudness Normalization

Each source is normalized using the pyloudnorm library (ITU-R BS.1770):

meter = pyloudnorm.Meter(hparams["sample_rate"])
MAX_AMP = 0.9
MIN_LOUDNESS = -33
MAX_LOUDNESS = -25

def normalize(signal, is_noise=False):
    c_loudness = meter.integrated_loudness(signal)
    if is_noise:
        target_loudness = random.uniform(MIN_LOUDNESS - 5, MAX_LOUDNESS - 5)
    else:
        target_loudness = random.uniform(MIN_LOUDNESS, MAX_LOUDNESS)
    signal = pyloudnorm.normalize.loudness(signal, c_loudness, target_loudness)
    if np.max(np.abs(signal)) >= 1:
        signal = signal * MAX_AMP / np.max(np.abs(signal))
    return torch.from_numpy(signal)

DataLoader Construction

The function wraps the dataset in a standard PyTorch DataLoader with PaddedBatch collation and independent worker seeding:

train_data = torch.utils.data.DataLoader(
    train_data,
    batch_size=hparams["dataloader_opts"]["batch_size"],
    num_workers=hparams["dataloader_opts"]["num_workers"],
    collate_fn=PaddedBatch,
    worker_init_fn=lambda x: np.random.seed(
        int.from_bytes(os.urandom(4), "little") + x
    ),
)

The worker_init_fn ensures each worker process uses a different random seed derived from system entropy, preventing duplicate mixtures across workers.
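The seeding scheme can be simulated outside a DataLoader. This hypothetical helper reproduces the same entropy-plus-worker-id construction and then draws a few random indices, as each worker process would; the modulo guard against seed overflow is an addition of this sketch, not part of the recipe:

```python
import os
import numpy as np

def simulated_worker_choice(worker_id, n=5):
    """Mimic the worker_init_fn: seed from system entropy plus the
    worker id, then draw n random indices from that worker's stream."""
    # The % (2**32) guard is added here; NumPy seeds must fit in 32 bits
    seed = (int.from_bytes(os.urandom(4), "little") + worker_id) % (2 ** 32)
    rng = np.random.RandomState(seed)
    return rng.randint(0, 1000, size=n)
```

Without per-worker seeding, forked workers would inherit identical RNG state and emit the same "random" mixtures in lockstep.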

Usage Example

from dynamic_mixing import dynamic_mix_data_prep_librimix

dm_hparams = {
    "train_data": "/output/save/libri2mix_train-360.csv",
    "data_folder": "/data/Libri2Mix",
    "base_folder_dm": "/data/LibriSpeech/train-clean-360_processed",
    "sample_rate": 8000,
    "num_spks": 2,
    "training_signal_len": 32000000,
    "dataloader_opts": {"batch_size": 1, "num_workers": 4},
    "use_wham_noise": False,
}

train_loader = dynamic_mix_data_prep_librimix(dm_hparams)

for batch in train_loader:
    mix = batch.mix_sig       # Dynamically created mixture
    s1 = batch.s1_sig         # First source
    s2 = batch.s2_sig         # Second source
    break

Key Implementation Details

  • The mix_wav input to the pipeline is a placeholder: the actual mixing is performed dynamically, and the CSV is used only to define the epoch length
  • Sources are randomly cropped to minlen (the minimum of all selected utterance lengths and training_signal_len) with a random start offset
  • After summing sources, if the mixture exceeds MAX_AMP (0.9), both the mixture and all sources are scaled down by the same factor to maintain signal consistency
  • For 2-speaker mode, s3_sig yields None; for non-WHAM mode, noise_sig yields None
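The joint rescaling in the third bullet can be sketched as follows; `rescale_mixture_and_sources` is a hypothetical name for the logic, which scales everything by one common factor so the sources still sum exactly to the mixture:

```python
import numpy as np

def rescale_mixture_and_sources(mixture, sources, max_amp=0.9):
    """If the summed mixture exceeds max_amp, scale the mixture and
    every source by the same factor to preserve mix = sum(sources)."""
    peak = np.max(np.abs(mixture))
    if peak > max_amp:
        factor = max_amp / peak
        mixture = mixture * factor
        sources = [s * factor for s in sources]
    return mixture, sources
```

Scaling sources and mixture together is what keeps the separation targets consistent: a model trained to reconstruct the sources from the mixture never sees a mixture that is not the exact sum of its targets.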

Dependencies

  • pyloudnorm: ITU-R BS.1770 loudness normalization
  • torchaudio: Audio loading and metadata inspection
  • speechbrain.dataio.batch.PaddedBatch: Batch collation with padding

Source File

recipes/LibriMix/separation/dynamic_mixing.py
