Principle:Speechbrain Speechbrain Mixture Dataset Preparation
| Field | Value |
|---|---|
| Principle Name | Mixture_Dataset_Preparation |
| Domain(s) | Data_Engineering, Speech_Separation |
| Description | Creating multi-speaker mixture datasets from clean single-speaker sources for speech separation |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Prepare_Librimix |
Overview
Speech separation training requires carefully constructed paired data: mixtures of multiple speakers talking simultaneously and the corresponding clean single-speaker sources that comprise each mixture. Without this pairing, a supervised separation model has no ground truth signal to learn from. The Mixture Dataset Preparation principle addresses how to generate such training data from existing clean speech corpora.
Theoretical Foundation
The core idea is to take clean single-speaker utterances from a large corpus such as LibriSpeech and combine them at specified signal-to-noise ratios to produce multi-speaker mixtures. The resulting dataset, known as LibriMix, provides:
- Mixture signals: The sum of two or three overlapping speaker waveforms
- Clean source signals: The individual speaker waveforms that were summed
- Optional noise signals: Additive environmental noise from the WHAM! dataset
Each mixture is deterministically constructed so that every training example maps to a known set of clean sources. This is essential for supervised training because the loss function (e.g., SI-SNR) requires both the predicted separated signals and the ground-truth clean signals.
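The actual LibriMix generation scripts operate on real LibriSpeech audio with NumPy; as a minimal pure-Python sketch of the summing step, assuming equal-length lists of float samples and a hypothetical `mix_at_snr` helper:

```python
import math

def mix_at_snr(s1, s2, target_snr_db):
    """Scale s2 relative to s1 so the pair mixes at target_snr_db,
    then sum the two waveforms sample by sample.

    s1, s2: equal-length lists of float samples (hypothetical inputs).
    Returns the mixture plus both ground-truth sources, which is
    exactly the pairing a supervised separation loss needs.
    """
    p1 = sum(x * x for x in s1) / len(s1)   # mean power of speaker 1
    p2 = sum(x * x for x in s2) / len(s2)   # mean power of speaker 2
    # Choose gain g so that 10 * log10(p1 / (g^2 * p2)) == target_snr_db
    gain = math.sqrt(p1 / (p2 * 10 ** (target_snr_db / 10)))
    s2_scaled = [gain * x for x in s2]
    mixture = [a + b for a, b in zip(s1, s2_scaled)]
    return mixture, s1, s2_scaled
```

At 0 dB the two sources end up with equal mean power in the mixture; a three-speaker mix repeats the same scaling for the third source before summing.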
Data Organization
The LibriMix dataset follows a structured directory layout:
```
Libri2Mix/
  wav8k/min/
    train-360/
      mix_clean/   # Clean mixtures (s1 + s2)
      mix_both/    # Noisy mixtures (s1 + s2 + noise)
      s1/          # First speaker source
      s2/          # Second speaker source
      noise/       # WHAM! noise signals
    dev/
      ...
    test/
      ...
```
Two-Speaker vs. Three-Speaker Configurations
LibriMix supports both 2-speaker (Libri2Mix) and 3-speaker (Libri3Mix) configurations. The 3-speaker variant adds an additional source directory (s3/) and increases the complexity of the separation task. The preparation pipeline handles both cases by parameterizing the number of speakers.
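The parameterization can be sketched as a small path-building helper; the function name and return shape below are illustrative, but the directory names mirror the layout shown above:

```python
import os

def librimix_dirs(root, n_src=2, rate="wav8k", mode="min", split="train-360"):
    """Build the expected subdirectory paths for one LibriMix split.

    Hypothetical helper: n_src=2 yields s1/ and s2/; n_src=3 adds s3/,
    so the same code serves Libri2Mix and Libri3Mix.
    """
    base = os.path.join(root, f"Libri{n_src}Mix", rate, mode, split)
    dirs = {name: os.path.join(base, name)
            for name in ("mix_clean", "mix_both", "noise")}
    for i in range(1, n_src + 1):          # s1, s2, (s3)
        dirs[f"s{i}"] = os.path.join(base, f"s{i}")
    return dirs
```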
Sampling Rate Considerations
The dataset can be prepared at different sampling rates (typically 8 kHz or 16 kHz). Lower sampling rates reduce computational cost during training but limit the frequency range of the reconstructed signals. The sampling rate is specified during preparation and must remain consistent throughout the entire training pipeline.
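As a quick illustration of the tradeoff, assuming the usual Nyquist relationship between sampling rate and representable bandwidth:

```python
def nyquist_hz(sample_rate):
    # Highest frequency representable at a given sampling rate
    return sample_rate / 2

def samples_per_utterance(seconds, sample_rate):
    # Storage and compute scale linearly with sample count
    return int(seconds * sample_rate)

# 8 kHz halves both the sample count and the usable bandwidth
# (4 kHz vs 8 kHz) relative to 16 kHz.
```

A model trained on 8 kHz data therefore cannot reconstruct content above 4 kHz, and feeding it 16 kHz audio at inference time without resampling would silently mismatch the training conditions.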
Data Manifest Generation
The preparation step generates CSV manifest files that map each mixture to its constituent sources. These manifests contain:
- ID: Unique identifier for each example
- Duration: Length of the audio segment, in seconds
- mix_wav: Path to the mixture waveform
- s1_wav, s2_wav: Paths to clean source waveforms
- s3_wav (optional): Path to third speaker source for Libri3Mix
- noise_wav: Path to the WHAM! noise signal
These CSV files serve as the interface between raw audio and the SpeechBrain data loading pipeline, allowing efficient batching and shuffling during training.
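A manifest writer for the fields above can be sketched with the standard-library `csv` module; the column order and the `write_manifest` helper are illustrative, and the exact header used by a given recipe may differ:

```python
import csv

def write_manifest(path, examples, n_src=2, with_noise=True):
    """Write a CSV manifest mapping each mixture to its sources.

    examples: list of dicts keyed by the column names below
    (hypothetical schema mirroring the fields listed above).
    """
    cols = ["ID", "duration", "mix_wav"]
    cols += [f"s{i}_wav" for i in range(1, n_src + 1)]
    if with_noise:
        cols.append("noise_wav")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cols)
        writer.writeheader()
        for ex in examples:
            writer.writerow(ex)
```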
Relationship to the Training Pipeline
The mixture dataset preparation is the first step in the speech separation training workflow. The generated CSV manifests are consumed by the DynamicItemDataset class, which lazily loads audio on demand. This decouples data preparation from training, allowing the preparation to run once while training can be repeated with different hyperparameters.
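The lazy-loading pattern can be illustrated with a toy stand-in for `DynamicItemDataset` (the class below is not SpeechBrain's implementation, just a sketch of the idea): only the manifest rows live in memory, and audio is loaded when an example is indexed.

```python
import csv

class LazyManifestDataset:
    """Toy sketch of manifest-driven lazy loading.

    load_fn is a hypothetical audio loader (e.g. a wav reader);
    it is called only when an item is actually requested.
    """
    def __init__(self, csv_path, load_fn):
        with open(csv_path) as f:
            self.rows = list(csv.DictReader(f))   # manifest only, no audio
        self.load_fn = load_fn

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        return {
            "ID": row["ID"],
            "mix": self.load_fn(row["mix_wav"]),          # loaded on demand
            "sources": [self.load_fn(row[k]) for k in row
                        if k.startswith("s") and k.endswith("_wav")],
        }
```

Because the dataset object holds only file paths until indexing, the same prepared manifests can back many training runs with different batching or shuffling strategies.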
Key Considerations
- Reproducibility: Because mixtures are pre-generated, training is fully reproducible given the same dataset version.
- Storage cost: Pre-mixed datasets require storing both the mixtures and all individual sources, roughly tripling storage requirements for 2-speaker mixes.
- Data diversity: Fixed mixtures limit the number of unique speaker combinations the model sees. This limitation motivates the complementary Dynamic Mixing Augmentation approach.