# Principle:Speechbrain_Speechbrain_LibriTTS_Data_Preparation
| Property | Value |
|---|---|
| Concept | Preparing multi-speaker speech synthesis datasets with text-audio alignment |
| Domains | Data_Engineering, Text_to_Speech |
| Repository | speechbrain/speechbrain |
| Source File | recipes/LibriTTS/libritts_prepare.py |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Prepare_Libritts |
## Overview
TTS training requires high-quality text-audio pairs with speaker identity information. The LibriTTS corpus provides clean audiobook recordings with normalized text transcriptions derived from the LibriSpeech dataset. Data preparation transforms raw audio and text into structured manifest files that downstream training pipelines consume.
## Theoretical Foundation
Speech synthesis models learn a mapping from textual input to acoustic output. The quality of this mapping depends heavily on the quality and organization of the training data. Key requirements include:
- Text-audio alignment: Each audio file must be paired with its exact transcription. LibriTTS provides sentence-level segmentation with accompanying `.normalized.txt` files that contain cleaned, normalized text free of abbreviations and special formatting.
- Speaker identity: Multi-speaker TTS models must learn to produce speech in different voices. This requires a speaker identifier for each utterance, enabling the model to condition its output on speaker identity. In LibriTTS, the speaker ID is embedded in the file naming convention: `{speaker_id}_{chapter_id}_{utterance_id}.wav`.
- Duration filtering: Very short utterances (under 1 second) provide insufficient context for learning prosody and attention alignment. Very long utterances (over 10 seconds) cause memory issues during training. Filtering ensures training stability.
- Sample rate consistency: Mel-spectrogram computation requires a consistent sample rate across all utterances. LibriTTS provides audio at 24 kHz, but TTS training commonly uses 16 kHz or 22050 Hz depending on the mel-spectrogram configuration. Resampling ensures uniformity.
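The duration filter described above amounts to a small predicate. The sketch below follows the 1-second and 10-second bounds stated in the text; the function name and exact signature are illustrative, not the recipe's actual API:

```python
def within_duration_bounds(n_samples: int, sample_rate: int,
                           min_s: float = 1.0, max_s: float = 10.0) -> bool:
    """Keep only utterances whose length supports stable TTS training."""
    duration = n_samples / sample_rate
    return min_s <= duration <= max_s

# A 0.5 s clip at 24 kHz is dropped; a 3 s clip is kept.
```

Computing the duration from sample count and rate avoids decoding the whole waveform when only the header has been read.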
## Data Organization
LibriTTS follows a hierarchical directory structure:
```
LibriTTS/
    train-clean-100/
        {speaker_id}/
            {chapter_id}/
                {speaker_id}_{chapter_id}_{utterance_id}.wav
                {speaker_id}_{chapter_id}_{utterance_id}.normalized.txt
    train-clean-360/
    dev-clean/
    test-clean/
```
Each subset (e.g., train-clean-100, dev-clean) contains a different partition of speakers. The preparation process can either use predefined subsets for train/valid/test splits or randomly split a combined subset list using a configurable ratio.
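Because the speaker and chapter IDs are encoded in the file name, they can be recovered by splitting on underscores. This is a hypothetical helper that mirrors the naming convention above; the recipe itself may extract these fields differently:

```python
import os

def parse_libritts_name(wav_path: str):
    """Extract IDs from '{speaker_id}_{chapter_id}_{utterance_id}.wav'."""
    uttid = os.path.splitext(os.path.basename(wav_path))[0]
    speaker_id, chapter_id = uttid.split("_")[:2]
    return uttid, speaker_id, chapter_id
```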
## Manifest Format
The output JSON manifest maps utterance IDs to metadata records:
```json
{
  "116_288045_000003_000002": {
    "uttid": "116_288045_000003_000002",
    "wav": "/data/LibriTTS/train-clean-100/116/288045/116_288045_000003_000002.wav",
    "duration": 3.45,
    "spk_id": "116",
    "label": "The quick brown fox jumped over the lazy dog.",
    "segment": true
  }
}
```
The `segment` field indicates whether random segment extraction should be applied (used during vocoder training to crop fixed-length audio segments from longer utterances).
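Assembling one such record is straightforward. This is a minimal sketch assuming the field names shown in the example above; `build_entry` is an illustrative name, not the recipe's function:

```python
import json

def build_entry(uttid, wav_path, duration, label, use_segments=False):
    """Assemble one manifest record keyed by utterance ID."""
    return {
        uttid: {
            "uttid": uttid,
            "wav": wav_path,
            "duration": duration,
            "spk_id": uttid.split("_")[0],  # speaker ID is the first token
            "label": label,
            "segment": use_segments,
        }
    }

manifest = build_entry(
    "116_288045_000003_000002",
    "/data/LibriTTS/train-clean-100/116/288045/116_288045_000003_000002.wav",
    3.45,
    "The quick brown fox jumped over the lazy dog.",
)
serialized = json.dumps(manifest, indent=2)  # matches the layout above
```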
## Split Strategies
Two split strategies are supported:
- Explicit splits: Predefined LibriTTS subsets are assigned to train, valid, and test sets. For example, `train-clean-100` for training, `dev-clean` for validation, and `test-clean` for testing. This ensures no speaker overlap between splits.
- Random splits: A combined list of utterances from specified subsets is randomly partitioned according to a ratio (default: 80/10/10). The random seed is fixed for reproducibility.
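The random strategy can be sketched as follows. The 80/10/10 default mirrors the description; the helper name and the particular seed value are illustrative assumptions:

```python
import random

def random_split(items, ratios=(0.8, 0.1, 0.1), seed=1234):
    """Shuffle deterministically, then slice into train/valid/test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed -> reproducible split
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```

Using a dedicated `random.Random(seed)` instance, rather than the module-level RNG, keeps the split reproducible even if other code consumes random numbers in between.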
## Model-Specific Preprocessing
The data preparation can be customized per model:
- Tacotron2: Computes phoneme representations of text labels using a pretrained Grapheme-to-Phoneme (G2P) model from SpeechBrain (`speechbrain/soundchoice-g2p`).
- HiFi-GAN: Skips phoneme computation since vocoders operate on mel-spectrograms rather than text.
- Other models: Computes both character-level and phoneme-level labels.
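The per-model branching above amounts to a simple dispatch. The sketch below is illustrative only; the function name, model-name strings, and return convention are assumptions, not the recipe's API:

```python
def label_kinds(model_name: str) -> set:
    """Decide which text representations to compute for a given model."""
    name = model_name.lower()
    if name == "tacotron2":
        return {"phonemes"}                 # G2P-converted labels only
    if name == "hifigan":
        return set()                        # vocoder: no text labels needed
    return {"characters", "phonemes"}       # default: both representations
```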
## Key Considerations
- Reproducibility: The random seed is set before any splitting or shuffling operations
- Idempotency: The preparation checks for existing output files and skips if already completed
- Audio integrity: Files are resampled in-place when the source sample rate does not match the target
- Text cleaning: Curly braces are stripped from normalized text to prevent tokenization issues
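Two of these considerations can be sketched directly. Both helpers are illustrative names, not the recipe's actual functions:

```python
import os

def clean_label(text: str) -> str:
    """Strip curly braces that would otherwise confuse tokenization."""
    return text.replace("{", "").replace("}", "")

def already_prepared(manifest_paths) -> bool:
    """Idempotency check: skip preparation if every manifest file exists."""
    return all(os.path.isfile(p) for p in manifest_paths)
```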
## See Also
- Implementation:Speechbrain_Speechbrain_Prepare_Libritts - The `prepare_libritts` function implementing this preparation
- Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation - Speaker embeddings computed after data preparation