# Principle:Speechbrain_Speechbrain_LibriTTS_Data_Preparation
| Property | Value |
|---|---|
| Concept | Preparing multi-speaker speech synthesis datasets with text-audio alignment |
| Domains | Data_Engineering, Text_to_Speech |
| Repository | speechbrain/speechbrain |
| Source File | recipes/LibriTTS/libritts_prepare.py |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Prepare_Libritts |
## Overview
TTS training requires high-quality text-audio pairs with speaker identity information. The LibriTTS corpus provides clean audiobook recordings with normalized text transcriptions derived from the LibriSpeech dataset. Data preparation transforms raw audio and text into structured manifest files that downstream training pipelines consume.
## Theoretical Foundation
Speech synthesis models learn a mapping from textual input to acoustic output. The quality of this mapping depends heavily on the quality and organization of the training data. Key requirements include:
- Text-audio alignment: Each audio file must be paired with its exact transcription. LibriTTS provides sentence-level segmentation with accompanying `.normalized.txt` files that contain cleaned, normalized text free of abbreviations and special formatting.
- Speaker identity: Multi-speaker TTS models must learn to produce speech in different voices. This requires a speaker identifier for each utterance, enabling the model to condition its output on speaker identity. In LibriTTS, the speaker ID is embedded in the file naming convention: `{speaker_id}_{chapter_id}_{utterance_id}.wav`.
- Duration filtering: Very short utterances (under 1 second) provide insufficient context for learning prosody and attention alignment. Very long utterances (over 10 seconds) cause memory issues during training. Filtering ensures training stability.
- Sample rate consistency: Mel-spectrogram computation requires a consistent sample rate across all utterances. LibriTTS provides audio at 24 kHz, but TTS training commonly uses 16 kHz or 22050 Hz depending on the mel-spectrogram configuration. Resampling ensures uniformity.
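The duration filter described above amounts to a small predicate. The sketch below follows the 1-second and 10-second bounds stated in the text; the function name and exact signature are illustrative, not the recipe's actual API:

```python
def within_duration_bounds(n_samples: int, sample_rate: int,
                           min_s: float = 1.0, max_s: float = 10.0) -> bool:
    """Keep only utterances whose length supports stable TTS training."""
    duration = n_samples / sample_rate
    return min_s <= duration <= max_s

# A 0.5 s clip at 24 kHz is dropped; a 3 s clip is kept.
```

Computing the duration from sample count and rate avoids decoding the whole waveform when only the header has been read.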
## Data Organization
LibriTTS follows a hierarchical directory structure:
```
LibriTTS/
    train-clean-100/
        {speaker_id}/
            {chapter_id}/
                {speaker_id}_{chapter_id}_{utterance_id}.wav
                {speaker_id}_{chapter_id}_{utterance_id}.normalized.txt
    train-clean-360/
    dev-clean/
    test-clean/
```
Each subset (e.g., train-clean-100, dev-clean) contains a different partition of speakers. The preparation process can either use predefined subsets for train/valid/test splits or randomly split a combined subset list using a configurable ratio.
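Because the speaker and chapter IDs are encoded in the file name, they can be recovered by splitting on underscores. This is a hypothetical helper that mirrors the naming convention above; the recipe itself may extract these fields differently:

```python
import os

def parse_libritts_name(wav_path: str):
    """Extract IDs from '{speaker_id}_{chapter_id}_{utterance_id}.wav'."""
    uttid = os.path.splitext(os.path.basename(wav_path))[0]
    speaker_id, chapter_id = uttid.split("_")[:2]
    return uttid, speaker_id, chapter_id
```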
## Manifest Format
The output JSON manifest maps utterance IDs to metadata records:
```json
{
  "116_288045_000003_000002": {
    "uttid": "116_288045_000003_000002",
    "wav": "/data/LibriTTS/train-clean-100/116/288045/116_288045_000003_000002.wav",
    "duration": 3.45,
    "spk_id": "116",
    "label": "The quick brown fox jumped over the lazy dog.",
    "segment": true
  }
}
```
The `segment` field indicates whether random segment extraction should be applied (used during vocoder training to crop fixed-length audio segments from longer utterances).
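Assembling one such record is straightforward. This is a minimal sketch assuming the field names shown in the example above; `build_entry` is an illustrative name, not the recipe's function:

```python
import json

def build_entry(uttid, wav_path, duration, label, use_segments=False):
    """Assemble one manifest record keyed by utterance ID."""
    return {
        uttid: {
            "uttid": uttid,
            "wav": wav_path,
            "duration": duration,
            "spk_id": uttid.split("_")[0],  # speaker ID is the first token
            "label": label,
            "segment": use_segments,
        }
    }

manifest = build_entry(
    "116_288045_000003_000002",
    "/data/LibriTTS/train-clean-100/116/288045/116_288045_000003_000002.wav",
    3.45,
    "The quick brown fox jumped over the lazy dog.",
)
serialized = json.dumps(manifest, indent=2)  # matches the layout above
```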
## Split Strategies
Two split strategies are supported:
- Explicit splits: Predefined LibriTTS subsets are assigned to train, valid, and test sets. For example, `train-clean-100` for training, `dev-clean` for validation, and `test-clean` for testing. This ensures no speaker overlap between splits.
- Random splits: A combined list of utterances from specified subsets is randomly partitioned according to a ratio (default: 80/10/10). The random seed is fixed for reproducibility.
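The random strategy can be sketched as follows. The 80/10/10 default mirrors the description; the helper name and the particular seed value are illustrative assumptions:

```python
import random

def random_split(items, ratios=(0.8, 0.1, 0.1), seed=1234):
    """Shuffle deterministically, then slice into train/valid/test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed -> reproducible split
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```

Using a dedicated `random.Random(seed)` instance, rather than the module-level RNG, keeps the split reproducible even if other code consumes random numbers in between.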
## Model-Specific Preprocessing
The data preparation can be customized per model:
- Tacotron2: Computes phoneme representations of text labels using a pretrained Grapheme-to-Phoneme (G2P) model from SpeechBrain (`speechbrain/soundchoice-g2p`).
- HiFi-GAN: Skips phoneme computation since vocoders operate on mel-spectrograms rather than text.
- Other models: Computes both character-level and phoneme-level labels.
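The per-model branching above amounts to a simple dispatch. The sketch below is illustrative only; the function name, model-name strings, and return convention are assumptions, not the recipe's API:

```python
def label_kinds(model_name: str) -> set:
    """Decide which text representations to compute for a given model."""
    name = model_name.lower()
    if name == "tacotron2":
        return {"phonemes"}                 # G2P-converted labels only
    if name == "hifigan":
        return set()                        # vocoder: no text labels needed
    return {"characters", "phonemes"}       # default: both representations
```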
## Key Considerations
- Reproducibility: The random seed is set before any splitting or shuffling operations
- Idempotency: The preparation checks for existing output files and skips if already completed
- Audio integrity: Files are resampled in-place when the source sample rate does not match the target
- Text cleaning: Curly braces are stripped from normalized text to prevent tokenization issues
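Two of these considerations can be sketched directly. Both helpers are illustrative names, not the recipe's actual functions:

```python
import os

def clean_label(text: str) -> str:
    """Strip curly braces that would otherwise confuse tokenization."""
    return text.replace("{", "").replace("}", "")

def already_prepared(manifest_paths) -> bool:
    """Idempotency check: skip preparation if every manifest file exists."""
    return all(os.path.isfile(p) for p in manifest_paths)
```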
## See Also
- Implementation:Speechbrain_Speechbrain_Prepare_Libritts - The `prepare_libritts` function implementing this preparation
- Principle:Speechbrain_Speechbrain_Speaker_Embedding_Precomputation - Speaker embeddings computed after data preparation