Principle:Speechbrain Speechbrain Dataset Specific Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Speech_Processing, Corpus_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Dataset-specific data preparation is the process of converting raw speech corpus distribution formats into a standardized tabular manifest representation that decouples corpus-specific layouts from downstream training logic.
Description
Speech and audio corpora are distributed in a wide variety of native formats: RTTM annotations for diarization, TSV metadata for crowdsourced datasets, XML transcripts for broadcast speech, Kaldi-style segment files, and many others. Each corpus has its own directory structure, file naming conventions, annotation schema, and audio encoding. Dataset-specific data preparation scripts bridge this gap by parsing corpus-native formats and producing SpeechBrain's standardized CSV or JSON manifest files. These manifests contain uniform columns -- typically ID, duration, wav (audio file path), and task-specific labels such as transcriptions, speaker identifiers, emotion tags, or sound class labels -- enabling any downstream training recipe to consume any prepared corpus without modification.
Usage
Apply this principle whenever you integrate a new speech or audio corpus into a SpeechBrain training pipeline. Before any model training can begin, write or invoke a preparation script that reads the raw corpus files and emits standardized CSV/JSON manifests for each data split (train, dev/valid, test). This is the mandatory first stage of every recipe.
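The invocation pattern typically looks like the sketch below. The function name `prepare_my_corpus` and its arguments are illustrative, not a real SpeechBrain API; real recipes ship a corpus-specific `prepare_*.py` with a similar shape.

```python
from pathlib import Path

def prepare_my_corpus(data_folder, save_folder, skip_prep=False):
    """Emit train/dev/test CSV manifests for a hypothetical corpus.

    data_folder: root of the raw corpus distribution (unused in this sketch).
    save_folder: where the standardized manifests are written.
    skip_prep:   explicit opt-out, mirroring the skip_prep flag in recipes.
    """
    save_folder = Path(save_folder)
    manifests = [save_folder / f"{split}.csv" for split in ("train", "dev", "test")]
    # Idempotency guard: if everything is already prepared, do nothing.
    if skip_prep or all(m.exists() for m in manifests):
        return
    save_folder.mkdir(parents=True, exist_ok=True)
    for m in manifests:
        # A real script would parse corpus-native annotations here;
        # this sketch only writes the standard header row.
        m.write_text("ID,duration,wav,spk_id,wrd\n")
```

Training code then points its dataloaders at `save_folder/train.csv` etc., never at the raw corpus.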
Theoretical Basis
The core algorithmic pattern shared by all dataset preparation scripts follows a deterministic pipeline:
Algorithm: Dataset-Specific Data Preparation
Input: raw corpus directory D, output directory O, split definitions S
Output: CSV/JSON manifest files {M_train, M_dev, M_test}
1. VALIDATE that D contains expected corpus structure
2. CHECK if output manifests already exist in O (idempotency guard)
- If all manifests exist and skip_prep is set, RETURN early
3. For each split s in S:
a. PARSE corpus-native annotation files for split s
- RTTM files -> extract speaker, start_time, duration per segment
- TSV files -> extract audio_path, transcript, metadata per row
- XML files -> extract utterance boundaries, transcripts per element
- Kaldi files -> extract utterance-id, wav-path, transcript per line
- TextGrid -> extract tier intervals with labels
b. For each utterance u discovered:
i. RESOLVE audio file path (absolute or relative to data_root)
ii. COMPUTE duration from audio metadata or annotation timestamps
iii. NORMALIZE text transcription (unicode, casing, punctuation)
iv. FILTER invalid entries (missing audio, empty transcript, too short)
v. ASSIGN unique utterance ID
c. WRITE manifest M_s as CSV with columns:
ID, duration, wav, {task-specific label columns}
4. RETURN paths to generated manifests
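The pipeline above can be sketched in Python for the TSV case. The column names (`audio_path`, `transcript`, `duration`) and the function signature are assumptions for illustration; real corpora use their own schemas, and duration is often computed from the audio header rather than taken from the annotation.

```python
import csv
import unicodedata
from pathlib import Path

def prepare_split(tsv_path, data_root, out_csv, min_dur=0.5, max_dur=30.0):
    """Convert one corpus-native TSV split into a standard CSV manifest.

    Assumed TSV columns: audio_path, transcript, duration.
    """
    rows = []
    with open(tsv_path, encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f, delimiter="\t")):
            wav = Path(data_root) / row["audio_path"]          # resolve audio path
            dur = float(row["duration"])                       # from annotation
            # Normalize text: unicode form, whitespace, casing.
            wrd = unicodedata.normalize("NFC", row["transcript"]).strip().lower()
            # Filter invalid entries: missing audio, empty text, bad duration.
            if not wav.exists() or not wrd or not (min_dur <= dur <= max_dur):
                continue
            rows.append({"ID": f"utt_{i:06d}", "duration": dur,
                         "wav": str(wav), "wrd": wrd})
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["ID", "duration", "wav", "wrd"])
        writer.writeheader()
        writer.writerows(rows)
    return out_csv
```

Parsers for RTTM, XML, Kaldi, or TextGrid inputs differ only in step 3a; steps 3b-3c are shared.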
Standard Manifest Format
All preparation scripts converge on the same output schema:
ID,duration,wav,spk_id,wrd
utterance_001,3.42,/data/corpus/wavs/utt001.wav,speaker_A,hello world
utterance_002,5.17,/data/corpus/wavs/utt002.wav,speaker_B,good morning
The exact label columns vary by task: ASR manifests include wrd (word-level transcription) and optionally char (character-level); diarization manifests include speaker and start/stop boundaries; classification manifests include class_label or emotion; and language identification manifests include language.
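For JSON manifests, a common layout keys each entry by utterance ID with the same fields as the CSV columns. The snippet below is an illustrative sketch of that shape, not a normative schema:

```python
import json

# Same example utterance as the CSV above, expressed as a JSON manifest
# keyed by utterance ID (an assumed but common layout).
manifest = {
    "utterance_001": {
        "duration": 3.42,
        "wav": "/data/corpus/wavs/utt001.wav",
        "spk_id": "speaker_A",
        "wrd": "hello world",
    },
}
print(json.dumps(manifest, indent=2))
```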
Key Design Properties
- Idempotency -- preparation scripts detect existing outputs and skip redundant processing, ensuring safe re-runs
- DDP Safety -- in distributed training, only rank-0 executes preparation while other processes wait, preventing race conditions on filesystem writes
- Corpus Independence -- downstream training code depends only on the manifest schema, never on raw corpus formats
- Text Normalization -- transcripts undergo unicode normalization, optional accent stripping, language-specific filtering, and whitespace cleanup to ensure tokenizer consistency
- Duration-Based Filtering -- utterances outside acceptable duration bounds are excluded to prevent memory issues and improve batching efficiency
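The DDP-safety property can be sketched as a standalone rank-0 wrapper. SpeechBrain ships a helper with similar semantics (`run_on_main`); this version is an independent sketch that reads the conventional `RANK` environment variable and only synchronizes when a torch.distributed process group is active:

```python
import os

def run_on_main_then_wait(prep_fn, *args, **kwargs):
    """Run a preparation function on rank-0 only, then synchronize.

    Sketch of the DDP-safe preparation pattern; assumes the RANK
    environment variable set by common distributed launchers.
    """
    try:
        import torch.distributed as dist
        ddp = dist.is_available() and dist.is_initialized()
    except ImportError:
        ddp = False  # torch not installed: single-process mode
    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        prep_fn(*args, **kwargs)  # only rank-0 writes manifests
    if ddp:
        dist.barrier()  # other ranks wait until preparation is done
```

Combined with the idempotency guard, re-running a multi-GPU recipe neither duplicates work nor races on filesystem writes.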
Related Pages
- Implementation:Speechbrain_Speechbrain_Prepare_AMI_Diarization
- Implementation:Speechbrain_Speechbrain_Prepare_AudioMNIST
- Implementation:Speechbrain_Speechbrain_Prepare_CommonLanguage
- Implementation:Speechbrain_Speechbrain_Prepare_DVoice
- Implementation:Speechbrain_Speechbrain_Prepare_ESC50
- Implementation:Speechbrain_Speechbrain_Prepare_Fisher_Callhome
- Implementation:Speechbrain_Speechbrain_Prepare_GigaSpeech
- Implementation:Speechbrain_Speechbrain_Prepare_GSC
- Implementation:Speechbrain_Speechbrain_Prepare_IEMOCAP
- Implementation:Speechbrain_Speechbrain_Prepare_KsponSpeech
- Implementation:Speechbrain_Speechbrain_Prepare_LibriSpeech_LM
- Implementation:Speechbrain_Speechbrain_Prepare_Libriheavy
- Implementation:Speechbrain_Speechbrain_Prepare_MEDIA
- Implementation:Speechbrain_Speechbrain_Prepare_PeoplesSpeech
- Implementation:Speechbrain_Speechbrain_Prepare_RescueSpeech
- Implementation:Speechbrain_Speechbrain_Prepare_Switchboard
- Implementation:Speechbrain_Speechbrain_Prepare_TIMIT
- Implementation:Speechbrain_Speechbrain_Prepare_UrbanSound8k
- Implementation:Speechbrain_Speechbrain_Prepare_Voicebank_CTC
- Implementation:Speechbrain_Speechbrain_Prepare_Voicebank_Revb
- Implementation:Speechbrain_Speechbrain_Prepare_VoxPopuli