Principle:Speechbrain Speechbrain Whisper Dataset Preparation
| Field | Value |
|---|---|
| Concept | Preparing multilingual speech datasets for fine-tuning pretrained Whisper models |
| Domains | Data_Engineering, ASR |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Prepare_Common_Voice_For_Whisper |
Overview
Fine-tuning OpenAI's Whisper model on a specific language or domain requires converting raw speech corpora into the structured CSV manifests that SpeechBrain's data pipeline expects. The prepare_common_voice function handles this conversion, reading Mozilla Common Voice TSV files and producing CSV files with columns for utterance ID, duration, audio path, speaker ID, and transcription text.
Theory
Fine-tuning Whisper requires the same data preparation pipeline as standard CTC-based ASR training: corpus TSV files must be converted into CSV manifests that SpeechBrain's DynamicItemDataset can consume. However, several additional considerations arise when targeting a multilingual model such as Whisper:
- Language specification: The target language must be explicitly specified so that the Whisper tokenizer is configured with the correct language token. This is critical because Whisper conditions decoding on a special language token (for example, <|fr|> for French) placed at the start of the decoder sequence.
- Accent handling: The accented_letters parameter controls whether diacritical marks are preserved or stripped. For languages with essential diacritics (e.g., French, German), this must be set to True to avoid destroying linguistic information that Whisper's tokenizer expects.
- Text normalization: Each language has its own normalization rules. The preparation function applies Unicode normalization followed by language-specific preprocessing (e.g., handling of apostrophes in French, special characters in German, or Arabic script filtering). This normalization must be compatible with Whisper's internal text normalizer to avoid train-test mismatch.
- Minimum sentence length: Short utterances (fewer than 3 words, or fewer than 3 characters for CJK languages) are filtered out to avoid degenerate training examples.
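The accent and length rules above can be sketched with the standard library alone. This is a minimal illustration, not the code of the preparation function: the helper names (strip_accents, is_too_short, prepare_text), the set of character-counted language codes, and the choice of NFC as the Unicode normalization form are all assumptions made for the example.

```python
import unicodedata

# Assumption: language codes whose length is counted in characters, not words.
CHAR_COUNTED_LANGUAGES = {"ja", "zh-CN"}


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_too_short(text: str, language: str, minimum: int = 3) -> bool:
    """Return True when the utterance falls below the minimum length."""
    if language in CHAR_COUNTED_LANGUAGES:
        return len(text.replace(" ", "")) < minimum
    return len(text.split()) < minimum


def prepare_text(text: str, language: str, accented_letters: bool):
    """Normalize one transcript; return None if it should be filtered out."""
    text = unicodedata.normalize("NFC", text)  # assumed normalization form
    if not accented_letters:
        text = strip_accents(text)
    text = " ".join(text.split())  # collapse runs of whitespace
    return None if is_too_short(text, language) else text
```

Calling prepare_text with accented_letters=True leaves "Ça va très bien" unchanged, while accented_letters=False reduces it to "Ca va tres bien", which shows why stripping is unsuitable for languages where diacritics carry meaning.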
Data Flow
The data preparation pipeline follows this sequence:
- Input: Mozilla Common Voice TSV files (train.tsv, dev.tsv, test.tsv) containing columns for client_id, audio path, and sentence text.
- Audio validation: Each audio file is checked for existence and its duration is computed from the audio metadata.
- Text processing: Transcription text undergoes Unicode normalization, language-specific cleaning, optional accent stripping, and whitespace normalization.
- Filtering: Utterances that are too short or have missing audio files are removed.
- Output: CSV files (train.csv, dev.csv, test.csv) with columns: ID, duration, wav, spk_id, wrd.
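The sequence above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the actual preparation code: tsv_to_csv is a hypothetical helper, the TSV column names client_id, path, and sentence are assumed, the utterance ID is assumed to be the audio file name without its extension, and the duration is supplied by a caller-provided get_duration function so that the example does not depend on any audio library.

```python
import csv
import os
from typing import Callable


def tsv_to_csv(
    tsv_path: str,
    clips_dir: str,
    csv_path: str,
    get_duration: Callable[[str], float],
    min_words: int = 3,
) -> int:
    """Convert one TSV split into a CSV manifest; return the rows written."""
    written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, open(
        csv_path, "w", newline="", encoding="utf-8"
    ) as dst:
        reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        writer.writerow(["ID", "duration", "wav", "spk_id", "wrd"])
        for row in reader:
            audio_path = os.path.join(clips_dir, row["path"])
            # Audio validation: skip entries whose file is missing.
            if not os.path.isfile(audio_path):
                continue
            # Text processing: whitespace normalization only in this sketch.
            words = " ".join(row["sentence"].split())
            # Filtering: drop utterances that are too short.
            if len(words.split()) < min_words:
                continue
            utt_id = os.path.splitext(row["path"])[0]
            duration = get_duration(audio_path)
            writer.writerow(
                [utt_id, f"{duration:.3f}", audio_path, row["client_id"], words]
            )
            written += 1
    return written
```

Passing the duration function as an argument keeps the sketch independent of the audio backend; in practice it would read the duration from the file's metadata.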
Language-Specific Considerations
Different languages require different preprocessing strategies:
- English, French, Italian, Kinyarwanda: Regex-based filtering to retain only alphabetic characters and common punctuation, followed by uppercasing.
- German: The Eszett character (ß) is shielded during uppercasing so that it is preserved rather than expanded to "SS".
- French: Additional processing for contracted articles and apostrophes (e.g., "L'", "D'", "S'").
- Arabic and Farsi: Character-level filtering based on specific Unicode ranges for Arabic script.
- Irish (ga-IE): Lowercase-based normalization due to nondeterministic behavior of Irish uppercase conversion.
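A few of these strategies can be sketched as small functions. The function names, the placeholder character, the exact character class in the regular expression, and the French rule (separating an elided "L'", "D'", or "S'" from the following word) are assumptions chosen for illustration; the rules in the actual preparation script may differ.

```python
import re

# Hypothetical placeholder, assumed never to occur in real transcripts.
_ESZETT_PLACEHOLDER = "\u0000"


def normalize_german(text: str) -> str:
    """Uppercase while keeping "ß" (str.upper() would turn it into "SS")."""
    shielded = text.replace("ß", _ESZETT_PLACEHOLDER)
    return shielded.upper().replace(_ESZETT_PLACEHOLDER, "ß")


def normalize_latin(text: str) -> str:
    """Keep letters, digits, and apostrophes; uppercase; collapse spaces."""
    cleaned = re.sub(r"[^'’A-Za-z0-9À-ÖØ-öø-ÿ]+", " ", text)
    return " ".join(cleaned.upper().split())


def normalize_french(text: str) -> str:
    """Apply the Latin filter, then split elided L', D', S' from the next word."""
    text = normalize_latin(text).replace("’", "'")
    return re.sub(r"\b([LDS])'(?=\w)", r"\1' ", text)


def normalize_irish(text: str) -> str:
    """Use lowercase, since uppercase conversion is unreliable for Irish."""
    return " ".join(text.lower().split())
```

For example, normalize_german("Straße") yields "STRAßE", whereas a plain call to "Straße".upper() yields "STRASSE".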
Relationship to Whisper Fine-Tuning
The CSV manifests produced by this preparation step feed directly into the Whisper data tokenization pipeline, where audio is loaded and resampled, and text is tokenized using Whisper's byte-level BPE tokenizer. The language parameter set during preparation must match the language configured on the Whisper model to ensure consistent tokenization throughout training and evaluation.
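The language consistency requirement can be illustrated with a small, library-free sketch. The special token strings follow Whisper's published prompt format, but the helpers decoder_prefix and languages_match are hypothetical, and reducing a Common Voice locale such as zh-CN to its base code is an assumption about how the two naming schemes might be reconciled.

```python
def decoder_prefix(language_code: str, task: str = "transcribe",
                   timestamps: bool = False) -> list:
    """Build the special-token prefix that conditions Whisper's decoder."""
    tokens = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


def languages_match(prep_language: str, model_language: str) -> bool:
    """Compare base language codes, ignoring any region suffix (assumption)."""
    def base(code: str) -> str:
        return code.split("-")[0].lower()
    return base(prep_language) == base(model_language)


# Fail early if the manifests and the model disagree on the language.
prep_language, model_language = "fr", "fr"
if not languages_match(prep_language, model_language):
    raise ValueError("Preparation language does not match the model language")
prefix = decoder_prefix(model_language)
```

A check of this kind is cheap to run before training starts and catches a mismatch that would otherwise surface only as degraded transcription quality.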