Implementation:Speechbrain Speechbrain Prepare Common Voice For Whisper
| Field | Value |
|---|---|
| API | prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False) |
| Source | recipes/CommonVoice/common_voice_prepare.py:L30-153 |
| Import | from common_voice_prepare import prepare_common_voice |
| Type | API Doc (same API as CTC, different context for Whisper fine-tuning) |
| Inputs | CommonVoice TSV files (train.tsv, dev.tsv, test.tsv) |
| Outputs | CSV files (train.csv, dev.csv, test.csv) with columns: ID, duration, wav, spk_id, wrd |
| Related Principle | Principle:Speechbrain_Speechbrain_Whisper_Dataset_Preparation |
Purpose
Converts Mozilla Common Voice TSV files into SpeechBrain-compatible CSV manifests for Whisper fine-tuning. This is the same function used for CTC-based ASR data preparation. For Whisper fine-tuning, set language to the Whisper model's target language so that tokenization is consistent, and set accented_letters to True for languages with diacritical marks.
Signature
def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_folder | str | required | Path to the Common Voice dataset folder for the target language (e.g., /datasets/CommonVoice/fr/) |
| save_folder | str | required | Directory where the output CSV files will be written |
| train_tsv_file | str | None | Path to the training TSV file. Defaults to data_folder/train.tsv |
| dev_tsv_file | str | None | Path to the development TSV file. Defaults to data_folder/dev.tsv |
| test_tsv_file | str | None | Path to the test TSV file. Defaults to data_folder/test.tsv |
| accented_letters | bool | False | If True, accented characters are preserved; if False, they are stripped. Should typically be True for Whisper fine-tuning on languages with diacritics (e.g., French, German) |
| language | str | "en" | Language code for text normalization. Must match the Whisper model's language parameter |
| skip_prep | bool | False | If True, skips data preparation entirely (useful for resuming training) |
| convert_to_wav | bool | False | If True, converts .mp3 files to .wav using ffmpeg for faster decoding |
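The default-path behavior described in the table can be sketched as follows. This is a minimal illustration, not the function's actual code: `resolve_tsv_paths` is a hypothetical helper, and it assumes only what the table states, namely that an unset TSV argument falls back to `<data_folder>/<split>.tsv`.

```python
import os


def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: use <data_folder>/<split>.tsv when a path is None."""
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        split: path if path is not None else os.path.join(data_folder, f"{split}.tsv")
        for split, path in given.items()
    }


# Only the dev split is overridden; train and test fall back to the defaults.
paths = resolve_tsv_paths("/datasets/CommonVoice/fr", dev_tsv_file="/tmp/custom_dev.tsv")
```

Passing explicit paths, as the usage example does, avoids relying on the fallback and makes the chosen splits visible in the configuration.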
Usage Example (Whisper Fine-Tuning Context)
from common_voice_prepare import prepare_common_voice

# For French Whisper fine-tuning:
# - language="fr" must match Whisper(language="fr")
# - accented_letters=True to preserve French diacritics
prepare_common_voice(
    data_folder="/datasets/CommonVoice/fr",
    save_folder="results/whisper_fr/save",
    train_tsv_file="/datasets/CommonVoice/fr/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/fr/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/fr/test.tsv",
    accented_letters=True,
    language="fr",
    skip_prep=False,
)
YAML Configuration (Whisper Recipe)
In the Whisper fine-tuning YAML (train_hf_whisper.yaml), the preparation is configured as:
language: fr
data_folder: !PLACEHOLDER
accented_letters: True
skip_prep: False
train_tsv_file: !ref <data_folder>/train.tsv
dev_tsv_file: !ref <data_folder>/dev.tsv
test_tsv_file: !ref <data_folder>/test.tsv
And invoked in the training script via:
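from common_voice_prepare import prepare_common_voice
from speechbrain.utils.distributed import run_on_main

# hparams is the dictionary loaded from the YAML file above.
# run_on_main executes the preparation on the main process only,
# so distributed runs do not write the CSV files more than once.
run_on_main(
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)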
Output Format
The generated CSV files have the following structure:
ID,duration,wav,spk_id,wrd
common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER
common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE
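To sanity-check a generated manifest, the CSV can be read back with the standard library. The sketch below uses only the column names listed above; `load_manifest` is a hypothetical helper, and the in-memory sample stands in for a real `train.csv`.

```python
import csv
import io

EXPECTED_COLUMNS = ["ID", "duration", "wav", "spk_id", "wrd"]


def load_manifest(handle):
    """Hypothetical helper: read a manifest and convert durations to float."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


# In-memory copy of the two example rows shown above.
sample = io.StringIO(
    "ID,duration,wav,spk_id,wrd\n"
    "common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER\n"
    "common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE\n"
)
rows = load_manifest(sample)
total_seconds = sum(row["duration"] for row in rows)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.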
Internal Processing
The function performs the following steps for each TSV line:
- Parses the TSV columns to extract client_id, audio path, and sentence text.
- Constructs the full audio path: data_folder/clips/filename.
- Optionally converts .mp3 to .wav using ffmpeg (if convert_to_wav=True).
- Reads audio metadata to compute duration in seconds.
- Applies Unicode normalization to the transcript text.
- Applies language-specific text preprocessing (via language_specific_preprocess).
- Optionally strips accented characters (if accented_letters=False).
- Removes multiple spaces and trims whitespace.
- Filters out utterances shorter than 3 words (or 3 characters for CJK languages).
- Writes the result as a CSV row.
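The text-cleaning steps in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the function's actual code: the NFC normalization form, the set of languages filtered by character count, and both helper names are assumptions, and the language-specific preprocessing step is omitted.

```python
import re
import unicodedata

# Assumption: languages filtered by character count instead of word count.
CHAR_LEVEL_LANGUAGES = {"ja", "zh-CN"}


def strip_accents(text):
    """Drop combining marks after canonical decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_transcript(text, language="en", accented_letters=False):
    """Return the cleaned transcript, or None if it is too short to keep."""
    # Assumption: NFC is used here as the Unicode normalization form.
    text = unicodedata.normalize("NFC", text)
    if not accented_letters:
        text = strip_accents(text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    if language in CHAR_LEVEL_LANGUAGES:
        unit_count = len(text.replace(" ", ""))
    else:
        unit_count = len(text.split())
    return text if unit_count >= 3 else None


kept = normalize_transcript("  Il   était une fois ", language="fr", accented_letters=True)
stripped = normalize_transcript("  Il   était une fois ", language="fr", accented_letters=False)
dropped = normalize_transcript("Bonjour monde", language="fr", accented_letters=True)
```

The second call shows what accented_letters=False does to a French transcript: the diacritic is removed while the rest of the text is unchanged. The third call returns None because the sentence has fewer than three words.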
Processing is parallelized via speechbrain.utils.parallel.parallel_map for efficiency on large datasets.
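The general pattern of mapping a per-line worker over the TSV rows can be illustrated with the standard library. Note that this sketch swaps in `concurrent.futures.ThreadPoolExecutor` as a stand-in and does not call `parallel_map`; the worker `process_line` and the three-column TSV layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_line(line):
    """Hypothetical worker: split one TSV line into the fields used downstream."""
    # Assumed column order for this sketch: client_id, path, sentence.
    client_id, clip, sentence = line.rstrip("\n").split("\t")[:3]
    return {"spk_id": client_id, "clip": clip, "wrd": sentence}


lines = [
    "spk1\tclip_a.mp3\tun deux trois\n",
    "spk2\tclip_b.mp3\tquatre cinq six\n",
    "spk1\tclip_c.mp3\tsept huit neuf\n",
]

# Executor.map returns results in input order, so the rows stay aligned with the lines.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = list(pool.map(process_line, lines))
```

Because each line is processed independently, the per-line work (audio metadata lookup, text cleaning) can be distributed across workers without coordination.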
Key Difference for Whisper
When using this function for Whisper fine-tuning rather than CTC ASR:
- accented_letters should typically be True because Whisper's tokenizer handles accented characters natively, and stripping them would make the training transcripts differ from the accented text the model is expected to produce.
- language must be set to the exact language code that matches the Whisper model's language configuration (e.g., "fr" for French, "de" for German).
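Both points can be checked before training starts. The following is a minimal sketch of such a check; `check_prep_settings` is a hypothetical helper, and the set of languages with diacritics contains only the two examples named on this page.

```python
# Illustrative only: the two languages this page names as having diacritics.
DIACRITIC_LANGUAGES = {"fr", "de"}


def check_prep_settings(prep_language, whisper_language, accented_letters):
    """Hypothetical check: return a list of configuration problems (empty if none)."""
    problems = []
    if prep_language != whisper_language:
        problems.append(
            f"language mismatch: preparation uses {prep_language!r}, "
            f"Whisper uses {whisper_language!r}"
        )
    if prep_language in DIACRITIC_LANGUAGES and not accented_letters:
        problems.append(
            f"accented_letters is False for {prep_language!r}; diacritics will be stripped"
        )
    return problems


ok = check_prep_settings("fr", "fr", accented_letters=True)
bad = check_prep_settings("fr", "de", accented_letters=False)
```

Here `ok` is an empty list, while `bad` contains one message for the language mismatch and one for the stripped diacritics.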