Implementation:Speechbrain Speechbrain Prepare Common Voice For Whisper

Field	Value
API	prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False)
Source	recipes/CommonVoice/common_voice_prepare.py:L30-153
Import	from common_voice_prepare import prepare_common_voice
Type	API Doc (same API as CTC, different context for Whisper fine-tuning)
Inputs	CommonVoice TSV files (train.tsv, dev.tsv, test.tsv)
Outputs	CSV files (train.csv, dev.csv, test.csv) with columns: ID, duration, wav, spk_id, wrd
Related Principle	Principle:Speechbrain_Speechbrain_Whisper_Dataset_Preparation

Purpose

Converts Mozilla Common Voice TSV files into SpeechBrain-compatible CSV manifests for Whisper fine-tuning. This is the same function used for CTC-based ASR data preparation. For Whisper fine-tuning, two settings matter: the language parameter must match the Whisper model's target language so that tokenization is consistent, and accented_letters should be set to True for languages with diacritical marks.

Signature

def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):

Parameters

Parameter	Type	Default	Description
data_folder	str	required	Path to the Common Voice dataset folder for the target language (e.g., /datasets/CommonVoice/fr/)
save_folder	str	required	Directory where the output CSV files will be written
train_tsv_file	str	None	Path to the training TSV file. Defaults to data_folder/train.tsv
dev_tsv_file	str	None	Path to the development TSV file. Defaults to data_folder/dev.tsv
test_tsv_file	str	None	Path to the test TSV file. Defaults to data_folder/test.tsv
accented_letters	bool	False	If True, accented characters are preserved; if False, accents are stripped. Should typically be True for Whisper fine-tuning on languages with diacritics (e.g., French, German)
language	str	"en"	Language code for text normalization. Must match the Whisper model's language parameter
skip_prep	bool	False	If True, skips data preparation entirely (useful for resuming training)
convert_to_wav	bool	False	If True, converts .mp3 files to .wav using ffmpeg for faster decoding
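
The defaulting rule for the three TSV arguments can be sketched as follows. This is a minimal illustration of the behaviour described in the table, not code from the recipe: resolve_tsv_paths is a hypothetical helper, and the custom dev path is made up for the example.

```python
import os


def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: fall back to <data_folder>/<split>.tsv when a path is None."""
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        split: path if path is not None else os.path.join(data_folder, f"{split}.tsv")
        for split, path in given.items()
    }


# Only the dev split is overridden; train and test use the defaults.
paths = resolve_tsv_paths("/datasets/CommonVoice/fr", dev_tsv_file="/tmp/custom_dev.tsv")
```

Passing explicit paths, as the recipe YAML does, makes the configuration self-documenting; relying on the defaults is equivalent when the standard Common Voice layout is used.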

Usage Example (Whisper Fine-Tuning Context)

from common_voice_prepare import prepare_common_voice

# For French Whisper fine-tuning:
# - language="fr" must match Whisper(language="fr")
# - accented_letters=True to preserve French diacritics
prepare_common_voice(
    data_folder="/datasets/CommonVoice/fr",
    save_folder="results/whisper_fr/save",
    train_tsv_file="/datasets/CommonVoice/fr/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/fr/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/fr/test.tsv",
    accented_letters=True,
    language="fr",
    skip_prep=False,
)

YAML Configuration (Whisper Recipe)

In the Whisper fine-tuning YAML (train_hf_whisper.yaml), the preparation is configured as:

language: fr
data_folder: !PLACEHOLDER
accented_letters: True
skip_prep: False
train_tsv_file: !ref <data_folder>/train.tsv
dev_tsv_file: !ref <data_folder>/dev.tsv
test_tsv_file: !ref <data_folder>/test.tsv

And invoked in the training script via:

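from common_voice_prepare import prepare_common_voice
from speechbrain.utils.distributed import run_on_main

# hparams is the dictionary loaded from the YAML file above.
# run_on_main runs the preparation once, on the main process only,
# so distributed workers do not write the CSV files concurrently.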
run_on_main(
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)

Output Format

The generated CSV files have the following structure:

ID,duration,wav,spk_id,wrd
common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER
common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE
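
A manifest in this layout can be inspected with the standard csv module before training starts. The sketch below parses the two sample rows shown above from an in-memory string; load_manifest is a hypothetical helper name, and with a real manifest the handle would come from open("train.csv", newline="", encoding="utf-8") instead.

```python
import csv
import io

# The sample rows from this page, held in memory for the example.
sample = """ID,duration,wav,spk_id,wrd
common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER
common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE
"""


def load_manifest(handle):
    """Read manifest rows as dicts and convert the duration column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


rows = load_manifest(io.StringIO(sample))
total_seconds = sum(r["duration"] for r in rows)
```

Summing the duration column this way gives a quick check on how much audio each split contains.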

Internal Processing

The function performs the following steps for each TSV line:

1. Parses the TSV columns to extract client_id, audio path, and sentence text.
2. Constructs the full audio path: data_folder/clips/filename.
3. Optionally converts .mp3 to .wav using ffmpeg (if convert_to_wav=True).
4. Reads audio metadata to compute duration in seconds.
5. Applies Unicode normalization to the transcript text.
6. Applies language-specific text preprocessing (via language_specific_preprocess).
7. Strips accents from characters if accented_letters=False.
8. Removes multiple spaces and trims whitespace.
9. Filters out utterances shorter than 3 words (or 3 characters for CJK languages).
10. Writes the result as a CSV row.

Processing is parallelized via speechbrain.utils.parallel.parallel_map for efficiency on large datasets.
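
The transcript-related steps (Unicode normalization, optional accent stripping, whitespace cleanup, and the minimum-length filter) can be sketched with the standard library alone. This is an illustrative approximation, not the recipe's code: clean_transcript and strip_accents are hypothetical names, the NFKC normalization form is an assumption, and the language-specific preprocessing and the CJK character-count rule are left out.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (e.g. é becomes e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_transcript(sentence, accented_letters=False, min_words=3):
    """Return the cleaned transcript, or None if it has fewer than min_words words."""
    # Assumption: NFKC is used here; the recipe's normalization form may differ.
    text = unicodedata.normalize("NFKC", sentence)
    if not accented_letters:
        text = strip_accents(text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    if len(text.split(" ")) < min_words:
        return None
    return text


kept = clean_transcript("Il  a été   là ", accented_letters=True)
stripped = clean_transcript("Il  a été   là ", accented_letters=False)
dropped = clean_transcript("Bonjour  monde")
```

Here kept retains the diacritics, stripped has them removed, and dropped is None because the sentence has only two words.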

Key Difference for Whisper

When using this function for Whisper fine-tuning rather than CTC ASR:

accented_letters should typically be True because Whisper's tokenizer handles accented characters natively; stripping them would create a mismatch between the training transcripts and the accented text the pretrained model is expected to produce.
language must be set to the exact language code that matches the Whisper model's language configuration (e.g., "fr" for French, "de" for German).
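
One way to confirm that accents survived preparation is to count manifest rows whose transcript contains a non-ASCII character. The sketch below is a crude heuristic under stated assumptions: count_accented_rows is a hypothetical helper, the two rows are invented for illustration rather than taken from Common Voice, and any non-ASCII character (not only accented letters) counts as a hit.

```python
import csv
import io

# Illustrative rows only; a real check would open the generated train.csv.
manifest = """ID,duration,wav,spk_id,wrd
utt1,1.0,a.wav,spk1,IL A ÉTÉ LÀ
utt2,1.0,b.wav,spk2,BONJOUR LE MONDE
"""


def count_accented_rows(handle, column="wrd"):
    """Count rows whose transcript contains at least one non-ASCII character."""
    total = 0
    accented = 0
    for row in csv.DictReader(handle):
        total += 1
        if any(ord(ch) > 127 for ch in row[column]):
            accented += 1
    return accented, total


accented, total = count_accented_rows(io.StringIO(manifest))
```

For a language with diacritics, a count of zero accented rows on a large manifest would suggest that the data was prepared with accented_letters=False.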
