Implementation:Speechbrain Speechbrain Prepare Common Voice For Whisper

From Leeroopedia


Field | Value
API | prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False)
Source | recipes/CommonVoice/common_voice_prepare.py:L30-153
Import | from common_voice_prepare import prepare_common_voice
Type | API Doc (same API as CTC, different context for Whisper fine-tuning)
Inputs | CommonVoice TSV files (train.tsv, dev.tsv, test.tsv)
Outputs | CSV files (train.csv, dev.csv, test.csv) with columns: ID, duration, wav, spk_id, wrd
Related Principle | Principle:Speechbrain_Speechbrain_Whisper_Dataset_Preparation

Purpose

Converts Mozilla Common Voice TSV files into SpeechBrain-compatible CSV manifests for Whisper fine-tuning. The function is the same one used for CTC-based ASR data preparation. In the Whisper context, two settings need attention: language should match the Whisper model's target language so that tokenization is consistent, and accented_letters should typically be set to True for languages with diacritical marks.

Signature

def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):

Parameters

Parameter | Type | Default | Description
data_folder | str | required | Path to the Common Voice dataset folder for the target language (e.g., /datasets/CommonVoice/fr/)
save_folder | str | required | Directory where the output CSV files will be written
train_tsv_file | str | None | Path to the training TSV file. Defaults to data_folder/train.tsv
dev_tsv_file | str | None | Path to the development TSV file. Defaults to data_folder/dev.tsv
test_tsv_file | str | None | Path to the test TSV file. Defaults to data_folder/test.tsv
accented_letters | bool | False | If True, accented characters are preserved; if False, they are stripped. Should typically be True for Whisper fine-tuning on languages with diacritics (e.g., French, German)
language | str | "en" | Language code used for text normalization. Should match the Whisper model's language parameter
skip_prep | bool | False | If True, skips data preparation entirely (useful when the CSV files already exist, e.g. when resuming training)
convert_to_wav | bool | False | If True, converts .mp3 files to .wav using ffmpeg for faster decoding
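
The default behaviour of the three TSV parameters can be pictured with a small sketch. The helper below, resolve_tsv_paths, is hypothetical and is not part of SpeechBrain; it only illustrates the documented rule that an unset path falls back to data_folder/<split>.tsv.

```python
import os


def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: fill in unset TSV paths from data_folder."""
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        # Keep an explicit path; otherwise fall back to data_folder/<split>.tsv.
        split: path if path is not None else os.path.join(data_folder, f"{split}.tsv")
        for split, path in given.items()
    }


# Only the dev split is overridden here; train and test use the defaults.
paths = resolve_tsv_paths("/datasets/CommonVoice/fr", dev_tsv_file="/tmp/custom_dev.tsv")
for split, path in paths.items():
    print(split, path)
```

Passing all three paths explicitly, as the usage example in this page does, makes the configuration easier to audit even though the defaults would resolve to the same files.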

Usage Example (Whisper Fine-Tuning Context)

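# common_voice_prepare.py is a recipe script (recipes/CommonVoice/), so it
# must be importable from the working directory or sys.path.
from common_voice_prepare import prepare_common_voice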

# For French Whisper fine-tuning:
# - language="fr" must match Whisper(language="fr")
# - accented_letters=True to preserve French diacritics
prepare_common_voice(
    data_folder="/datasets/CommonVoice/fr",
    save_folder="results/whisper_fr/save",
    train_tsv_file="/datasets/CommonVoice/fr/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/fr/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/fr/test.tsv",
    accented_letters=True,
    language="fr",
    skip_prep=False,
)

YAML Configuration (Whisper Recipe)

In the Whisper fine-tuning YAML (train_hf_whisper.yaml), the preparation is configured as:

language: fr
data_folder: !PLACEHOLDER
accented_letters: True
skip_prep: False
train_tsv_file: !ref <data_folder>/train.tsv
dev_tsv_file: !ref <data_folder>/dev.tsv
test_tsv_file: !ref <data_folder>/test.tsv

The training script then invokes the preparation via:

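from common_voice_prepare import prepare_common_voice
from speechbrain.utils.distributed import run_on_main

# hparams is the dictionary loaded from the recipe YAML above.
# run_on_main executes the preparation on the main process only,
# so multi-GPU (DDP) runs do not write the CSV files concurrently.
run_on_main(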
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)

Output Format

The generated CSV files have the following structure:

ID,duration,wav,spk_id,wrd
common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER
common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE
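
A quick way to sanity-check a generated manifest is to load it with Python's standard csv module. The sketch below parses the two example rows above from an in-memory string; in practice the same code would be pointed at train.csv in save_folder. The summary statistics are illustrative and not something the preparation function reports itself.

```python
import csv
import io

# Two rows copied from the example above; a real run would open train.csv instead.
manifest_text = """ID,duration,wav,spk_id,wrd
common_voice_fr_17299384,4.56,/datasets/CommonVoice/fr/clips/common_voice_fr_17299384.mp3,abc123def,IL FAUT BIEN COMMENCER
common_voice_fr_17299385,3.21,/datasets/CommonVoice/fr/clips/common_voice_fr_17299385.mp3,xyz789ghi,BONJOUR LE MONDE
"""

# DictReader uses the header row (ID, duration, wav, spk_id, wrd) as keys.
rows = list(csv.DictReader(io.StringIO(manifest_text)))

total_seconds = sum(float(row["duration"]) for row in rows)
speakers = {row["spk_id"] for row in rows}

print(f"{len(rows)} utterances, {total_seconds:.2f} s of audio, {len(speakers)} speakers")
```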

Internal Processing

The function performs the following steps for each TSV line:

  1. Parses the TSV columns to extract client_id, audio path, and sentence text.
  2. Constructs the full audio path: data_folder/clips/filename.
  3. Optionally converts .mp3 to .wav using ffmpeg.
  4. Reads audio metadata to compute duration in seconds.
  5. Applies Unicode normalization to the transcript text.
  6. Applies language-specific text preprocessing (via language_specific_preprocess).
  7. Optionally strips accented characters (if accented_letters=False).
  8. Removes multiple spaces and trims whitespace.
  9. Filters out utterances shorter than 3 words (or 3 characters for CJK languages).
  10. Writes the result as a CSV row.
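
The text-related steps (5, 7, 8 and 9) can be sketched with the standard library alone. This is an illustrative stand-in, not the SpeechBrain implementation: the function names, the choice of NFKC normalization, and the accent-stripping method are assumptions, the language-specific preprocessing of step 6 is omitted, and only the word-based length filter is shown.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters, then drop combining marks (e.g. "é" becomes "e")."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_transcript(sentence, accented_letters=False, min_words=3):
    """Illustrative stand-in for steps 5 and 7-9; returns None for filtered-out utterances."""
    # Step 5: Unicode normalization (NFKC is an assumption made for this sketch).
    text = unicodedata.normalize("NFKC", sentence)
    # Step 7: strip accents only when accented_letters is False.
    if not accented_letters:
        text = strip_accents(text)
    # Step 8: collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 9: drop utterances with fewer than min_words words.
    if len(text.split(" ")) < min_words:
        return None
    return text


print(clean_transcript("  Il  était   une fois ", accented_letters=True))
print(clean_transcript("  Il  était   une fois ", accented_letters=False))
print(clean_transcript("Bonjour monde"))
```

The second call shows what accented_letters=False does to a French transcript, which is the behaviour Whisper fine-tuning usually wants to avoid.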

Processing is parallelized via speechbrain.utils.parallel.parallel_map for efficiency on large datasets.

Key Difference for Whisper

When using this function for Whisper fine-tuning rather than CTC ASR:

  • accented_letters should typically be True because Whisper's tokenizer handles accented characters natively, and stripping them would create a mismatch between the training transcripts and the accented text Whisper is expected to produce.
  • language must be set to the exact language code that matches the Whisper model's language configuration (e.g., "fr" for French, "de" for German).
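
Both points can be checked before training starts. The helper below is a hypothetical sketch and is not part of SpeechBrain or the recipe; its name, its arguments, and the default list of diacritic languages (taken from the French and German examples on this page) are assumptions made for illustration.

```python
def check_whisper_prep_config(prep_language, whisper_language, accented_letters,
                              diacritic_languages=("fr", "de")):
    """Hypothetical helper: return a list of warnings about the preparation settings."""
    warnings = []
    # The preparation language should match the Whisper model's language setting.
    if prep_language != whisper_language:
        warnings.append(
            f"language mismatch: preparation uses {prep_language!r}, "
            f"Whisper uses {whisper_language!r}"
        )
    # Stripping accents is usually unwanted for languages with diacritics.
    if prep_language in diacritic_languages and not accented_letters:
        warnings.append(
            f"accented_letters=False strips diacritics for {prep_language!r}"
        )
    return warnings


# A consistent French configuration produces no warnings.
print(check_whisper_prep_config("fr", "fr", accented_letters=True))
# A mismatched configuration produces one warning per problem.
for message in check_whisper_prep_config("fr", "en", accented_letters=False):
    print(message)
```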
