Implementation:Speechbrain Speechbrain Prepare Common Voice

From Leeroopedia
Revision as of 16:44, 16 February 2026 by Admin (Auto-imported from implementations/Speechbrain_Speechbrain_Prepare_Common_Voice.md)


| Field | Value |
|---|---|
| Implementation Name | Prepare_Common_Voice |
| API Signature | prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False) |
| Source File | recipes/CommonVoice/common_voice_prepare.py:L30-153 |
| Import | from common_voice_prepare import prepare_common_voice |
| Type | API Doc |
| Related Principle | Principle:Speechbrain_Speechbrain_Data_Preparation_For_CTC_ASR |

Description

The prepare_common_voice function converts Mozilla CommonVoice TSV data files into standardized CSV manifests suitable for SpeechBrain's DynamicItemDataset. It handles the complete data preparation pipeline including file validation, audio duration extraction, text normalization, language-specific preprocessing, and optional MP3-to-WAV conversion.

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_folder | str | (required) | Path to the folder where the original CommonVoice dataset is stored. Should include the language subfolder, e.g., /datasets/CommonVoice/en/. Must contain a clips/ subdirectory with audio files. |
| save_folder | str | (required) | Directory where the output CSV files will be written. Created automatically if it does not exist. |
| train_tsv_file | str | None | Path to the train TSV file. If None, defaults to data_folder + "/train.tsv". |
| dev_tsv_file | str | None | Path to the dev TSV file. If None, defaults to data_folder + "/dev.tsv". |
| test_tsv_file | str | None | Path to the test TSV file. If None, defaults to data_folder + "/test.tsv". |
| accented_letters | bool | False | If True, keep accent marks on characters. If False, strip diacritical marks and convert to closest ASCII equivalents. |
| language | str | "en" | Target language code (e.g., "en", "fr", "de", "ar", "zh-CN"). Controls language-specific text normalization rules. |
| skip_prep | bool | False | If True, skip data preparation entirely and return immediately. Useful when resuming training. |
| convert_to_wav | bool | False | If True, convert MP3 audio files to uncompressed WAV format using ffmpeg. Increases disk usage but may speed up training if CPU-limited during audio decoding. |
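
The default TSV paths described above can be sketched as follows. This is a minimal illustration, not the function's own code: resolve_tsv_paths is a hypothetical helper name, and only the data_folder + "/<split>.tsv" rule comes from the parameter descriptions.

```python
def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: fill in the documented default for any TSV path left as None."""
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        # Documented default: data_folder + "/<split>.tsv"
        split: path if path is not None else data_folder + "/" + split + ".tsv"
        for split, path in given.items()
    }


# Only the dev split is overridden; train and test fall back to the defaults.
paths = resolve_tsv_paths("/datasets/CommonVoice/en", dev_tsv_file="/tmp/custom_dev.tsv")
```

Passing explicit paths is mainly useful when the splits live outside data_folder or use non-standard file names.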

Outputs

Three CSV files are written to save_folder:

| File | Description |
|---|---|
| train.csv | Training set manifest |
| dev.csv | Development/validation set manifest |
| test.csv | Test set manifest |

Each CSV file contains the following columns:

| Column | Type | Description |
|---|---|---|
| ID | str | Unique utterance identifier derived from the audio filename (without extension) |
| duration | float | Duration of the audio clip in seconds, computed from audio metadata |
| wav | str | Full path to the audio file (MP3 or WAV depending on convert_to_wav) |
| spk_id | str | Speaker identifier from the CommonVoice client_id field |
| wrd | str | Normalized transcription text |
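
To make the manifest layout concrete, the sketch below writes and reads back one row with the five columns listed above, using only the standard csv module. The row values (utterance ID, speaker ID, transcript, duration) are invented for illustration and do not come from the dataset.

```python
import csv
import io

# Column names taken from the table above.
FIELDNAMES = ["ID", "duration", "wav", "spk_id", "wrd"]

# Write one illustrative row to an in-memory buffer (all values are made up).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerow(
    {
        "ID": "common_voice_en_0000001",
        "duration": 3.2,
        "wav": "/datasets/CommonVoice/en/clips/common_voice_en_0000001.mp3",
        "spk_id": "speaker_a",
        "wrd": "HELLO WORLD AGAIN",
    }
)

# Read the manifest back; csv returns every field as a string,
# so the duration has to be converted before doing arithmetic on it.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
total_seconds = sum(float(row["duration"]) for row in rows)
```

The same pattern is handy for a quick sanity check of a generated manifest, for example summing the durations of train.csv before starting a long run.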

Key Behaviors

Idempotency

If all three output CSV files already exist, the function logs a message and returns without re-processing. This avoids redundant computation when restarting training.
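
A minimal sketch of such an existence check is shown below. The helper name all_manifests_exist is hypothetical; the recipe's own skip() takes the three CSV paths as separate arguments and may differ in detail.

```python
import os
import tempfile


def all_manifests_exist(save_folder):
    """Hypothetical check: True only if train.csv, dev.csv and test.csv are all present."""
    names = ("train.csv", "dev.csv", "test.csv")
    return all(os.path.isfile(os.path.join(save_folder, name)) for name in names)


with tempfile.TemporaryDirectory() as tmp:
    before = all_manifests_exist(tmp)  # empty folder, so nothing to skip yet
    for name in ("train.csv", "dev.csv", "test.csv"):
        open(os.path.join(tmp, name), "w").close()  # create empty placeholder files
    after = all_manifests_exist(tmp)  # all three files now exist
```

Note that a check of this kind only looks at file presence, so deleting the CSV files is the way to force a fresh preparation after changing options such as language or accented_letters.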

Text Normalization Pipeline

For each utterance, the text undergoes:

  1. Unicode normalization via unicode_normalisation()
  2. Language-specific preprocessing via language_specific_preprocess() -- applies regex-based character filtering and case transformation specific to the language
  3. Accent stripping (when accented_letters=False) -- removes diacritical marks using Unicode NFD decomposition
  4. Whitespace normalization -- collapses multiple spaces, strips leading/trailing whitespace
  5. Minimum length filtering -- discards utterances with fewer than 3 words (or fewer than 3 characters for Chinese/Japanese)
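
The steps above can be sketched with the standard library as follows. This is an approximation under stated assumptions, not the recipe's code: the NFKC form in step 1 and the plain upper-casing that stands in for step 2 are assumptions, and normalize_transcript is a hypothetical name. Only the NFD-plus-ASCII accent stripping and the length thresholds follow the description on this page.

```python
import re
import unicodedata


def strip_accents(text):
    # NFD splits "é" into "e" plus a combining accent; ASCII encoding then drops the accent.
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("utf-8")


def normalize_transcript(text, accented_letters=False, language="en"):
    """Hypothetical sketch of the five steps; returns None when the utterance is filtered out."""
    text = unicodedata.normalize("NFKC", text)  # step 1 (normalization form is an assumption)
    text = text.upper()  # step 2 stand-in: the real rules are language-specific regexes
    if not accented_letters:
        text = strip_accents(text)  # step 3
    text = re.sub(r"\s+", " ", text).strip()  # step 4
    # Step 5: character count for Chinese/Japanese, word count otherwise.
    if language in ("ja", "zh-CN"):
        if len(text) < 3:
            return None
    elif len(text.split(" ")) < 3:
        return None
    return text


stripped = normalize_transcript("  Café   au lait ")
kept = normalize_transcript("  Café   au lait ", accented_letters=True)
too_short = normalize_transcript("Hello world")
```

Because ASCII-based stripping removes every non-ASCII character, a sketch like this only makes sense with accented_letters=True for languages written in non-Latin scripts.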

Parallel Processing

The per-line processing is parallelized using speechbrain.utils.parallel.parallel_map, which distributes the audio metadata reading and text normalization across multiple workers for faster processing of large datasets.
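
The pattern can be illustrated with the standard library. The sketch below uses concurrent.futures.ThreadPoolExecutor as a stand-in for parallel_map, so it shows the map-then-filter idea rather than SpeechBrain's API. The simplified process_line, the assumed column order (client_id, path, sentence), and the sample lines are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def process_line(line):
    """Simplified stand-in: parse one TSV line, return a row dict or None to drop it."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    # Assumed column order for this sketch: client_id, path, sentence.
    client_id, path, sentence = fields[0], fields[1], fields[2]
    if len(sentence.split()) < 3:
        return None  # too short, filtered out
    return {"ID": path.rsplit(".", 1)[0], "spk_id": client_id, "wrd": sentence.upper()}


lines = [
    "spk1\tclip_a.mp3\tthe quick brown fox\n",
    "spk2\tclip_b.mp3\ttoo short\n",
    "spk1\tclip_c.mp3\tone more valid sentence\n",
]

# Map every line across the workers, then drop the lines that returned None.
# Executor.map yields results in input order, so the output order is deterministic.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [row for row in pool.map(process_line, lines) if row is not None]
```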

Atomic File Writing

Output CSV files are first written to a temporary file (.tmp suffix) and then atomically renamed using os.replace(). This prevents corrupted partial files from being left behind if the process is interrupted.
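
The write-then-rename pattern can be sketched as below. The helper name write_csv_atomically and the sample row are hypothetical; the .tmp suffix and os.replace() follow the description above.

```python
import csv
import os
import tempfile


def write_csv_atomically(csv_file, header, rows):
    """Hypothetical helper: write to <csv_file>.tmp, then rename over the final path."""
    tmp_file = csv_file + ".tmp"
    with open(tmp_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # The final path only ever appears with complete content.
    os.replace(tmp_file, csv_file)


header = ["ID", "duration", "wav", "spk_id", "wrd"]
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "train.csv")
    write_csv_atomically(target, header, [["utt1", 2.5, "/clips/utt1.mp3", "spk1", "ONE TWO THREE"]])
    final_exists = os.path.isfile(target)
    tmp_left_behind = os.path.exists(target + ".tmp")
    with open(target, newline="", encoding="utf-8") as f:
        written = list(csv.reader(f))
```

If the process is interrupted during the write, only the .tmp file is affected, so a later existence check on train.csv does not mistake a partial file for a finished manifest.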

Usage Example

from common_voice_prepare import prepare_common_voice

# Prepare the CommonVoice English dataset
prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="results/commonvoice_ctc/save",
    train_tsv_file="/datasets/CommonVoice/en/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/en/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/en/test.tsv",
    accented_letters=False,
    language="en",
    skip_prep=False,
)

Integration in CTC Training Recipe

In the training script, data preparation is wrapped in run_on_main so that, in distributed setups, only the main process runs it:

from speechbrain.utils.distributed import run_on_main
from common_voice_prepare import prepare_common_voice

run_on_main(
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)

Internal Helper Functions

| Function | Description |
|---|---|
| process_line(line, ...) | Processes a single TSV line: extracts fields, reads audio duration, normalizes text. Returns a CVRow dataclass or None if the line should be filtered out. |
| create_csv(convert_to_wav, orig_tsv_file, csv_file, data_folder, accented_letters, language) | Creates a single CSV file from one TSV file, processing all lines in parallel. |
| skip(save_csv_train, save_csv_dev, save_csv_test) | Checks if all three output files already exist. |
| check_commonvoice_folders(data_folder) | Validates that the data folder contains a clips/ subdirectory. |
| unicode_normalisation(text) | Applies Unicode normalization to transcription text. |
| strip_accents(text) | Removes diacritical marks using Unicode NFD decomposition and ASCII encoding. |
| language_specific_preprocess(language, words) | Applies language-dependent text normalization rules (character filtering, case handling). |
| convert_mp3_to_wav(audio_mp3_path) | Converts an MP3 file to WAV format using ffmpeg. |
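
As an illustration of the folder validation listed above, the sketch below checks for the clips/ subdirectory. The name check_clips_folder, the error message, and the choice of FileNotFoundError are assumptions made for this example; the recipe's check_commonvoice_folders may behave differently.

```python
import os
import tempfile


def check_clips_folder(data_folder):
    """Hypothetical validation: require a clips/ subdirectory inside data_folder."""
    clips_folder = os.path.join(data_folder, "clips")
    if not os.path.isdir(clips_folder):
        raise FileNotFoundError(f"Expected a clips/ subdirectory in {data_folder}")
    return clips_folder


with tempfile.TemporaryDirectory() as tmp:
    # First call: the folder is empty, so the check should fail.
    try:
        check_clips_folder(tmp)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
    # Second call: after creating clips/, the check passes.
    os.mkdir(os.path.join(tmp, "clips"))
    found_name = os.path.basename(check_clips_folder(tmp))
```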

Dependencies

  • speechbrain.dataio.dataio.read_audio_info -- for reading audio file metadata (duration, sample rate)
  • speechbrain.utils.parallel.parallel_map -- for parallelized per-line processing
  • ffmpeg (system binary) -- required only when convert_to_wav=True
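
Since ffmpeg is an external binary, it can be worth confirming that it is on the PATH before enabling convert_to_wav. The sketch below is an optional pre-flight check that is not part of the recipe; ffmpeg_available is a hypothetical helper built on the standard shutil.which.

```python
import shutil


def ffmpeg_available():
    """Hypothetical pre-flight check: True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


# Only request WAV conversion when the binary can actually be found.
convert_to_wav = True
can_convert = convert_to_wav and ffmpeg_available()
```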
