Implementation:Speechbrain Speechbrain Prepare Common Voice

Field	Value
Implementation Name	Prepare_Common_Voice
API Signature	`prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False)`
Source File	recipes/CommonVoice/common_voice_prepare.py:L30-153
Import	`from common_voice_prepare import prepare_common_voice`
Type	API Doc
Related Principle	Principle:Speechbrain_Speechbrain_Data_Preparation_For_CTC_ASR

Description

The prepare_common_voice function converts Mozilla CommonVoice TSV data files into standardized CSV manifests suitable for SpeechBrain's DynamicItemDataset. It handles the complete data preparation pipeline including file validation, audio duration extraction, text normalization, language-specific preprocessing, and optional MP3-to-WAV conversion.

Inputs

Parameter	Type	Default	Description
`data_folder`	str	(required)	Path to the folder where the original CommonVoice dataset is stored. Should include the language subfolder, e.g., `/datasets/CommonVoice/en/`. Must contain a `clips/` subdirectory with audio files.
`save_folder`	str	(required)	Directory where the output CSV files will be written. Created automatically if it does not exist.
`train_tsv_file`	str	None	Path to the train TSV file. If None, defaults to `data_folder + "/train.tsv"`.
`dev_tsv_file`	str	None	Path to the dev TSV file. If None, defaults to `data_folder + "/dev.tsv"`.
`test_tsv_file`	str	None	Path to the test TSV file. If None, defaults to `data_folder + "/test.tsv"`.
`accented_letters`	bool	False	If True, keep accent marks on characters. If False, strip diacritical marks and convert to closest ASCII equivalents.
`language`	str	"en"	Target language code (e.g., "en", "fr", "de", "ar", "zh-CN"). Controls language-specific text normalization rules.
`skip_prep`	bool	False	If True, skip data preparation entirely and return immediately. Useful when resuming training.
`convert_to_wav`	bool	False	If True, convert MP3 audio files to uncompressed WAV format using ffmpeg. Increases disk usage but may speed up training if CPU-limited during audio decoding.
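
The default-path rule for the three TSV parameters can be illustrated with a small sketch. The helper name `resolve_tsv_paths` is hypothetical and is not part of the recipe; it only mirrors the behaviour described in the table, where an unset path falls back to `data_folder + "/<split>.tsv"`.

```python
def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: fill in any unset TSV path from data_folder."""
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        split: path if path is not None else data_folder + f"/{split}.tsv"
        for split, path in given.items()
    }


# Only the dev split is overridden; train and test fall back to the defaults.
paths = resolve_tsv_paths("/datasets/CommonVoice/en", dev_tsv_file="/custom/dev.tsv")
```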

Outputs

Three CSV files are written to save_folder:

File	Description
`train.csv`	Training set manifest
`dev.csv`	Development/validation set manifest
`test.csv`	Test set manifest

Each CSV file contains the following columns:

Column	Type	Description
`ID`	str	Unique utterance identifier derived from the audio filename (without extension)
`duration`	float	Duration of the audio clip in seconds, computed from audio metadata
`wav`	str	Full path to the audio file (MP3 or WAV depending on `convert_to_wav`)
`spk_id`	str	Speaker identifier from the CommonVoice `client_id` field
`wrd`	str	Normalized transcription text
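
As a rough illustration of the manifest layout, the sketch below parses a CSV with the five columns listed above using only the standard library. The sample row is made up for illustration, and `load_manifest` is a hypothetical helper, not a SpeechBrain function.

```python
import csv
import io

EXPECTED_COLUMNS = ["ID", "duration", "wav", "spk_id", "wrd"]

# Illustrative manifest content; a real file would be opened from save_folder.
sample_manifest = io.StringIO(
    "ID,duration,wav,spk_id,wrd\n"
    "sample_0001,3.25,/data/clips/sample_0001.mp3,spk_a,HELLO WORLD AGAIN\n"
)


def load_manifest(handle):
    """Hypothetical helper: read manifest rows and cast duration to float."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")
    rows = []
    for row in reader:
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


rows = load_manifest(sample_manifest)
total_seconds = sum(row["duration"] for row in rows)
```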

Key Behaviors

Idempotency

If all three output CSV files already exist, the function logs a message and returns without re-processing. This avoids redundant computation when restarting training.
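
A minimal sketch of such an existence check is shown below. It assumes the check amounts to testing that each of the three CSV paths is a file; the function name `outputs_exist` is hypothetical, and the recipe's own `skip` helper may differ in detail.

```python
import os
import tempfile


def outputs_exist(save_csv_train, save_csv_dev, save_csv_test):
    """Return True only when all three manifest files are already on disk."""
    return all(
        os.path.isfile(path)
        for path in (save_csv_train, save_csv_dev, save_csv_test)
    )


# Demonstration in a throwaway directory: the check flips once all files exist.
with tempfile.TemporaryDirectory() as save_folder:
    csv_paths = [os.path.join(save_folder, f"{split}.csv") for split in ("train", "dev", "test")]
    before = outputs_exist(*csv_paths)
    for path in csv_paths:
        open(path, "w").close()
    after = outputs_exist(*csv_paths)
```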

Text Normalization Pipeline

For each utterance, the text undergoes:

1. Unicode normalization via unicode_normalisation()
2. Language-specific preprocessing via language_specific_preprocess() -- applies regex-based character filtering and case transformation specific to the language
3. Accent stripping (when accented_letters=False) -- removes diacritical marks using Unicode NFD decomposition
4. Whitespace normalization -- collapses multiple spaces, strips leading/trailing whitespace
5. Minimum length filtering -- discards utterances with fewer than 3 words (or fewer than 3 characters for Chinese/Japanese)
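
The language-independent parts of this pipeline can be sketched as follows. This is a simplified illustration, not the recipe's code: the language-specific step is left out entirely, the function name `normalize_transcript` is hypothetical, and the language codes used for the character-count rule ("ja", "zh-CN") are an assumption.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop diacritics: NFD splits base letters from marks, ASCII encoding discards the marks."""
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")


def normalize_transcript(text, language="en", accented_letters=False):
    """Simplified sketch; language-specific preprocessing is intentionally omitted."""
    words = str(text)
    if not accented_letters:
        words = strip_accents(words)
    # Collapse runs of whitespace and trim both ends.
    words = re.sub(r"\s+", " ", words).strip()
    # Assumed language codes for the character-based length rule.
    if language in ("ja", "zh-CN"):
        if len(words) < 3:
            return None
    elif len(words.split(" ")) < 3:
        return None
    return words


kept = normalize_transcript("  Café   au  lait ")
dropped = normalize_transcript("too short")
```

Note that the ASCII encoding step removes every non-ASCII character, not only combining marks, so this sketch is only meaningful with accented_letters=True for languages written in non-Latin scripts.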

Parallel Processing

The per-line processing is parallelized using speechbrain.utils.parallel.parallel_map, which distributes the audio metadata reading and text normalization across multiple workers for faster processing of large datasets.
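
The general pattern (map a per-line function over all lines, then drop the filtered entries) can be sketched with the standard library. This uses `concurrent.futures.ThreadPoolExecutor` as a stand-in and does not reproduce the `parallel_map` API. The function `process_line_sketch` and the assumed column order (client_id, path, sentence) are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def process_line_sketch(line):
    """Hypothetical per-line worker: parse a TSV line, return None to filter it out."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    # Assumed column order for illustration: client_id, path, sentence.
    client_id, path, sentence = fields[:3]
    if len(sentence.split()) < 3:
        return None
    return {"ID": path.rsplit(".", 1)[0], "spk_id": client_id, "wrd": sentence}


lines = [
    "spk_a\tclip_a.mp3\tthis one is kept\n",
    "spk_b\tclip_b.mp3\ttoo short\n",
    "spk_c\tclip_c.mp3\tanother valid sentence here\n",
]

# pool.map preserves input order; filtered lines come back as None and are dropped.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = [row for row in pool.map(process_line_sketch, lines) if row is not None]
```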

Atomic File Writing

Output CSV files are first written to a temporary file (.tmp suffix) and then atomically renamed using os.replace(). This prevents corrupted partial files from being left behind if the process is interrupted.
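
The write-then-rename pattern described here can be sketched as below. The helper name `write_csv_atomically` and the sample row are hypothetical; only the `.tmp` suffix and the use of `os.replace()` come from the description above.

```python
import csv
import os
import tempfile


def write_csv_atomically(csv_file, header, rows):
    """Write to '<csv_file>.tmp' first, then rename it over the final path."""
    tmp_file = csv_file + ".tmp"
    with open(tmp_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    # The final path only ever appears once the file is complete.
    os.replace(tmp_file, csv_file)


with tempfile.TemporaryDirectory() as save_folder:
    target = os.path.join(save_folder, "train.csv")
    write_csv_atomically(
        target,
        ["ID", "duration", "wav", "spk_id", "wrd"],
        [["sample_0001", 3.25, "/data/clips/sample_0001.mp3", "spk_a", "HELLO WORLD AGAIN"]],
    )
    final_exists = os.path.isfile(target)
    tmp_left = os.path.exists(target + ".tmp")
    with open(target, encoding="utf-8") as handle:
        first_line = handle.readline().strip()
```

If the process is interrupted during the write, only the `.tmp` file is affected, so an existence check on the final CSV path is not fooled by a partial file.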

Usage Example

from common_voice_prepare import prepare_common_voice

# Prepare the CommonVoice English dataset
prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="results/commonvoice_ctc/save",
    train_tsv_file="/datasets/CommonVoice/en/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/en/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/en/test.tsv",
    accented_letters=False,
    language="en",
    skip_prep=False,
)

Integration in CTC Training Recipe

In the training script, the call is wrapped in run_on_main so that, in distributed setups, data preparation runs only on the main process rather than once per worker:

from speechbrain.utils.distributed import run_on_main
from common_voice_prepare import prepare_common_voice

run_on_main(
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)

Internal Helper Functions

Function	Description
`process_line(line, ...)`	Processes a single TSV line: extracts fields, reads audio duration, normalizes text. Returns a `CVRow` dataclass or None if the line should be filtered out.
`create_csv(convert_to_wav, orig_tsv_file, csv_file, data_folder, accented_letters, language)`	Creates a single CSV file from one TSV file, processing all lines in parallel.
`skip(save_csv_train, save_csv_dev, save_csv_test)`	Checks if all three output files already exist.
`check_commonvoice_folders(data_folder)`	Validates that the data folder contains a `clips/` subdirectory.
`unicode_normalisation(text)`	Applies Unicode normalization to transcription text.
`strip_accents(text)`	Removes diacritical marks using Unicode NFD decomposition and ASCII encoding.
`language_specific_preprocess(language, words)`	Applies language-dependent text normalization rules (character filtering, case handling).
`convert_mp3_to_wav(audio_mp3_path)`	Converts an MP3 file to WAV format using ffmpeg.

Dependencies

speechbrain.dataio.dataio.read_audio_info -- for reading audio file metadata (duration, sample rate)
speechbrain.utils.parallel.parallel_map -- for parallelized per-line processing
ffmpeg (system binary) -- required only when convert_to_wav=True
