Implementation:Speechbrain Speechbrain Prepare Common Voice
| Field | Value |
|---|---|
| Implementation Name | Prepare_Common_Voice |
| API Signature | prepare_common_voice(data_folder, save_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None, accented_letters=False, language="en", skip_prep=False, convert_to_wav=False) |
| Source File | recipes/CommonVoice/common_voice_prepare.py:L30-153 |
| Import | from common_voice_prepare import prepare_common_voice |
| Type | API Doc |
| Related Principle | Principle:Speechbrain_Speechbrain_Data_Preparation_For_CTC_ASR |
Description
The prepare_common_voice function converts Mozilla CommonVoice TSV data files into standardized CSV manifests suitable for SpeechBrain's DynamicItemDataset. It handles the complete data preparation pipeline including file validation, audio duration extraction, text normalization, language-specific preprocessing, and optional MP3-to-WAV conversion.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_folder | str | (required) | Path to the folder where the original CommonVoice dataset is stored. Should include the language subfolder, e.g., /datasets/CommonVoice/en/. Must contain a clips/ subdirectory with audio files. |
| save_folder | str | (required) | Directory where the output CSV files will be written. Created automatically if it does not exist. |
| train_tsv_file | str | None | Path to the train TSV file. If None, defaults to data_folder + "/train.tsv". |
| dev_tsv_file | str | None | Path to the dev TSV file. If None, defaults to data_folder + "/dev.tsv". |
| test_tsv_file | str | None | Path to the test TSV file. If None, defaults to data_folder + "/test.tsv". |
| accented_letters | bool | False | If True, keep accent marks on characters. If False, strip diacritical marks and convert to closest ASCII equivalents. |
| language | str | "en" | Target language code (e.g., "en", "fr", "de", "ar", "zh-CN"). Controls language-specific text normalization rules. |
| skip_prep | bool | False | If True, skip data preparation entirely and return immediately. Useful when resuming training. |
| convert_to_wav | bool | False | If True, convert MP3 audio files to uncompressed WAV format using ffmpeg. Increases disk usage but may speed up training if CPU-limited during audio decoding. |
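The default-path rule for the three TSV parameters can be sketched as follows. This is a minimal illustration, not the function's actual code: resolve_tsv_paths is a hypothetical helper name, and only the data_folder + "/<split>.tsv" rule is taken from the table above.

```python
def resolve_tsv_paths(data_folder, train_tsv_file=None,
                      dev_tsv_file=None, test_tsv_file=None):
    """Hypothetical helper: fill in default TSV paths when none are given."""
    given = {
        "train": train_tsv_file,
        "dev": dev_tsv_file,
        "test": test_tsv_file,
    }
    # A path left as None falls back to data_folder + "/<split>.tsv".
    return {
        split: path if path is not None else data_folder + f"/{split}.tsv"
        for split, path in given.items()
    }


paths = resolve_tsv_paths("/datasets/CommonVoice/en",
                          dev_tsv_file="/custom/dev.tsv")
```

Here only the dev split is overridden; the train and test paths fall back to the files inside data_folder.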
Outputs
Three CSV files are written to save_folder:
| File | Description |
|---|---|
| train.csv | Training set manifest |
| dev.csv | Development/validation set manifest |
| test.csv | Test set manifest |
Each CSV file contains the following columns:
| Column | Type | Description |
|---|---|---|
| ID | str | Unique utterance identifier derived from the audio filename (without extension) |
| duration | float | Duration of the audio clip in seconds, computed from audio metadata |
| wav | str | Full path to the audio file (MP3 or WAV depending on convert_to_wav) |
| spk_id | str | Speaker identifier from the CommonVoice client_id field |
| wrd | str | Normalized transcription text |
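To make the column layout concrete, the sketch below parses a manifest with Python's standard csv module. The single data row is made up for illustration and does not come from a real CommonVoice release; only the five column names are taken from the table above.

```python
import csv
import io

# Made-up example row, for illustration only.
manifest_text = (
    "ID,duration,wav,spk_id,wrd\n"
    "sample_0001,3.5,/datasets/CommonVoice/en/clips/sample_0001.mp3,"
    "spk_a,HELLO WORLD AGAIN\n"
)

# Each row becomes a dict keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(manifest_text)))

# All values are read as strings, so duration must be cast before use.
total_seconds = sum(float(row["duration"]) for row in rows)
```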
Key Behaviors
Idempotency
If all three output CSV files already exist, the function logs a message and returns without re-processing. This avoids redundant computation when restarting training.
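A minimal sketch of this kind of existence check is shown below. The helper name all_outputs_exist is hypothetical; the real check lives in the skip() helper listed under Internal Helper Functions, whose body is not reproduced here.

```python
import os
import tempfile


def all_outputs_exist(*csv_paths):
    """Hypothetical check: True only if every output file is present."""
    return all(os.path.isfile(path) for path in csv_paths)


with tempfile.TemporaryDirectory() as save_folder:
    targets = [
        os.path.join(save_folder, f"{split}.csv")
        for split in ("train", "dev", "test")
    ]
    before = all_outputs_exist(*targets)  # nothing written yet
    for path in targets:
        open(path, "w").close()  # create empty placeholder files
    after = all_outputs_exist(*targets)  # all three now present
```

Because the check requires all three files, a run interrupted after writing only train.csv would be re-processed on restart.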
Text Normalization Pipeline
For each utterance, the text undergoes:
- Unicode normalization via unicode_normalisation()
- Language-specific preprocessing via language_specific_preprocess() -- applies regex-based character filtering and case transformation specific to the language
- Accent stripping (when accented_letters=False) -- removes diacritical marks using Unicode NFD decomposition
- Whitespace normalization -- collapses multiple spaces, strips leading/trailing whitespace
- Minimum length filtering -- discards utterances with fewer than 3 words (or fewer than 3 characters for Chinese/Japanese)
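The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the recipe's code: the function names are hypothetical, the NFKC form and the upper-casing step are placeholders for the real unicode_normalisation() and language_specific_preprocess() logic, and the character-count rule for Chinese/Japanese is omitted.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    """Decompose characters (NFD), then drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks are dropped; so is any character with no ASCII base.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_sketch(text, accented_letters=False, min_words=3):
    """Hypothetical pipeline; returns None for utterances that are too short."""
    text = unicodedata.normalize("NFKC", text)  # assumption: form is illustrative
    text = text.upper()  # assumption: stand-in for language-specific rules
    if not accented_letters:
        text = strip_accents_sketch(text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    if len(text.split(" ")) < min_words:
        return None
    return text


kept = normalize_sketch("  Café   au lait ")
kept_accented = normalize_sketch("  Café   au lait ", accented_letters=True)
dropped = normalize_sketch("hi there")
```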
Parallel Processing
The per-line processing is parallelized using speechbrain.utils.parallel.parallel_map, which distributes the audio metadata reading and text normalization across multiple workers for faster processing of large datasets.
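The general pattern can be illustrated with the standard library's ThreadPoolExecutor, used here as a stand-in for parallel_map, whose signature is not reproduced. The function process_line_sketch and the three-column TSV layout (client_id, path, sentence) are assumptions for illustration; the real process_line also reads audio duration, which is omitted.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_line_sketch(line):
    """Hypothetical per-line worker: parse one TSV line or return None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    # Assumption: the first three columns are client_id, path, sentence.
    client_id, path, sentence = fields[0], fields[1], fields[2]
    if len(sentence.split()) < 3:
        return None  # too short, filtered out
    utt_id = os.path.splitext(path)[0]  # filename without extension
    return {"ID": utt_id, "spk_id": client_id, "wrd": sentence}


lines = [
    "spk_a\tclip_1.mp3\tone two three\n",
    "spk_b\tclip_2.mp3\ttoo short\n",
]

# map() preserves input order, so results line up with the input lines.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_line_sketch, lines))

kept = [row for row in results if row is not None]
```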
Atomic File Writing
Output CSV files are first written to a temporary file (.tmp suffix) and then atomically renamed using os.replace(). This prevents corrupted partial files from being left behind if the process is interrupted.
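The write-then-rename pattern described here can be sketched as follows. The helper name write_csv_atomically and the sample row are hypothetical; only the .tmp suffix and the os.replace() call are taken from the description above.

```python
import csv
import os
import tempfile


def write_csv_atomically(csv_file, header, rows):
    """Hypothetical helper: write to a .tmp file, then rename into place."""
    tmp_file = csv_file + ".tmp"
    with open(tmp_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # os.replace overwrites the destination in a single rename step.
    os.replace(tmp_file, csv_file)


with tempfile.TemporaryDirectory() as save_folder:
    target = os.path.join(save_folder, "train.csv")
    write_csv_atomically(
        target,
        ["ID", "duration", "wav", "spk_id", "wrd"],
        [["sample_0001", 3.5, "/clips/sample_0001.mp3", "spk_a", "ONE TWO THREE"]],
    )
    final_exists = os.path.isfile(target)
    tmp_left_behind = os.path.exists(target + ".tmp")
```

If the process dies during the write, only the .tmp file is affected, so a later existence check on train.csv does not mistake a partial file for a finished one.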
Usage Example
```python
from common_voice_prepare import prepare_common_voice

# Prepare the CommonVoice English dataset
prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="results/commonvoice_ctc/save",
    train_tsv_file="/datasets/CommonVoice/en/train.tsv",
    dev_tsv_file="/datasets/CommonVoice/en/dev.tsv",
    test_tsv_file="/datasets/CommonVoice/en/test.tsv",
    accented_letters=False,
    language="en",
    skip_prep=False,
)
```
Integration in CTC Training Recipe
In the training script, data preparation is run only on the main process in distributed setups:
```python
from speechbrain.utils.distributed import run_on_main
from common_voice_prepare import prepare_common_voice

# hparams is the hyperparameter dictionary loaded by the training script
run_on_main(
    prepare_common_voice,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["save_folder"],
        "train_tsv_file": hparams["train_tsv_file"],
        "dev_tsv_file": hparams["dev_tsv_file"],
        "test_tsv_file": hparams["test_tsv_file"],
        "accented_letters": hparams["accented_letters"],
        "language": hparams["language"],
        "skip_prep": hparams["skip_prep"],
    },
)
```
Internal Helper Functions
| Function | Description |
|---|---|
| process_line(line, ...) | Processes a single TSV line: extracts fields, reads audio duration, normalizes text. Returns a CVRow dataclass or None if the line should be filtered out. |
| create_csv(convert_to_wav, orig_tsv_file, csv_file, data_folder, accented_letters, language) | Creates a single CSV file from one TSV file, processing all lines in parallel. |
| skip(save_csv_train, save_csv_dev, save_csv_test) | Checks if all three output files already exist. |
| check_commonvoice_folders(data_folder) | Validates that the data folder contains a clips/ subdirectory. |
| unicode_normalisation(text) | Applies Unicode normalization to transcription text. |
| strip_accents(text) | Removes diacritical marks using Unicode NFD decomposition and ASCII encoding. |
| language_specific_preprocess(language, words) | Applies language-dependent text normalization rules (character filtering, case handling). |
| convert_mp3_to_wav(audio_mp3_path) | Converts an MP3 file to WAV format using ffmpeg. |
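As an illustration of the folder validation step, a check for the clips/ subdirectory might look like the sketch below. The name check_clips_folder is hypothetical, and the exception type is an assumption; the actual behavior of check_commonvoice_folders on failure is not stated here.

```python
import os
import tempfile


def check_clips_folder(data_folder):
    """Hypothetical stand-in for check_commonvoice_folders."""
    clips_dir = os.path.join(data_folder, "clips")
    if not os.path.isdir(clips_dir):
        # Assumption: the exception type is illustrative only.
        raise FileNotFoundError(
            f"Expected a clips/ subdirectory in {data_folder}"
        )


with tempfile.TemporaryDirectory() as data_folder:
    try:
        check_clips_folder(data_folder)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True  # no clips/ folder yet

    os.mkdir(os.path.join(data_folder, "clips"))
    check_clips_folder(data_folder)  # passes once clips/ exists
    valid_passed = True
```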
Dependencies
- speechbrain.dataio.dataio.read_audio_info -- for reading audio file metadata (duration, sample rate)
- speechbrain.utils.parallel.parallel_map -- for parallelized per-line processing
- ffmpeg (system binary) -- required only when convert_to_wav=True