Implementation:Speechbrain Speechbrain Prepare CommonVoice Seq2Seq
| Knowledge Sources | |
|---|---|
| Domains | Speech Recognition, Data Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the SpeechBrain library, for preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training.
Description
This script builds CSV manifest files from the Mozilla Common Voice dataset for automatic speech recognition (ASR) tasks. It reads the official Common Voice TSV files (train.tsv, dev.tsv, test.tsv), extracts audio metadata such as duration, and can normalize accented letters via Unicode decomposition. It can optionally convert audio to WAV format, and it writes SpeechBrain-compatible CSV files for the train/dev/test splits. The script supports multiple languages and uses parallel processing to speed up preparation.
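The accent normalization mentioned above can be illustrated with a minimal sketch. It assumes the common approach of decomposing each character with NFD and dropping the combining marks; `strip_accents` here is an illustrative helper and may differ in detail from the function in the SpeechBrain script.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Map accented letters to their closest non-accented letters.

    Illustrative helper (an assumption, not the script's exact code):
    NFD splits a character such as "é" into the base letter "e" plus a
    combining accent, and the combining marks (category "Mn") are dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Letters that have no Unicode decomposition, such as "ø" or "ß", pass through this sketch unchanged, so a language-specific mapping would still be needed for them.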
Usage
Use this when preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/CommonVoice/ASR/seq2seq/common_voice_prepare.py
Signature
def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):
Import
from common_voice_prepare import prepare_common_voice
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the original Common Voice dataset is stored (should include the language: /datasets/CommonVoice/<language>/) |
| save_folder | str | Yes | Directory in which the output CSV files are stored |
| train_tsv_file | str | No | Path to the train Common Voice .tsv file (default: None, which falls back to train.tsv inside data_folder) |
| dev_tsv_file | str | No | Path to the dev Common Voice .tsv file (default: None, which falls back to dev.tsv inside data_folder) |
| test_tsv_file | str | No | Path to the test Common Voice .tsv file (default: None, which falls back to test.tsv inside data_folder) |
| accented_letters | bool | No | If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False) |
| language | str | No | Language code for the dataset (default: "en") |
| skip_prep | bool | No | If True, skip data preparation entirely (default: False) |
| convert_to_wav | bool | No | If True, convert MP3 audio files to WAV format (default: False) |
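The handling of the three optional TSV arguments can be sketched as follows. This assumes that leaving an argument unset means falling back to the standard file name inside data_folder; `resolve_tsv_paths` is a hypothetical helper written for illustration and is not part of the SpeechBrain script.

```python
import os


def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Return the TSV path to use for each split.

    Hypothetical helper: any path left as None is assumed to fall back to
    <data_folder>/<split>.tsv, the layout of an extracted Common Voice release.
    """
    requested = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    return {
        split: path if path is not None else os.path.join(data_folder, f"{split}.tsv")
        for split, path in requested.items()
    }
```

Passing an explicit path for one split only, for example a custom dev file, leaves the other two splits on their default locations.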
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV File | Train split manifest with utterance IDs, file paths, durations, and transcriptions |
| dev.csv | CSV File | Development/validation split manifest |
| test.csv | CSV File | Test split manifest |
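A quick sanity check on a generated manifest might look like the sketch below, which counts utterances and sums their durations. The column name `duration` is an assumption about the CSV header (the table above only states that durations are included), and `summarize_manifest` is an illustrative helper rather than part of SpeechBrain.

```python
import csv


def summarize_manifest(lines):
    """Count utterances and total audio seconds in a manifest.

    `lines` is any iterable of CSV text lines, such as an open file.
    Assumes a header row with a "duration" column (in seconds).
    """
    count = 0
    total_seconds = 0.0
    for row in csv.DictReader(lines):
        count += 1
        total_seconds += float(row["duration"])
    return count, total_seconds


# Example call on a prepared split (path taken from the usage example):
# with open("/output/commonvoice_prepared/train.csv", newline="", encoding="utf-8") as f:
#     n_utts, seconds = summarize_manifest(f)
```

If the header uses a different name for the duration column, the lookup key in the loop needs to be adjusted accordingly.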
Usage Examples
from common_voice_prepare import prepare_common_voice

prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="/output/commonvoice_prepared",
    accented_letters=False,
    language="en",
    skip_prep=False,
)
Related Pages
- Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_Transducer -- Same script used for transducer recipe
- Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_LM -- Same script used for language model recipe
- Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_SSL -- Same script used for self-supervised learning recipe
- Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation