Implementation:Speechbrain Speechbrain Prepare CommonVoice Seq2Seq

Knowledge Sources	SpeechBrain
Domains	Speech Recognition, Data Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training.

Description

This script prepares CSV manifest files from the Mozilla Common Voice dataset for automatic speech recognition tasks. It reads the official Common Voice TSV files (train.tsv, dev.tsv, test.tsv) and processes the audio metadata, including duration information. Accented letters can be kept or normalized via Unicode decomposition, and MP3 audio can optionally be converted to WAV. The output is one SpeechBrain-compatible CSV file for each of the train, dev, and test splits. The script supports multiple languages and processes entries in parallel.
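The accent normalization mentioned above can be illustrated with a minimal sketch built on Python's standard `unicodedata` module. The helper names `strip_accents` and `normalize_transcript` are hypothetical and chosen for illustration; the actual script may implement this step differently.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Replace accented letters with their closest unaccented base letters."""
    # NFKD decomposition splits a precomposed character such as "é"
    # into the base letter "e" followed by a combining accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_transcript(words: str, accented_letters: bool = False) -> str:
    """Hypothetical wrapper mirroring the accented_letters flag."""
    # Assumption: True keeps the text unchanged, False strips accents.
    return words if accented_letters else strip_accents(words)


print(normalize_transcript("café élève"))
print(normalize_transcript("café élève", accented_letters=True))
```

Letters without a canonical decomposition (for example "ø") pass through this sketch unchanged, so a real preparation script may need language-specific rules on top of it.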

Usage

Use this when preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/CommonVoice/ASR/seq2seq/common_voice_prepare.py

Signature

def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):

Import

from common_voice_prepare import prepare_common_voice

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the original Common Voice dataset is stored (should include the language: /datasets/CommonVoice/<language>/)
save_folder	str	Yes	Directory where the output CSV files are stored
train_tsv_file	str	No	Path to the train Common Voice .tsv file (default: auto-detected)
dev_tsv_file	str	No	Path to the dev Common Voice .tsv file (default: auto-detected)
test_tsv_file	str	No	Path to the test Common Voice .tsv file (default: auto-detected)
accented_letters	bool	No	If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False)
language	str	No	Language code for the dataset (default: "en")
skip_prep	bool	No	If True, skip data preparation entirely (default: False)
convert_to_wav	bool	No	If True, convert MP3 audio files to WAV format (default: False)
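
As a rough illustration of what "auto-detected" could mean for the three TSV arguments, the sketch below falls back to `<data_folder>/<split>.tsv` whenever a path is left as None. The helper `resolve_tsv_paths` is hypothetical, and the fallback rule is an assumption based on the file names listed in the Description, not a confirmed detail of the script.

```python
import os


def resolve_tsv_paths(
    data_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
):
    """Hypothetical helper: fill in missing TSV paths from data_folder."""
    explicit = {
        "train": train_tsv_file,
        "dev": dev_tsv_file,
        "test": test_tsv_file,
    }
    # Assumption: an omitted path defaults to <data_folder>/<split>.tsv.
    return {
        split: path if path is not None else os.path.join(data_folder, f"{split}.tsv")
        for split, path in explicit.items()
    }


paths = resolve_tsv_paths("/datasets/CommonVoice/en", dev_tsv_file="/tmp/custom_dev.tsv")
for split, path in paths.items():
    print(split, path)
```

Passing an explicit path, as with `dev_tsv_file` here, is useful when a custom split should replace the official one.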

Outputs

Name	Type	Description
train.csv	CSV File	Train split manifest with utterance IDs, file paths, durations, and transcriptions
dev.csv	CSV File	Development/validation split manifest
test.csv	CSV File	Test split manifest
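
A generated manifest can be inspected with the standard `csv` module. The sketch below parses an inline sample instead of a real file, and the column names (`ID`, `duration`, `wav`, `wrd`) are assumptions made for illustration; check the header of the actual CSV before relying on them.

```python
import csv
import io

# Inline sample standing in for a generated manifest.
# Column names are assumed, not taken from the script.
SAMPLE_MANIFEST = """ID,duration,wav,wrd
utt_0001,3.5,/data/clips/utt_0001.mp3,HELLO WORLD
utt_0002,2.0,/data/clips/utt_0002.mp3,GOOD MORNING
"""


def summarize_manifest(handle):
    """Return the number of utterances and their total duration in seconds."""
    rows = list(csv.DictReader(handle))
    total = sum(float(row["duration"]) for row in rows)
    return len(rows), total


count, total_seconds = summarize_manifest(io.StringIO(SAMPLE_MANIFEST))
print(f"{count} utterances, {total_seconds:.1f} s of audio")
```

To summarize a real manifest, pass an open file handle, for example `open("train.csv", newline="", encoding="utf-8")`, in place of the `io.StringIO` object.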

Usage Examples

from common_voice_prepare import prepare_common_voice

prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="/output/commonvoice_prepared",
    accented_letters=False,
    language="en",
    skip_prep=False,
)

Related Pages

Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_Transducer -- Same script used for transducer recipe
Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_LM -- Same script used for language model recipe
Implementation:Speechbrain_Speechbrain_Prepare_CommonVoice_SSL -- Same script used for self-supervised learning recipe
Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
