Implementation:Speechbrain Speechbrain Prepare CommonVoice Seq2Seq

From Leeroopedia


Knowledge Sources
Domains: Speech Recognition, Data Preparation
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training.

Description

This script prepares CSV manifest files from the Mozilla Common Voice dataset for automatic speech recognition tasks. It reads the official Common Voice TSV files (train.tsv, dev.tsv, test.tsv), collects audio metadata including duration, and generates SpeechBrain-compatible CSV files for the train/dev/test splits. Unless accented_letters is set, accented letters are normalized to their closest non-accented form via Unicode decomposition. MP3 audio can optionally be converted to WAV. The script supports multiple languages and uses parallel processing to speed up preparation.
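The accent normalization mentioned above can be illustrated with a minimal sketch built on Python's standard unicodedata module. The helper name remove_accents is hypothetical, and the script's own routine may differ in detail (for example, in how it treats characters that have no plain base letter).

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD decomposition splits each accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(remove_accents("café déjà vu"))  # cafe deja vu
```

Letters without a decomposition (such as "ø") pass through this sketch unchanged, which is one reason a production routine might take a different approach.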

Usage

Use this when preparing the Mozilla Common Voice dataset for sequence-to-sequence ASR training with SpeechBrain recipes.

Code Reference

Source Location

Signature

def prepare_common_voice(
    data_folder,
    save_folder,
    train_tsv_file=None,
    dev_tsv_file=None,
    test_tsv_file=None,
    accented_letters=False,
    language="en",
    skip_prep=False,
    convert_to_wav=False,
):

Import

from common_voice_prepare import prepare_common_voice

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| data_folder | str | Yes | Path to the folder where the original Common Voice dataset is stored; it should include the language, e.g. /datasets/CommonVoice/<language>/ |
| save_folder | str | Yes | Directory in which the output CSV files are stored |
| train_tsv_file | str | No | Path to the train Common Voice .tsv file (default: auto-detected) |
| dev_tsv_file | str | No | Path to the dev Common Voice .tsv file (default: auto-detected) |
| test_tsv_file | str | No | Path to the test Common Voice .tsv file (default: auto-detected) |
| accented_letters | bool | No | If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False) |
| language | str | No | Language code for the dataset (default: "en") |
| skip_prep | bool | No | If True, skip data preparation entirely (default: False) |
| convert_to_wav | bool | No | If True, convert MP3 audio files to WAV format (default: False) |
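As a rough illustration of what "auto-detected" could mean for the three TSV arguments, the sketch below fills in any path left as None with a file named after the split inside data_folder. Both the helper resolve_tsv_paths and the <data_folder>/<split>.tsv layout are assumptions based on the file names listed in the Description, not a statement of the script's actual logic.

```python
from pathlib import Path


def resolve_tsv_paths(data_folder, train_tsv_file=None, dev_tsv_file=None, test_tsv_file=None):
    """Fill in any split path that was left as None (hypothetical helper)."""
    root = Path(data_folder)
    given = {"train": train_tsv_file, "dev": dev_tsv_file, "test": test_tsv_file}
    # Assumption: an unset path defaults to <data_folder>/<split>.tsv.
    return {
        split: Path(path) if path is not None else root / f"{split}.tsv"
        for split, path in given.items()
    }


paths = resolve_tsv_paths("/datasets/CommonVoice/en", dev_tsv_file="/tmp/custom_dev.tsv")
for split, path in paths.items():
    print(split, path)
```

Passing an explicit path for one split, as with dev_tsv_file here, overrides only that split and leaves the other two at their defaults.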

Outputs

| Name | Type | Description |
| --- | --- | --- |
| train.csv | CSV File | Train split manifest with utterance IDs, file paths, durations, and transcriptions |
| dev.csv | CSV File | Development/validation split manifest |
| test.csv | CSV File | Test split manifest |
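A generated manifest can be inspected with Python's standard csv module. The sketch below sums per-utterance durations to report the size of a split in hours. The column name "duration", the other column names, and the sample rows are assumptions made for illustration; check the header of the actual CSV before relying on them.

```python
import csv
import io


def total_hours(csv_file, duration_column="duration"):
    """Sum per-utterance durations (assumed to be in seconds) and return hours."""
    reader = csv.DictReader(csv_file)
    seconds = sum(float(row[duration_column]) for row in reader)
    return seconds / 3600.0


# Tiny in-memory stand-in for a generated manifest; columns and rows are made up.
sample = io.StringIO(
    "ID,duration,wav,wrd\n"
    "utt1,1800.0,/data/clips/utt1.mp3,HELLO WORLD\n"
    "utt2,1800.0,/data/clips/utt2.mp3,GOOD MORNING\n"
)
print(total_hours(sample))  # 1.0
```

For a real manifest, open the file with open(path, newline="", encoding="utf-8") and pass the file object to total_hours in place of the in-memory sample.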

Usage Examples

from common_voice_prepare import prepare_common_voice

prepare_common_voice(
    data_folder="/datasets/CommonVoice/en",
    save_folder="/output/commonvoice_prepared",
    accented_letters=False,
    language="en",
    skip_prep=False,
)
