Implementation: SpeechBrain Prepare CommonLanguage
| Knowledge Sources | |
|---|---|
| Domains | Language Identification, Data Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool from the SpeechBrain library for preparing the CommonLanguage dataset for language identification.
Description
This script prepares CSV manifest files from the CommonLanguage dataset for spoken language identification (LID) tasks. It processes audio files across 45 languages (including Arabic, Basque, Catalan, Chinese variants, English, French, German, and many others from the Common Voice ecosystem), reads audio durations via torchaudio, and generates train/dev/test CSV files with utterance metadata. The dataset is sourced from Zenodo and uses a predefined list of supported languages.
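The per-utterance duration lookup can be sketched with the standard-library wave module; note this is a minimal illustration only, and `read_wav_duration` is a hypothetical helper name, since the actual recipe reads durations via torchaudio:

```python
import wave

def read_wav_duration(path):
    """Return the duration of a WAV file in seconds.

    Hypothetical helper for illustration; the SpeechBrain script itself
    uses torchaudio rather than the wave module.
    """
    with wave.open(path, "rb") as f:
        # duration = number of audio frames / sampling rate
        return f.getnframes() / float(f.getframerate())
```

The same quantity (frames divided by sample rate) is what ends up in the duration column of the generated CSV files.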
Usage
Use this when preparing the CommonLanguage dataset for language identification training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/CommonLanguage/common_language_prepare.py
Signature
def prepare_common_language(data_folder, save_folder, skip_prep=False):
Import
from common_language_prepare import prepare_common_language
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the CommonLanguage dataset is stored, i.e., the folder containing the per-language data (e.g., /datasets/CommonLanguage) |
| save_folder | str | Yes | The directory where to store the output CSV files |
| skip_prep | bool | No | If True, skip data preparation entirely (default: False) |
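A caller might validate these inputs before invoking the preparation function; the sketch below is an assumption about reasonable pre-flight checks, not part of the script itself (`check_prepare_args` is a hypothetical helper):

```python
import os

def check_prepare_args(data_folder, save_folder):
    """Hypothetical pre-flight check before data preparation.

    Verifies that the dataset folder exists and creates the output
    folder if needed; not part of common_language_prepare.py.
    """
    if not os.path.isdir(data_folder):
        raise FileNotFoundError(
            f"CommonLanguage data folder not found: {data_folder}"
        )
    # The CSV files are written here, so make sure it exists.
    os.makedirs(save_folder, exist_ok=True)
```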
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV File | Train split manifest with utterance IDs, file paths, durations, and language labels |
| dev.csv | CSV File | Development/validation split manifest |
| test.csv | CSV File | Test split manifest |
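Downstream code can consume these manifests with an ordinary CSV reader. The column names below (ID, duration, wav, language) are an assumption based on typical SpeechBrain manifests; the exact headers and paths in the generated files may differ:

```python
import csv
import io

# Hypothetical manifest excerpt; real column names and file paths in the
# generated train/dev/test CSVs may differ from this sketch.
sample = """ID,duration,wav,language
utt_0001,3.42,/datasets/CommonLanguage/english/clips/utt_0001.wav,English
utt_0002,2.87,/datasets/CommonLanguage/french/clips/utt_0002.wav,French
"""

def load_manifest(fileobj):
    """Parse a CSV manifest into a list of per-utterance row dicts."""
    return list(csv.DictReader(fileobj))

rows = load_manifest(io.StringIO(sample))
```

Each row then carries the utterance ID, its duration in seconds, the audio path, and the language label used as the LID target.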
Usage Examples
from common_language_prepare import prepare_common_language

prepare_common_language(
    data_folder="/datasets/CommonLanguage",
    save_folder="exp/CommonLanguage_exp",
    skip_prep=False,
)