Implementation:Speechbrain Speechbrain Prepare DVoice
| Knowledge Sources | |
|---|---|
| Domains | Speech Recognition, Data Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the SpeechBrain library, for preparing the DVoice dataset for ASR training.
Description
This script builds CSV manifest files from the DVoice dataset, a multilingual speech corpus focused on African languages (Fongbe among them) that is hosted on Zenodo. It reads the DVoice directory structure, in which text transcription files are organized into train/dev/test splits; processes audio metadata, including duration; applies Unicode normalization and accented-letter handling; and writes SpeechBrain-compatible CSV files for model training. The language is configurable, and preparation can be skipped entirely.
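The Unicode normalization and accented-letter handling mentioned above can be illustrated with the standard library alone. This is a minimal sketch, not the script's actual code: `strip_accents` and `normalize_transcript` are hypothetical helpers, and the choice of NFC/NFD forms is an assumption.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Map each accented letter to its closest unaccented base letter."""
    # NFD splits a precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"); keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_transcript(text: str, accented_letters: bool = False) -> str:
    """Hypothetical helper: NFC-normalize, optionally strip accents, tidy spaces."""
    # NFC gives every character a single canonical composed form.
    text = unicodedata.normalize("NFC", text)
    if not accented_letters:
        text = strip_accents(text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())
```

Letters that have no canonical decomposition (for example ɔ or ɛ) pass through unchanged; only the combining diacritics attached to them are removed.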
Usage
Use this when preparing the DVoice dataset for automatic speech recognition training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/DVoice/dvoice_prepare.py
Signature
def prepare_dvoice(
    data_folder,
    save_folder,
    train_csv_file=None,
    dev_csv_file=None,
    test_csv_file=None,
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
):
Import
from dvoice_prepare import prepare_dvoice
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the DVoice dataset is stored |
| save_folder | str | Yes | The directory where to store the output CSV files |
| train_csv_file | str | No | Path to the train CSV transcription file (default: data_folder/texts/train.csv) |
| dev_csv_file | str | No | Path to the dev CSV transcription file (default: data_folder/texts/dev.csv) |
| test_csv_file | str | No | Path to the test CSV transcription file (default: data_folder/texts/test.csv) |
| accented_letters | bool | No | If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False) |
| language | str | No | Name of the dataset language to prepare (default: "fongbe") |
| skip_prep | bool | No | If True, skip data preparation entirely (default: False) |
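The defaults listed for the three transcription-file arguments amount to a simple fallback rule. The sketch below restates that rule under the assumption that the table is accurate; `resolve_split_paths` is a hypothetical helper and not part of `dvoice_prepare.py`.

```python
import os


def resolve_split_paths(
    data_folder, train_csv_file=None, dev_csv_file=None, test_csv_file=None
):
    """Hypothetical helper: use <data_folder>/texts/<split>.csv when a path is None."""
    given = {"train": train_csv_file, "dev": dev_csv_file, "test": test_csv_file}
    return {
        # An explicit path always wins; otherwise fall back to the documented default.
        split: path
        if path is not None
        else os.path.join(data_folder, "texts", f"{split}.csv")
        for split, path in given.items()
    }
```

Passing only `dev_csv_file`, for instance, overrides the dev split while train and test keep their default locations.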
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV File | Train split manifest with utterance IDs, file paths, durations, and transcriptions |
| dev.csv | CSV File | Development/validation split manifest |
| test.csv | CSV File | Test split manifest |
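Once the manifests exist, a quick sanity check is to count utterances and sum durations. The sketch below is illustrative only: the column names `ID`, `duration`, `wav`, and `wrd` are assumptions, and the two sample rows are invented placeholders rather than real DVoice data, so check the header of a generated file before relying on it.

```python
import csv
import io

# Invented placeholder rows with ASSUMED column names (not real DVoice data).
SAMPLE_MANIFEST = """ID,duration,wav,wrd
utt_0001,2.50,/path/to/utt_0001.wav,FIRST PLACEHOLDER TRANSCRIPT
utt_0002,4.25,/path/to/utt_0002.wav,SECOND PLACEHOLDER TRANSCRIPT
"""


def summarize_manifest(csv_text: str) -> dict:
    """Count the rows of a manifest and sum its duration column (in seconds)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total_seconds = sum(float(row["duration"]) for row in rows)
    return {"utterances": len(rows), "total_seconds": total_seconds}


summary = summarize_manifest(SAMPLE_MANIFEST)
print(summary)
```

For a file on disk, read its contents with `open(path, newline="", encoding="utf-8")` and pass the text to the same function.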
Usage Examples
from dvoice_prepare import prepare_dvoice

prepare_dvoice(
    data_folder="/datasets/DVoice/fongbe",
    save_folder="/output/dvoice_prepared",
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
)
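One possible way to choose the `skip_prep` value is to check whether the three output manifests are already present. This is a sketch under the assumption that the outputs are named `train.csv`, `dev.csv`, and `test.csv` inside `save_folder`, as listed in the Outputs table; `manifests_exist` is a hypothetical helper, not a SpeechBrain function.

```python
import os


def manifests_exist(save_folder, splits=("train", "dev", "test")):
    """Hypothetical helper: True only if every <split>.csv is already in save_folder."""
    return all(
        os.path.isfile(os.path.join(save_folder, f"{split}.csv")) for split in splits
    )
```

The result could then be passed as `skip_prep=manifests_exist("/output/dvoice_prepared")`, so that a rerun does not repeat the preparation when all three files are already in place.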