Implementation:Speechbrain Speechbrain Prepare DVoice

Knowledge Sources	SpeechBrain
Domains	Speech Recognition, Data Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the DVoice dataset for ASR training.

Description

This script builds CSV manifest files from the DVoice dataset, a multilingual speech corpus focused on African languages (Fongbe among them) and hosted on Zenodo. It reads the DVoice directory structure, in which text transcription files are organized into train/dev/test splits; processes audio metadata, including duration; applies Unicode normalization and accented-letter handling; and writes SpeechBrain-compatible CSV files for model training. The language is configurable, and preparation can be skipped entirely.
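The accented-letter handling described above can be illustrated with a minimal sketch. This is not the script's actual code: the helper name `normalize_text` is hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is an assumption about one common way to map accented letters to their closest non-accented form.

```python
import unicodedata


def normalize_text(text, accented_letters=False):
    """Hypothetical helper: normalize Unicode and optionally strip accents."""
    # Compose first so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    if accented_letters:
        # Keep accented letters as they are.
        return text
    # Decompose each letter into a base character plus combining marks,
    # then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)
```

Letters that have no Unicode decomposition, such as the open vowels `ɔ` and `ɛ`, pass through this sketch unchanged; only the combining marks placed on them are removed.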

Usage

Use this when preparing the DVoice dataset for automatic speech recognition training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/DVoice/dvoice_prepare.py

Signature

def prepare_dvoice(
    data_folder,
    save_folder,
    train_csv_file=None,
    dev_csv_file=None,
    test_csv_file=None,
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
):

Import

from dvoice_prepare import prepare_dvoice

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the DVoice dataset is stored
save_folder	str	Yes	Directory in which to store the output CSV files
train_csv_file	str	No	Path to the train CSV transcription file (default: data_folder/texts/train.csv)
dev_csv_file	str	No	Path to the dev CSV transcription file (default: data_folder/texts/dev.csv)
test_csv_file	str	No	Path to the test CSV transcription file (default: data_folder/texts/test.csv)
accented_letters	bool	No	If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False)
language	str	No	Language of the dataset, given by name (default: "fongbe")
skip_prep	bool	No	If True, skip data preparation entirely (default: False)
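The default-path behavior listed in the table can be sketched as follows. The helper `resolve_split_paths` is hypothetical and only illustrates the documented fallback to `data_folder/texts/<split>.csv`; the real function may build these paths differently.

```python
import os


def resolve_split_paths(
    data_folder, train_csv_file=None, dev_csv_file=None, test_csv_file=None
):
    """Hypothetical helper: apply the documented defaults for the split files."""
    texts_dir = os.path.join(data_folder, "texts")
    overrides = {
        "train": train_csv_file,
        "dev": dev_csv_file,
        "test": test_csv_file,
    }
    # Fall back to <data_folder>/texts/<split>.csv when no path is given.
    return {
        split: path if path is not None
        else os.path.join(texts_dir, split + ".csv")
        for split, path in overrides.items()
    }
```

An explicit path for one split leaves the other two on their defaults, so a custom dev set can be swapped in without restating the train and test locations.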

Outputs

Name	Type	Description
train.csv	CSV File	Train split manifest with utterance IDs, file paths, durations, and transcriptions
dev.csv	CSV File	Development/validation split manifest
test.csv	CSV File	Test split manifest
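A quick sanity check on a generated manifest might look like the sketch below. The column name `duration` is an assumption chosen for illustration, and `summarize_manifest` is a hypothetical helper; check the header of the CSV files actually produced before relying on it.

```python
import csv


def summarize_manifest(lines, duration_column="duration"):
    """Hypothetical helper: count utterances and sum their durations.

    `lines` is any iterable of CSV text lines, for example an open file.
    """
    reader = csv.DictReader(lines)
    count = 0
    total_seconds = 0.0
    for row in reader:
        count += 1
        total_seconds += float(row[duration_column])
    return count, total_seconds
```

With a real manifest, the function could be called as `with open("/output/dvoice_prepared/train.csv", newline="") as f: print(summarize_manifest(f))`, where the path is the `save_folder` from the usage example plus the output file name.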

Usage Examples

from dvoice_prepare import prepare_dvoice

prepare_dvoice(
    data_folder="/datasets/DVoice/fongbe",
    save_folder="/output/dvoice_prepared",
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
