Implementation:Speechbrain Speechbrain Prepare DVoice

From Leeroopedia


Knowledge Sources
Domains Speech Recognition, Data Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the DVoice dataset for ASR training.

Description

This script builds CSV manifest files from the DVoice dataset, a multilingual speech corpus hosted on Zenodo that focuses on African languages, including Fongbe. It reads the transcription files for the train, dev, and test splits from the DVoice directory structure, collects audio metadata such as duration, applies Unicode normalization with optional handling of accented letters, and writes SpeechBrain-compatible CSV files for model training. The target language is configurable, and preparation can be skipped entirely with skip_prep.
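
The accented-letter handling described above can be illustrated with Python's standard unicodedata module. This is a minimal sketch, not the script's actual code: the helper names strip_accents and normalize_transcript are hypothetical, and the assumption is that accented_letters=False means mapping each accented letter to its closest unaccented base letter.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Map accented letters to their unaccented base letter (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_transcript(text: str, accented_letters: bool = False) -> str:
    """Normalize a transcript, optionally removing accents (assumed behaviour)."""
    if accented_letters:
        # Keep the accents, but store them in a single composed form.
        return unicodedata.normalize("NFC", text)
    return strip_accents(text)
```

Letters such as ɛ and ɔ have no Unicode decomposition, so this approach leaves them unchanged and removes only the combining marks attached to them.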

Usage

Use this when preparing the DVoice dataset for automatic speech recognition training with SpeechBrain recipes.

Code Reference

Source Location

Signature

def prepare_dvoice(
    data_folder,
    save_folder,
    train_csv_file=None,
    dev_csv_file=None,
    test_csv_file=None,
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
):

Import

from dvoice_prepare import prepare_dvoice

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | str | Yes | Path to the folder where the DVoice dataset is stored
save_folder | str | Yes | Directory in which the output CSV files are stored
train_csv_file | str | No | Path to the train CSV transcription file (default: None, which resolves to data_folder/texts/train.csv)
dev_csv_file | str | No | Path to the dev CSV transcription file (default: None, which resolves to data_folder/texts/dev.csv)
test_csv_file | str | No | Path to the test CSV transcription file (default: None, which resolves to data_folder/texts/test.csv)
accented_letters | bool | No | If True, keep accented letters as-is; if False, normalize them to the closest non-accented letters (default: False)
language | str | No | Language code for the dataset (default: "fongbe")
skip_prep | bool | No | If True, skip data preparation entirely (default: False)
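
The way the three optional CSV arguments fall back to their documented defaults can be sketched as follows. The helper resolve_split_paths is hypothetical and only mirrors the defaults listed in the table; how the script itself joins the paths is not shown on this page.

```python
import os


def resolve_split_paths(data_folder, train_csv_file=None, dev_csv_file=None, test_csv_file=None):
    """Fill in unspecified split files with the documented defaults (hypothetical helper)."""
    given = {"train": train_csv_file, "dev": dev_csv_file, "test": test_csv_file}
    resolved = {}
    for split, path in given.items():
        if path is None:
            # Assumed default layout: <data_folder>/texts/<split>.csv
            path = os.path.join(data_folder, "texts", f"{split}.csv")
        resolved[split] = path
    return resolved
```

Passing only dev_csv_file, for example, overrides that one split and leaves the train and test paths at their defaults.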

Outputs

Name | Type | Description
train.csv | CSV File | Train split manifest with utterance IDs, file paths, durations, and transcriptions
dev.csv | CSV File | Development/validation split manifest
test.csv | CSV File | Test split manifest
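
A quick sanity check on a generated manifest might look like the sketch below. The function summarize_manifest is hypothetical, and both the column name "duration" and the assumption that durations are stored in seconds should be verified against the header of the actual output files.

```python
import csv


def summarize_manifest(csv_path, duration_column="duration"):
    """Return (utterance count, summed duration) for one manifest (hypothetical helper)."""
    count = 0
    total_duration = 0.0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader uses the first row as the header, so columns are read by name.
        for row in csv.DictReader(handle):
            count += 1
            # Assumption: the duration column holds a plain number, presumably seconds.
            total_duration += float(row[duration_column])
    return count, total_duration
```

Running this on each of train.csv, dev.csv, and test.csv gives a fast check that the splits are non-empty and that their total durations look plausible.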

Usage Examples

from dvoice_prepare import prepare_dvoice

prepare_dvoice(
    data_folder="/datasets/DVoice/fongbe",
    save_folder="/output/dvoice_prepared",
    accented_letters=False,
    language="fongbe",
    skip_prep=False,
)
