Implementation:Speechbrain Speechbrain Prepare Voicebank

From Leeroopedia


Property Value
Implementation Name Prepare_Voicebank
API prepare_voicebank(data_folder, save_folder, valid_speaker_count=2, skip_prep=False)
Source File recipes/Voicebank/enhance/MetricGAN-U/voicebank_prepare.py (L157-254)
Import from voicebank_prepare import prepare_voicebank
Type API Doc
Workflow Speech_Enhancement_Training
Domains Data_Engineering, Speech_Enhancement
Related Principle Principle:Speechbrain_Speechbrain_Noisy_Speech_Data_Preparation

Purpose

The prepare_voicebank function transforms the raw Voicebank-DEMAND dataset directory structure into structured JSON manifest files suitable for SpeechBrain's DynamicItemDataset. It handles speaker-based train/validation splitting, file discovery, duration extraction, phoneme labeling via lexicon lookup, and idempotent output generation.

Function Signature

def prepare_voicebank(
    data_folder,
    save_folder,
    valid_speaker_count=2,
    skip_prep=False
):
    """
    Prepares the json files for the Voicebank dataset.

    Arguments
    ---------
    data_folder : str
        Path to the folder where the original Voicebank dataset is stored.
    save_folder : str
        The directory where to store the json files.
    valid_speaker_count : int
        The number of validation speakers to use (out of 28 in train set).
    skip_prep : bool
        If True, skip data preparation.

    Returns
    -------
    None
    """

Parameters

Parameter Type Default Description
data_folder str (required) Root directory of the Voicebank-DEMAND dataset containing subdirectories: clean_trainset_28spk_wav_16k, noisy_trainset_28spk_wav_16k, trainset_28spk_txt, clean_testset_wav_16k, noisy_testset_wav_16k, testset_txt
save_folder str (required) Output directory for the generated JSON manifest files
valid_speaker_count int 2 Number of speakers from the 28 training speakers to hold out for validation
skip_prep bool False If True, skip preparation entirely (for resuming experiments)
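The expected layout under data_folder can be checked up front, mirroring what the recipe's internal check_voicebank_folders() does. A minimal sketch (the helper name check_folders is illustrative, not the recipe's actual function):

```python
import os

# Subdirectories the recipe expects under data_folder (16 kHz versions)
EXPECTED_DIRS = [
    "clean_trainset_28spk_wav_16k",
    "noisy_trainset_28spk_wav_16k",
    "trainset_28spk_txt",
    "clean_testset_wav_16k",
    "noisy_testset_wav_16k",
    "testset_txt",
]


def check_folders(data_folder):
    """Return the expected subdirectories that are missing from data_folder."""
    return [
        d for d in EXPECTED_DIRS
        if not os.path.isdir(os.path.join(data_folder, d))
    ]
```

Running this before prepare_voicebank gives a clearer error than a mid-preparation failure.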

Outputs

The function generates three JSON manifest files:

File Description Typical Size
train.json Training utterances from speakers not in validation set ~10,000 utterances (26 speakers)
valid.json Validation utterances from held-out speakers ~800 utterances (2 speakers)
test.json Test utterances from separate test speakers ~824 utterances

Each entry in the JSON files has the following structure:

{
    "p232_001": {
        "noisy_wav": "{data_root}/noisy_trainset_28spk_wav_16k/p232_001.wav",
        "clean_wav": "{data_root}/clean_trainset_28spk_wav_16k/p232_001.wav",
        "length": 3.45,
        "words": "PLEASE CALL STELLA",
        "phones": "P L IY Z K AO L S T EH L AH"
    }
}
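The {data_root} placeholder keeps the manifests portable across machines. SpeechBrain substitutes it through the replacements argument of DynamicItemDataset.from_json, but the expansion itself is just a string replace; a minimal illustration (function name hypothetical):

```python
def expand_manifest(manifest, data_root):
    """Substitute the {data_root} placeholder in every wav path of a manifest dict."""
    expanded = {}
    for utt_id, entry in manifest.items():
        entry = dict(entry)  # copy so the original manifest stays untouched
        for key in ("noisy_wav", "clean_wav"):
            entry[key] = entry[key].replace("{data_root}", data_root)
        expanded[utt_id] = entry
    return expanded
```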

Internal Processing Steps

The function performs these steps in sequence:

  1. Skip check: If skip_prep=True or all output files already exist, return immediately
  2. Folder validation: Verify all expected subdirectories exist via check_voicebank_folders()
  3. Lexicon creation: Download the LibriSpeech lexicon and build a word-to-phoneme mapping via create_lexicon()
  4. Speaker-based splitting: Use the first valid_speaker_count speakers from the predefined TRAIN_SPEAKERS list as validation speakers
  5. File collection: Use get_all_files() with speaker-based filtering:
    • Training: all .wav files in noisy trainset, excluding validation speakers
    • Validation: all .wav files in noisy trainset, matching validation speakers
    • Test: all .wav files in the noisy test set
  6. JSON creation: For each utterance, read audio to compute duration, look up phonemes, and write structured JSON

Usage Examples

Basic Usage from a Training Recipe

from voicebank_prepare import prepare_voicebank
from speechbrain.utils.distributed import run_on_main

# Prepare data (only runs on main process in DDP)
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": "/data/noisy-vctk-16k",
        "save_folder": "/data/noisy-vctk-16k",
        "skip_prep": False,
    },
)

Custom Validation Split

from voicebank_prepare import prepare_voicebank

# Use 4 speakers for validation instead of default 2
prepare_voicebank(
    data_folder="/data/noisy-vctk-16k",
    save_folder="results/experiment_01",
    valid_speaker_count=4,
    skip_prep=False,
)

Integration with HyperPyYAML Config

import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from voicebank_prepare import prepare_voicebank
from speechbrain.utils.distributed import run_on_main

# Load hyperparameters
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file, encoding="utf-8") as fin:
    hparams = load_hyperpyyaml(fin, overrides)

# Prepare data using hparams
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["output_folder"],
        "skip_prep": hparams["skip_prep"],
    },
)

# Load prepared data into DynamicItemDataset
train_data = sb.dataio.dataset.DynamicItemDataset.from_json(
    json_path=hparams["train_annotation"],
    replacements={"data_root": hparams["data_folder"]},
)

Key Internal Functions

create_json

def create_json(wav_lst, json_file, clean_folder, txt_folder, lexicon):
    """Creates the json file given a list of wav files."""
    # read_audio is speechbrain.dataio.dataio.read_audio; SAMPLERATE = 16000
    json_dict = {}
    for wav_file in wav_lst:
        noisy_path, filename = os.path.split(wav_file)
        _, noisy_dir = os.path.split(noisy_path)
        _, clean_dir = os.path.split(clean_folder)
        noisy_rel_path = os.path.join("{data_root}", noisy_dir, filename)
        clean_rel_path = os.path.join("{data_root}", clean_dir, filename)

        signal = read_audio(wav_file)
        duration = signal.shape[0] / SAMPLERATE

        snt_id = filename.replace(".wav", "")
        # ... words read from the txt_folder transcript and phonemes looked
        # up in the lexicon, producing word_string and phone_string ...
        json_dict[snt_id] = {
            "noisy_wav": noisy_rel_path,
            "clean_wav": clean_rel_path,
            "length": duration,
            "words": word_string,
            "phones": phone_string,
        }
    with open(json_file, mode="w", encoding="utf-8") as json_f:
        json.dump(json_dict, json_f, indent=2)

Speaker List

The predefined training speakers are:

TRAIN_SPEAKERS = [
    "p226", "p287", "p227", "p228", "p230", "p231", "p233", "p236",
    "p239", "p243", "p244", "p250", "p254", "p256", "p258", "p259",
    "p267", "p268", "p269", "p270", "p273", "p274", "p276", "p277",
    "p278", "p279", "p282", "p286",
]

With the default valid_speaker_count=2, speakers p226 and p287 are assigned to validation.
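Because validation speakers are taken from the front of this list, the split is just a prefix slice. A minimal sketch:

```python
TRAIN_SPEAKERS = [
    "p226", "p287", "p227", "p228", "p230", "p231", "p233", "p236",
    "p239", "p243", "p244", "p250", "p254", "p256", "p258", "p259",
    "p267", "p268", "p269", "p270", "p273", "p274", "p276", "p277",
    "p278", "p279", "p282", "p286",
]


def split_speakers(valid_speaker_count=2):
    """First valid_speaker_count entries become validation speakers; the rest train."""
    valid = set(TRAIN_SPEAKERS[:valid_speaker_count])
    train = [s for s in TRAIN_SPEAKERS if s not in valid]
    return train, valid
```

Note the list is not sorted numerically, so increasing valid_speaker_count pulls in p227, p228, and so on in list order, not by speaker id.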

Edge Cases and Notes

  • Idempotency: The skip() function checks if all output JSON files already exist. If so, preparation is skipped entirely. This makes the function safe to call repeatedly.
  • DDP compatibility: The function is typically called via run_on_main(), ensuring only the main process performs data preparation in distributed training scenarios.
  • Missing lexicon entries: A MISSING_LEXICON dictionary provides phoneme entries for words not found in the standard LibriSpeech lexicon, handling edge cases in the VCTK transcriptions.
  • Sample rate assumption: The constant SAMPLERATE = 16000 is used for duration calculation. The raw data must already be resampled to 16 kHz (the download_vctk() utility handles this).
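The idempotency check in the first bullet amounts to testing for all three manifests in save_folder. A minimal sketch of that logic (the recipe's own skip() helper may differ in detail):

```python
import os


def skip(save_folder, splits=("train", "valid", "test")):
    """Return True when every expected JSON manifest already exists."""
    return all(
        os.path.isfile(os.path.join(save_folder, f"{split}.json"))
        for split in splits
    )
```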
