Implementation:Speechbrain Speechbrain Prepare Voicebank

From Leeroopedia


Property Value
Implementation Name Prepare_Voicebank
API prepare_voicebank(data_folder, save_folder, valid_speaker_count=2, skip_prep=False)
Source File recipes/Voicebank/enhance/MetricGAN-U/voicebank_prepare.py (L157-254)
Import from voicebank_prepare import prepare_voicebank
Type API Doc
Workflow Speech_Enhancement_Training
Domains Data_Engineering, Speech_Enhancement
Related Principle Principle:Speechbrain_Speechbrain_Noisy_Speech_Data_Preparation

Purpose

The prepare_voicebank function transforms the raw Voicebank-DEMAND dataset directory structure into structured JSON manifest files suitable for SpeechBrain's DynamicItemDataset. It handles speaker-based train/validation splitting, file discovery, duration extraction, phoneme labeling via lexicon lookup, and idempotent output generation.

Function Signature

def prepare_voicebank(
    data_folder,
    save_folder,
    valid_speaker_count=2,
    skip_prep=False
):
    """
    Prepares the json files for the Voicebank dataset.

    Arguments
    ---------
    data_folder : str
        Path to the folder where the original Voicebank dataset is stored.
    save_folder : str
        The directory where to store the json files.
    valid_speaker_count : int
        The number of validation speakers to use (out of 28 in train set).
    skip_prep : bool
        If True, skip data preparation.

    Returns
    -------
    None
    """

Parameters

Parameter Type Default Description
data_folder str (required) Root directory of the Voicebank-DEMAND dataset containing subdirectories: clean_trainset_28spk_wav_16k, noisy_trainset_28spk_wav_16k, trainset_28spk_txt, clean_testset_wav_16k, noisy_testset_wav_16k, testset_txt
save_folder str (required) Output directory for the generated JSON manifest files
valid_speaker_count int 2 Number of speakers from the 28 training speakers to hold out for validation
skip_prep bool False If True, skip preparation entirely (for resuming experiments)
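The expected layout under data_folder can be checked up front, mirroring what the recipe's internal check_voicebank_folders() does. A minimal sketch (the helper name check_folders is illustrative, not the recipe's actual function):

```python
import os

# Subdirectories the recipe expects under data_folder (16 kHz versions)
EXPECTED_DIRS = [
    "clean_trainset_28spk_wav_16k",
    "noisy_trainset_28spk_wav_16k",
    "trainset_28spk_txt",
    "clean_testset_wav_16k",
    "noisy_testset_wav_16k",
    "testset_txt",
]


def check_folders(data_folder):
    """Return the expected subdirectories that are missing from data_folder."""
    return [
        d for d in EXPECTED_DIRS
        if not os.path.isdir(os.path.join(data_folder, d))
    ]
```

Running this before prepare_voicebank gives a clearer error than a mid-preparation failure.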

Outputs

The function generates three JSON manifest files:

File Description Typical Size
train.json Training utterances from speakers not in validation set ~10,000 utterances (26 speakers)
valid.json Validation utterances from held-out speakers ~800 utterances (2 speakers)
test.json Test utterances from separate test speakers ~824 utterances

Each entry in the JSON files has the following structure:

{
    "p232_001": {
        "noisy_wav": "{data_root}/noisy_trainset_28spk_wav_16k/p232_001.wav",
        "clean_wav": "{data_root}/clean_trainset_28spk_wav_16k/p232_001.wav",
        "length": 3.45,
        "words": "PLEASE CALL STELLA",
        "phones": "P L IY Z K AO L S T EH L AH"
    }
}
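The {data_root} placeholder keeps the manifests portable across machines. SpeechBrain substitutes it through the replacements argument of DynamicItemDataset.from_json, but the expansion itself is just a string replace; a minimal illustration (function name hypothetical):

```python
def expand_manifest(manifest, data_root):
    """Substitute the {data_root} placeholder in every wav path of a manifest dict."""
    expanded = {}
    for utt_id, entry in manifest.items():
        entry = dict(entry)  # copy so the original manifest stays untouched
        for key in ("noisy_wav", "clean_wav"):
            entry[key] = entry[key].replace("{data_root}", data_root)
        expanded[utt_id] = entry
    return expanded
```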

Internal Processing Steps

The function performs these steps in sequence:

  1. Skip check: If skip_prep=True or all output files already exist, return immediately
  2. Folder validation: Verify all expected subdirectories exist via check_voicebank_folders()
  3. Lexicon creation: Download the LibriSpeech lexicon and build a word-to-phoneme mapping via create_lexicon()
  4. Speaker-based splitting: Use the first valid_speaker_count speakers from the predefined TRAIN_SPEAKERS list as validation speakers
  5. File collection: Use get_all_files() with speaker-based filtering:
    • Training: all .wav files in noisy trainset, excluding validation speakers
    • Validation: all .wav files in noisy trainset, matching validation speakers
    • Test: all .wav files in the noisy test set
  6. JSON creation: For each utterance, read audio to compute duration, look up phonemes, and write structured JSON

Usage Examples

Basic Usage from a Training Recipe

from voicebank_prepare import prepare_voicebank
from speechbrain.utils.distributed import run_on_main

# Prepare data (only runs on main process in DDP)
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": "/data/noisy-vctk-16k",
        "save_folder": "/data/noisy-vctk-16k",
        "skip_prep": False,
    },
)

Custom Validation Split

from voicebank_prepare import prepare_voicebank

# Use 4 speakers for validation instead of default 2
prepare_voicebank(
    data_folder="/data/noisy-vctk-16k",
    save_folder="results/experiment_01",
    valid_speaker_count=4,
    skip_prep=False,
)

Integration with HyperPyYAML Config

import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from voicebank_prepare import prepare_voicebank
from speechbrain.utils.distributed import run_on_main

# Load hyperparameters
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file, encoding="utf-8") as fin:
    hparams = load_hyperpyyaml(fin, overrides)

# Prepare data using hparams
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["output_folder"],
        "skip_prep": hparams["skip_prep"],
    },
)

# Load prepared data into DynamicItemDataset
train_data = sb.dataio.dataset.DynamicItemDataset.from_json(
    json_path=hparams["train_annotation"],
    replacements={"data_root": hparams["data_folder"]},
)

Key Internal Functions

create_json

def create_json(wav_lst, json_file, clean_folder, txt_folder, lexicon):
    """Creates the json file given a list of wav files."""
    # read_audio is speechbrain.dataio.dataio.read_audio; SAMPLERATE = 16000
    json_dict = {}
    for wav_file in wav_lst:
        noisy_path, filename = os.path.split(wav_file)
        _, noisy_dir = os.path.split(noisy_path)
        _, clean_dir = os.path.split(clean_folder)
        noisy_rel_path = os.path.join("{data_root}", noisy_dir, filename)
        clean_rel_path = os.path.join("{data_root}", clean_dir, filename)

        signal = read_audio(wav_file)
        duration = signal.shape[0] / SAMPLERATE

        snt_id = filename.replace(".wav", "")
        # ... words read from the txt_folder transcript and phonemes looked
        # up in the lexicon, producing word_string and phone_string ...
        json_dict[snt_id] = {
            "noisy_wav": noisy_rel_path,
            "clean_wav": clean_rel_path,
            "length": duration,
            "words": word_string,
            "phones": phone_string,
        }
    with open(json_file, mode="w", encoding="utf-8") as json_f:
        json.dump(json_dict, json_f, indent=2)

Speaker List

The predefined training speakers are:

TRAIN_SPEAKERS = [
    "p226", "p287", "p227", "p228", "p230", "p231", "p233", "p236",
    "p239", "p243", "p244", "p250", "p254", "p256", "p258", "p259",
    "p267", "p268", "p269", "p270", "p273", "p274", "p276", "p277",
    "p278", "p279", "p282", "p286",
]

With the default valid_speaker_count=2, speakers p226 and p287 are assigned to validation.
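Because validation speakers are taken from the front of this list, the split is just a prefix slice. A minimal sketch:

```python
TRAIN_SPEAKERS = [
    "p226", "p287", "p227", "p228", "p230", "p231", "p233", "p236",
    "p239", "p243", "p244", "p250", "p254", "p256", "p258", "p259",
    "p267", "p268", "p269", "p270", "p273", "p274", "p276", "p277",
    "p278", "p279", "p282", "p286",
]


def split_speakers(valid_speaker_count=2):
    """First valid_speaker_count entries become validation speakers; the rest train."""
    valid = set(TRAIN_SPEAKERS[:valid_speaker_count])
    train = [s for s in TRAIN_SPEAKERS if s not in valid]
    return train, valid
```

Note the list is not sorted numerically, so increasing valid_speaker_count pulls in p227, p228, and so on in list order, not by speaker id.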

Edge Cases and Notes

  • Idempotency: The skip() function checks if all output JSON files already exist. If so, preparation is skipped entirely. This makes the function safe to call repeatedly.
  • DDP compatibility: The function is typically called via run_on_main(), ensuring only the main process performs data preparation in distributed training scenarios.
  • Missing lexicon entries: A MISSING_LEXICON dictionary provides phoneme entries for words not found in the standard LibriSpeech lexicon, handling edge cases in the VCTK transcriptions.
  • Sample rate assumption: The constant SAMPLERATE = 16000 is used for duration calculation. The raw data must already be resampled to 16 kHz (the download_vctk() utility handles this).
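The idempotency check in the first bullet amounts to testing for all three manifests in save_folder. A minimal sketch of that logic (the recipe's own skip() helper may differ in detail):

```python
import os


def skip(save_folder, splits=("train", "valid", "test")):
    """Return True when every expected JSON manifest already exists."""
    return all(
        os.path.isfile(os.path.join(save_folder, f"{split}.json"))
        for split in splits
    )
```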
