Implementation:Speechbrain Speechbrain Prepare Voicebank
| Property | Value |
|---|---|
| Implementation Name | Prepare_Voicebank |
| API | `prepare_voicebank(data_folder, save_folder, valid_speaker_count=2, skip_prep=False)` |
| Source File | recipes/Voicebank/enhance/MetricGAN-U/voicebank_prepare.py (L157-254) |
| Import | `from voicebank_prepare import prepare_voicebank` |
| Type | API Doc |
| Workflow | Speech_Enhancement_Training |
| Domains | Data_Engineering, Speech_Enhancement |
| Related Principle | Principle:Speechbrain_Speechbrain_Noisy_Speech_Data_Preparation |
Purpose
The prepare_voicebank function transforms the raw Voicebank-DEMAND dataset directory structure into structured JSON manifest files suitable for SpeechBrain's DynamicItemDataset. It handles speaker-based train/validation splitting, file discovery, duration extraction, phoneme labeling via lexicon lookup, and idempotent output generation.
Function Signature
```python
def prepare_voicebank(
    data_folder,
    save_folder,
    valid_speaker_count=2,
    skip_prep=False,
):
    """Prepares the json files for the Voicebank dataset.

    Arguments
    ---------
    data_folder : str
        Path to the folder where the original Voicebank dataset is stored.
    save_folder : str
        The directory where to store the json files.
    valid_speaker_count : int
        The number of validation speakers to use (out of 28 in train set).
    skip_prep : bool
        If True, skip data preparation.

    Returns
    -------
    None
    """
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_folder` | str | (required) | Root directory of the Voicebank-DEMAND dataset containing subdirectories: clean_trainset_28spk_wav_16k, noisy_trainset_28spk_wav_16k, trainset_28spk_txt, clean_testset_wav_16k, noisy_testset_wav_16k, testset_txt |
| `save_folder` | str | (required) | Output directory for the generated JSON manifest files |
| `valid_speaker_count` | int | 2 | Number of speakers from the 28 training speakers to hold out for validation |
| `skip_prep` | bool | False | If True, skip preparation entirely (for resuming experiments) |
Outputs
The function generates three JSON manifest files:
| File | Description | Typical Size |
|---|---|---|
| train.json | Training utterances from speakers not in the validation set | ~10,000 utterances (26 speakers) |
| valid.json | Validation utterances from held-out speakers | ~800 utterances (2 speakers) |
| test.json | Test utterances from separate test speakers | ~824 utterances |
Each entry in the JSON files has the following structure:
```json
{
  "p232_001": {
    "noisy_wav": "{data_root}/noisy_trainset_28spk_wav_16k/p232_001.wav",
    "clean_wav": "{data_root}/clean_trainset_28spk_wav_16k/p232_001.wav",
    "length": 3.45,
    "words": "PLEASE CALL STELLA",
    "phones": "P L IY Z K AO L S T EH L AH"
  }
}
```
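To illustrate how such an entry is consumed downstream, here is a minimal sketch (not part of the recipe; the `resolve_paths` helper is hypothetical) that substitutes the `{data_root}` placeholder with the actual dataset location:

```python
# A hypothetical helper showing how the "{data_root}" placeholder in the
# manifest paths can be resolved; the example entry mirrors the one above.
manifest = {
    "p232_001": {
        "noisy_wav": "{data_root}/noisy_trainset_28spk_wav_16k/p232_001.wav",
        "clean_wav": "{data_root}/clean_trainset_28spk_wav_16k/p232_001.wav",
        "length": 3.45,
        "words": "PLEASE CALL STELLA",
        "phones": "P L IY Z K AO L S T EH L AH",
    }
}

def resolve_paths(entries, data_root):
    """Substitute the {data_root} placeholder in every wav path."""
    resolved = {}
    for utt_id, entry in entries.items():
        entry = dict(entry)  # copy so the original manifest stays untouched
        for key in ("noisy_wav", "clean_wav"):
            entry[key] = entry[key].replace("{data_root}", data_root)
        resolved[utt_id] = entry
    return resolved

entry = resolve_paths(manifest, "/data/noisy-vctk-16k")["p232_001"]
print(entry["noisy_wav"])
# /data/noisy-vctk-16k/noisy_trainset_28spk_wav_16k/p232_001.wav
```

In the actual training scripts this substitution is done for you by passing `replacements={"data_root": ...}` to `DynamicItemDataset.from_json` (see the integration example below).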
Internal Processing Steps
The function performs these steps in sequence:
1. **Skip check**: If `skip_prep=True` or all output files already exist, return immediately.
2. **Folder validation**: Verify all expected subdirectories exist via `check_voicebank_folders()`.
3. **Lexicon creation**: Download the LibriSpeech lexicon and build a word-to-phoneme mapping via `create_lexicon()`.
4. **Speaker-based splitting**: Use the first `valid_speaker_count` speakers from the predefined `TRAIN_SPEAKERS` list as validation speakers.
5. **File collection**: Use `get_all_files()` with speaker-based filtering:
   - Training: all `.wav` files in the noisy trainset, excluding validation speakers
   - Validation: all `.wav` files in the noisy trainset, matching validation speakers
   - Test: all `.wav` files in the noisy test set
6. **JSON creation**: For each utterance, read the audio to compute its duration, look up phonemes, and write the structured JSON.
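The speaker-based splitting and filtering steps above can be sketched as follows (a simplified illustration, not the recipe's actual code; `split_by_speaker` is a hypothetical helper and the speaker list is truncated):

```python
# Hypothetical sketch of speaker-based splitting: validation speakers are
# the first N entries of TRAIN_SPEAKERS, and each file is routed by the
# speaker prefix of its filename (e.g. "p226" in "p226_001.wav").
TRAIN_SPEAKERS = ["p226", "p287", "p227", "p228"]  # truncated for brevity

def split_by_speaker(wav_names, valid_speaker_count=2):
    """Route filenames to train or valid based on their speaker prefix."""
    valid_speakers = set(TRAIN_SPEAKERS[:valid_speaker_count])
    train, valid = [], []
    for name in wav_names:
        speaker = name.split("_")[0]
        (valid if speaker in valid_speakers else train).append(name)
    return train, valid

train, valid = split_by_speaker(["p226_001.wav", "p227_003.wav", "p287_010.wav"])
print(valid)  # ['p226_001.wav', 'p287_010.wav']
```

Because the split is by speaker rather than by utterance, no speaker's voice appears in both the training and validation sets.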
Usage Examples
Basic Usage from a Training Recipe
```python
from voicebank_prepare import prepare_voicebank
from speechbrain.utils.distributed import run_on_main

# Prepare data (only runs on main process in DDP)
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": "/data/noisy-vctk-16k",
        "save_folder": "/data/noisy-vctk-16k",
        "skip_prep": False,
    },
)
```
Custom Validation Split
```python
from voicebank_prepare import prepare_voicebank

# Use 4 speakers for validation instead of the default 2
prepare_voicebank(
    data_folder="/data/noisy-vctk-16k",
    save_folder="results/experiment_01",
    valid_speaker_count=4,
    skip_prep=False,
)
```
Integration with HyperPyYAML Config
```python
import sys

import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

from voicebank_prepare import prepare_voicebank

# Load hyperparameters
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file, encoding="utf-8") as fin:
    hparams = load_hyperpyyaml(fin, overrides)

# Prepare data using hparams
run_on_main(
    prepare_voicebank,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_folder": hparams["output_folder"],
        "skip_prep": hparams["skip_prep"],
    },
)

# Load prepared data into DynamicItemDataset
train_data = sb.dataio.dataset.DynamicItemDataset.from_json(
    json_path=hparams["train_annotation"],
    replacements={"data_root": hparams["data_folder"]},
)
```
Key Internal Functions
create_json
```python
def create_json(wav_lst, json_file, clean_folder, txt_folder, lexicon):
    """Creates the json file given a list of wav files."""
    json_dict = {}
    for wav_file in wav_lst:
        noisy_path, filename = os.path.split(wav_file)
        _, noisy_dir = os.path.split(noisy_path)
        _, clean_dir = os.path.split(clean_folder)
        noisy_rel_path = os.path.join("{data_root}", noisy_dir, filename)
        clean_rel_path = os.path.join("{data_root}", clean_dir, filename)
        signal = read_audio(wav_file)
        duration = signal.shape[0] / SAMPLERATE
        snt_id = filename.replace(".wav", "")
        # ... phoneme lookup via lexicon ...
        json_dict[snt_id] = {
            "noisy_wav": noisy_rel_path,
            "clean_wav": clean_rel_path,
            "length": duration,
            "words": word_string,
            "phones": phone_string,
        }
    with open(json_file, mode="w", encoding="utf-8") as json_f:
        json.dump(json_dict, json_f, indent=2)
```
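The duration stored in each entry is simply the sample count divided by the sample rate, e.g.:

```python
# Duration in seconds = number of samples / sample rate (16 kHz here).
# The sample count below is illustrative, chosen to match the 3.45 s
# example entry shown earlier.
SAMPLERATE = 16000
num_samples = 55200
duration = num_samples / SAMPLERATE
print(duration)  # 3.45
```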
Speaker List
The predefined training speakers are:
```python
TRAIN_SPEAKERS = [
    "p226", "p287", "p227", "p228", "p230", "p231", "p233", "p236",
    "p239", "p243", "p244", "p250", "p254", "p256", "p258", "p259",
    "p267", "p268", "p269", "p270", "p273", "p274", "p276", "p277",
    "p278", "p279", "p282", "p286",
]
```
With the default `valid_speaker_count=2`, speakers `p226` and `p287` are assigned to validation.
Edge Cases and Notes
- **Idempotency**: The `skip()` function checks whether all output JSON files already exist; if so, preparation is skipped entirely. This makes the function safe to call repeatedly.
- **DDP compatibility**: The function is typically called via `run_on_main()`, ensuring only the main process performs data preparation in distributed training scenarios.
- **Missing lexicon entries**: A `MISSING_LEXICON` dictionary provides phoneme entries for words not found in the standard LibriSpeech lexicon, handling edge cases in the VCTK transcriptions.
- **Sample rate assumption**: The constant `SAMPLERATE = 16000` is used for duration calculation. The raw data must already be resampled to 16 kHz (the `download_vctk()` utility handles this).
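The idempotency behaviour can be pictured as follows (a hypothetical sketch, not the recipe's exact `skip()` implementation): preparation runs only if at least one expected manifest is missing.

```python
import os

# Hypothetical sketch of the skip check: return True only if every
# expected manifest file already exists in save_folder.
def should_skip(save_folder, manifests=("train.json", "valid.json", "test.json")):
    """Return True if all expected JSON manifests are already present."""
    return all(
        os.path.isfile(os.path.join(save_folder, name)) for name in manifests
    )
```

A partially complete `save_folder` (say, only train.json written before a crash) still triggers a full re-run, since `all()` fails as soon as one manifest is missing.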
See Also
- Principle:Speechbrain_Speechbrain_Noisy_Speech_Data_Preparation -- The theoretical foundation for paired data preparation
- Implementation:Speechbrain_Speechbrain_Load_Hyperpyyaml_Enhancement -- How YAML configs reference the prepared data files
- Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward -- How prepared data flows into model training