Implementation:Speechbrain Speechbrain Prepare GSC

Knowledge Sources	SpeechBrain
Domains	Keyword_Spotting, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the Google Speech Commands V2 dataset for keyword spotting.

Description

This script prepares CSV data manifest files for the Google Speech Commands V2 dataset, which contains short audio clips of spoken commands. It handles automatic dataset download, splits data into train/validation/test using the official hashing-based assignment, supports configurable lists of wanted command words, and generates silence and unknown-word classes. The output CSV files include audio paths and class labels suitable for keyword spotting / command recognition tasks.
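The hashing-based assignment mentioned above can be sketched roughly as follows. This is a minimal illustration modeled on the partitioning scheme commonly used with this dataset, not a copy of the SpeechBrain script: the function name `assign_split` and the constant `MAX_NUM_WAVS_PER_CLASS` are assumptions made for this example. The idea is that the part of the file name before `_nohash_` is hashed, so every recording sharing that prefix lands in the same split and the assignment stays stable across runs.

```python
import hashlib
import re

# Assumed constant for this sketch: upper bound used to scale the hash to 0..100.
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1


def assign_split(filename, validation_percentage=10, testing_percentage=10):
    """Return "train", "valid", or "test" for a clip, based only on its name."""
    # Keep only the file name, whatever the path separator is.
    base_name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    # Ignore everything after "_nohash_" so variants of one recording stay together.
    hash_key = re.sub(r"_nohash_.*$", "", base_name)
    digest = hashlib.sha1(hash_key.encode("utf-8")).hexdigest()
    # Map the hash to a pseudo-percentage in the range [0, 100].
    percentage_hash = (int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) * (
        100.0 / MAX_NUM_WAVS_PER_CLASS
    )
    if percentage_hash < validation_percentage:
        return "valid"
    if percentage_hash < validation_percentage + testing_percentage:
        return "test"
    return "train"
```

Because the split depends only on the file name, adding new clips later does not move existing clips between splits.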

Usage

Use this when preparing the Google Speech Commands V2 dataset for keyword spotting or command recognition training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/Google-speech-commands/prepare_GSC.py

Signature

def prepare_GSC(
    data_folder,
    save_folder,
    validation_percentage=10,
    testing_percentage=10,
    percentage_unknown=10,
    percentage_silence=10,
    words_wanted=[
        "yes", "no", "up", "down", "left",
        "right", "on", "off", "stop", "go",
    ],
    skip_prep=False,
):

Import

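# The recipe directory name (Google-speech-commands) contains hyphens, so it
# cannot be imported as a Python package. Run from that directory, or add it
# to sys.path, then:
from prepare_GSC import prepare_GSC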

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the dataset; if not present, it will be downloaded here
save_folder	str	Yes	Folder where data manifest files will be stored
validation_percentage	int	No	Percentage of data to use for validation (default: 10)
testing_percentage	int	No	Percentage of data to use for testing (default: 10)
percentage_unknown	int	No	Number of unknown-word samples to keep, expressed as a percentage of the number of known-word samples (default: 10)
percentage_silence	int	No	Number of silence samples to generate, expressed as a percentage of the number of known-word samples (default: 10)
words_wanted	list	No	List of commands to use from the dataset (default: 10 standard commands)
skip_prep	bool	No	If True, skip data preparation (default: False)
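To make "relative to known words" concrete, the arithmetic behind `percentage_unknown` can be sketched as below. This is an illustrative calculation under stated assumptions, not the script's code: the helper name `unknown_keep_probability` and the sample counts are hypothetical. It converts a target expressed as a percentage of known-word samples into a per-sample keep probability for the unknown-word pool.

```python
def unknown_keep_probability(num_known, num_unknown, percentage_unknown=10):
    """Probability of keeping each unknown-word clip (hypothetical helper)."""
    if num_unknown <= 0:
        return 0.0
    # Target number of unknown clips: a percentage of the known-word count.
    target_unknown = num_known * percentage_unknown / 100.0
    # Spread that target over the whole unknown pool, capped at 1.
    return min(1.0, target_unknown / num_unknown)


# Hypothetical counts: 1000 known clips, 2000 unknown clips, 10 percent target.
# About 100 unknown clips are wanted, so each is kept with probability 0.05.
probability = unknown_keep_probability(1000, 2000, 10)
```

Sampling each unknown clip independently with this probability yields roughly the target count, not an exact one.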

Outputs

Name	Type	Description
train.csv	CSV	Training split manifest with audio paths and command labels
valid.csv	CSV	Validation split manifest
test.csv	CSV	Test split manifest
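A quick sanity check on a generated manifest is to count how many rows each class has. The sketch below uses only the standard library; the column name `command` is an assumption for this example and should be checked against the header of the actual CSV files. The helper name `count_labels` is likewise hypothetical.

```python
import csv
from collections import Counter


def count_labels(csv_lines, label_column="command"):
    """Count rows per class label in a manifest.

    csv_lines is any iterable of CSV text lines, such as an open file.
    label_column is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row[label_column] for row in reader)


# Typical use with a file on disk (the path is a placeholder):
#
# with open("/path/to/output/train.csv", newline="", encoding="utf-8") as handle:
#     print(count_labels(handle))
```

A heavily skewed count for the unknown or silence class is a hint that `percentage_unknown` or `percentage_silence` needs adjusting.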

Usage Examples

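# Run from recipes/Google-speech-commands/; the hyphenated directory name
# cannot be imported as a Python package.
from prepare_GSC import prepare_GSC

prepare_GSC(
    data_folder="/path/to/GSC",  # dataset is downloaded here if not present
    save_folder="/path/to/output",  # train.csv, valid.csv, and test.csv are written here
    words_wanted=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"],
)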

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
