Implementation:Speechbrain Speechbrain Prepare GigaSpeech

Knowledge Sources	SpeechBrain
Domains	Speech_Recognition, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the GigaSpeech dataset for automatic speech recognition (ASR).

Description

This script creates CSV data manifest files for the GigaSpeech dataset, a large-scale multi-domain English speech recognition corpus with up to 10,000 hours of labeled audio. It processes the GigaSpeech JSON metadata, handles filler words and punctuation tags, converts OPUS audio to WAV format, and supports downloading via both the official tool and HuggingFace Datasets. The script supports configurable training subsets (XS, S, M, L, XL) and generates train, dev, and test CSV splits.
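The filler and punctuation handling described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name normalize_transcript, the exact tag and filler inventories, and the choice to map punctuation tags to symbols are all assumptions made for the example.

```python
import re

# Assumed inventories for illustration; the real script may use different lists.
PUNCTUATION_TAGS = {
    "<COMMA>": ",",
    "<PERIOD>": ".",
    "<QUESTIONMARK>": "?",
    "<EXCLAMATIONPOINT>": "!",
}
GARBAGE_TAGS = ["<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"]
FILLERS = {"UH", "UHH", "UM", "EH", "MM", "HM", "AH", "HUH", "HA", "ER"}


def normalize_transcript(text, punctuation=False, filler=False):
    """Hypothetical helper: clean one GigaSpeech-style transcript string."""
    # Non-speech tags never belong in an ASR target.
    for tag in GARBAGE_TAGS:
        text = text.replace(tag, " ")
    # Either map punctuation tags to symbols or drop them entirely.
    for tag, symbol in PUNCTUATION_TAGS.items():
        text = text.replace(tag, symbol if punctuation else " ")
    words = text.split()
    # Drop filler words unless the caller asked to keep them.
    if not filler:
        words = [w for w in words if w.upper() not in FILLERS]
    text = " ".join(words)
    if punctuation:
        # Attach each punctuation symbol to the preceding word.
        text = re.sub(r"\s+([,.?!])", r"\1", text)
    return text


print(normalize_transcript("UM I THINK <COMMA> IT WORKS <PERIOD>"))
print(normalize_transcript("UM I THINK <COMMA> IT WORKS <PERIOD>", punctuation=True, filler=True))
```

With both flags left at False, the sketch removes tags and fillers so that only the spoken words remain, which mirrors the defaults listed for punctuation and filler in the signature.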

Usage

Use this when preparing the GigaSpeech dataset for ASR training with SpeechBrain recipes, particularly the transducer recipe.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/GigaSpeech/ASR/transducer/gigaspeech_prepare.py

Signature

def prepare_gigaspeech(
    data_folder: str,
    save_folder: str,
    splits: list,
    output_train: str,
    output_dev: str,
    output_test: str,
    json_file: str = "GigaSpeech.json",
    skip_prep: bool = False,
    convert_opus_to_wav: bool = True,
    download_with_HF: bool = False,
    punctuation: bool = False,
    filler: bool = False,
    hf_multiprocess_load: bool = True,
) -> None:

Import

from recipes.GigaSpeech.ASR.transducer.gigaspeech_prepare import prepare_gigaspeech

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the GigaSpeech dataset
save_folder	str	Yes	Path to the folder where CSV files will be saved
splits	list	Yes	List of splits to create CSV files for (e.g. ["XS", "DEV", "TEST"])
output_train	str	Yes	Path where the train CSV will be saved
output_dev	str	Yes	Path where the dev CSV will be saved
output_test	str	Yes	Path where the test CSV will be saved
json_file	str	No	Name of the GigaSpeech JSON metadata file (default: "GigaSpeech.json")
skip_prep	bool	No	If True, skip data preparation (default: False)
convert_opus_to_wav	bool	No	Convert OPUS audio to WAV (default: True)
download_with_HF	bool	No	Download using HuggingFace Datasets (default: False)
punctuation	bool	No	Include punctuation tags in transcripts (default: False)
filler	bool	No	Include filler words in transcripts (default: False)
hf_multiprocess_load	bool	No	Use multiprocessing for HF loading (default: True)
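Before calling the preparation function, it can help to sanity-check the splits argument. The sketch below is a hypothetical pre-flight helper (check_splits is not part of SpeechBrain). It uses the subset names listed in the Description and assumes that at most one training subset should be requested per run; that restriction is an assumption of this example, not something stated in the source.

```python
TRAIN_SUBSETS = ("XS", "S", "M", "L", "XL")
EVAL_SPLITS = ("DEV", "TEST")


def check_splits(splits):
    """Hypothetical helper: validate a splits list before preparation.

    Returns (train_subset_or_None, list_of_eval_splits).
    """
    known = TRAIN_SUBSETS + EVAL_SPLITS
    unknown = [s for s in splits if s not in known]
    if unknown:
        raise ValueError(f"Unknown splits: {unknown}")
    train = [s for s in splits if s in TRAIN_SUBSETS]
    # Assumption for this sketch: only one training subset per run.
    if len(train) > 1:
        raise ValueError(f"Expected at most one training subset, got {train}")
    evals = [s for s in splits if s in EVAL_SPLITS]
    return (train[0] if train else None), evals


train_subset, eval_splits = check_splits(["XS", "DEV", "TEST"])
print(train_subset, eval_splits)
```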

Outputs

Name	Type	Description
train.csv	CSV	Training split manifest with audio paths, durations, and transcriptions
dev.csv	CSV	Development split manifest
test.csv	CSV	Test split manifest
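Once the manifests exist, a quick check of row count and total audio duration can catch an incomplete preparation run. The sketch below assumes the manifest has a column named duration holding seconds; the column name and the helper summarize_manifest are assumptions for illustration, so adjust them to the header actually written by the script.

```python
import csv


def summarize_manifest(csv_path, duration_column="duration"):
    """Hypothetical helper: return (number_of_rows, total_hours) for a manifest."""
    n_rows = 0
    total_seconds = 0.0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            n_rows += 1
            # Assumes durations are stored in seconds.
            total_seconds += float(row[duration_column])
    return n_rows, total_seconds / 3600.0
```

For example, summarize_manifest("/path/to/output/train.csv") would report how many utterances and how many hours the selected training subset contains.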

Usage Examples

from recipes.GigaSpeech.ASR.transducer.gigaspeech_prepare import prepare_gigaspeech

prepare_gigaspeech(
    data_folder="/path/to/GigaSpeech",
    save_folder="/path/to/output",
    splits=["XS", "DEV", "TEST"],
    output_train="/path/to/output/train.csv",
    output_dev="/path/to/output/dev.csv",
    output_test="/path/to/output/test.csv",
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
