Implementation:Speechbrain Speechbrain Prepare GigaSpeech
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the SpeechBrain library, for preparing the GigaSpeech dataset for automatic speech recognition (ASR).
Description
This script creates CSV data manifest files for the GigaSpeech dataset, a large-scale multi-domain English speech recognition corpus with up to 10,000 hours of labeled audio. It processes the GigaSpeech JSON metadata, handles filler words and punctuation tags, converts OPUS audio to WAV format, and supports downloading via both the official tool and HuggingFace Datasets. The script supports configurable training subsets (XS, S, M, L, XL) and generates train, dev, and test CSV splits.
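The handling of punctuation tags and filler words described above can be illustrated with a minimal sketch. It is not the script's actual implementation: the function name `normalize_text` is hypothetical, the filler list is an illustrative assumption, and the tag names follow the GigaSpeech transcript conventions (`<COMMA>`, `<PERIOD>`, `<QUESTIONMARK>`, `<EXCLAMATIONPOINT>` for punctuation; `<SIL>`, `<MUSIC>`, `<NOISE>`, `<OTHER>` for non-speech).

```python
# Punctuation tags used in GigaSpeech transcripts, mapped to their symbols.
PUNCTUATION_TAGS = {
    "<COMMA>": ",",
    "<PERIOD>": ".",
    "<QUESTIONMARK>": "?",
    "<EXCLAMATIONPOINT>": "!",
}
# Non-speech tags that carry no transcribable content.
GARBAGE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}
# Illustrative filler list (assumption); the real script may use a different set.
FILLERS = {"UH", "UM", "EH", "MM", "HM", "AH", "HUH", "ER"}


def normalize_text(text, punctuation=False, filler=False):
    """Hypothetical helper: clean one transcript according to the two flags."""
    words = []
    for token in text.upper().split():
        if token in GARBAGE_TAGS:
            continue  # always drop non-speech tags
        if token in PUNCTUATION_TAGS:
            if punctuation:
                mark = PUNCTUATION_TAGS[token]
                if words:
                    words[-1] += mark  # attach the mark to the previous word
                else:
                    words.append(mark)
            continue
        if not filler and token in FILLERS:
            continue  # drop filler words unless they were requested
        words.append(token)
    return " ".join(words)
```

With both flags left at `False`, as in the default signature, only the spoken words remain; setting `punctuation=True` or `filler=True` keeps the corresponding material in the transcript.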
Usage
Use this when preparing the GigaSpeech dataset for ASR training with SpeechBrain recipes, particularly the transducer recipe.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/GigaSpeech/ASR/transducer/gigaspeech_prepare.py
Signature
def prepare_gigaspeech(
    data_folder: str,
    save_folder: str,
    splits: list,
    output_train: str,
    output_dev: str,
    output_test: str,
    json_file: str = "GigaSpeech.json",
    skip_prep: bool = False,
    convert_opus_to_wav: bool = True,
    download_with_HF: bool = False,
    punctuation: bool = False,
    filler: bool = False,
    hf_multiprocess_load: bool = True,
) -> None:
Import
from recipes.GigaSpeech.ASR.transducer.gigaspeech_prepare import prepare_gigaspeech
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the GigaSpeech dataset |
| save_folder | str | Yes | Path to the folder where CSV files will be saved |
| splits | list | Yes | List of splits to create CSV files for (e.g. ["XS", "DEV", "TEST"]) |
| output_train | str | Yes | Path where the train CSV will be saved |
| output_dev | str | Yes | Path where the dev CSV will be saved |
| output_test | str | Yes | Path where the test CSV will be saved |
| json_file | str | No | Name of the GigaSpeech JSON metadata file (default: "GigaSpeech.json") |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
| convert_opus_to_wav | bool | No | Convert OPUS audio to WAV (default: True) |
| download_with_HF | bool | No | Download using HuggingFace Datasets (default: False) |
| punctuation | bool | No | Include punctuation tags in transcripts (default: False) |
| filler | bool | No | Include filler words in transcripts (default: False) |
| hf_multiprocess_load | bool | No | Use multiprocessing for HF loading (default: True) |
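Because `splits` mixes a training subset name with the evaluation split names, it can help to validate the list before calling the preparation function. The sketch below is a hypothetical pre-check, not part of the script: the helper name `partition_splits` is invented for illustration, and the rule that at most one training subset may be selected per run is an assumption.

```python
# Split names taken from the description above.
TRAIN_SUBSETS = ("XS", "S", "M", "L", "XL")
EVAL_SPLITS = ("DEV", "TEST")


def partition_splits(splits):
    """Hypothetical helper: separate the training subset from the eval splits."""
    unknown = [s for s in splits if s not in TRAIN_SUBSETS + EVAL_SPLITS]
    if unknown:
        raise ValueError(f"Unknown split name(s): {unknown}")

    train = [s for s in splits if s in TRAIN_SUBSETS]
    # Assumption: only one training subset is prepared per run.
    if len(train) > 1:
        raise ValueError(f"Expected at most one training subset, got {train}")

    evals = [s for s in splits if s in EVAL_SPLITS]
    return (train[0] if train else None), evals
```

For example, `partition_splits(["XS", "DEV", "TEST"])` yields the training subset `"XS"` and the evaluation splits `["DEV", "TEST"]`, while a misspelled name fails early instead of partway through a long preparation run.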
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV | Training split manifest with audio paths, durations, and transcriptions |
| dev.csv | CSV | Development split manifest |
| test.csv | CSV | Test split manifest |
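A quick way to sanity-check a generated manifest is to count its rows and sum its durations. The sketch below uses only the standard library; the helper name `summarize_manifest` and the column name `duration` are assumptions for illustration, since the exact CSV header is not listed here, so adjust `duration_column` to match the actual file.

```python
import csv


def summarize_manifest(csv_path, duration_column="duration"):
    """Hypothetical helper: report columns, row count, and total hours."""
    n_rows = 0
    total_seconds = 0.0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        for row in reader:
            n_rows += 1
            # Only accumulate durations if the assumed column is present.
            if duration_column in row:
                total_seconds += float(row[duration_column])
    return {
        "columns": columns,
        "rows": n_rows,
        "hours": total_seconds / 3600.0,
    }
```

Comparing the reported hours against the nominal size of the chosen training subset gives an early signal that utterances were dropped or that the wrong subset was prepared.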
Usage Examples
from recipes.GigaSpeech.ASR.transducer.gigaspeech_prepare import prepare_gigaspeech

prepare_gigaspeech(
    data_folder="/path/to/GigaSpeech",
    save_folder="/path/to/output",
    splits=["XS", "DEV", "TEST"],
    output_train="train.csv",
    output_dev="dev.csv",
    output_test="test.csv",
)