Implementation:Speechbrain Speechbrain GigaSpeech Dataset
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete HuggingFace Datasets builder for the GigaSpeech dataset, used in SpeechBrain ASR recipes.
Description
This script implements a custom HuggingFace Datasets builder (Gigaspeech) for loading and streaming the GigaSpeech corpus. It defines one dataset configuration per subset (xs, s, m, l, xl, dev, test), downloads the audio archives and metadata CSVs from the HuggingFace Hub, and yields examples with audio, transcription, category, and source fields. The builder supports streaming and sharded access to GigaSpeech data within the HuggingFace ecosystem.
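The pairing of audio archives with metadata CSVs described above can be illustrated with a minimal standard-library sketch. It is not the builder's actual code: the helper names (make_demo_archive, iter_examples), the "sid" column, the file naming, and the tiny in-memory archive are all assumptions made for illustration.

```python
import csv
import io
import tarfile


def make_demo_archive(files):
    """Build a small in-memory tar.gz from a {name: bytes} mapping (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_examples(archive_fileobj, metadata_csv_text):
    """Yield one example per archive member that has a metadata row."""
    # Index metadata rows by segment id ("sid" is a hypothetical column name).
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    rows = {row["sid"]: row for row in reader}
    # Stream mode "r|gz" reads members sequentially, without seeking.
    with tarfile.open(fileobj=archive_fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "xs/seg_0001.wav" -> "seg_0001"
            segment_id = member.name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            row = rows.get(segment_id)
            if row is None:
                continue  # skip audio without metadata
            audio_bytes = tar.extractfile(member).read()
            yield {
                "audio": {"path": member.name, "bytes": audio_bytes},
                "text": row["text"],
                "category": row["category"],
                "source": row["source"],
            }


# Placeholder bytes stand in for real audio data.
demo_archive = make_demo_archive(
    {"xs/seg_0001.wav": b"\x00\x01", "xs/seg_0002.wav": b"\x02\x03"}
)
demo_metadata = (
    "sid,text,category,source\n"
    "seg_0001,HELLO WORLD,Science and Technology,podcast\n"
    "seg_0002,GOOD MORNING,Entertainment,youtube\n"
)
examples = list(iter_examples(demo_archive, demo_metadata))
print(len(examples), examples[0]["text"])
```

Reading the archive sequentially is what makes a streaming mode possible: each example is produced as soon as its member is reached, without extracting the whole archive first.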
Usage
Use this when loading the GigaSpeech dataset via HuggingFace Datasets for ASR training with SpeechBrain transducer recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/GigaSpeech/ASR/transducer/dataset.py
Signature
class Gigaspeech(datasets.GeneratorBasedBuilder):
    """HuggingFace Datasets builder for GigaSpeech."""
    VERSION = datasets.Version("1.0.0")
    BUILDER_CONFIGS = [
        GigaspeechConfig(name=subset) for subset in _SUBSETS + ("dev", "test")
    ]
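To make the BUILDER_CONFIGS comprehension concrete, the following sketch reproduces it without the datasets dependency. The value of _SUBSETS is an assumption inferred from the subset names listed in the description, and SubsetConfig is a hypothetical stand-in for GigaspeechConfig that models only the name.

```python
from dataclasses import dataclass

# Assumed value, inferred from the subset names listed in the description.
_SUBSETS = ("xs", "s", "m", "l", "xl")


@dataclass(frozen=True)
class SubsetConfig:
    """Hypothetical stand-in for GigaspeechConfig; only the name is modelled."""

    name: str


# Same comprehension shape as in the signature: training subsets, then the
# two evaluation-only configurations.
BUILDER_CONFIGS = [
    SubsetConfig(name=subset) for subset in _SUBSETS + ("dev", "test")
]
config_names = [config.name for config in BUILDER_CONFIGS]
print(config_names)
```

Under that assumption, the builder exposes seven configurations, one per value accepted by the name argument.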
Import
import datasets
# Load via HuggingFace Datasets using the dataset script
ds = datasets.load_dataset("path/to/dataset.py", name="xs")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Subset configuration name: one of "xs", "s", "m", "l", "xl", "dev", or "test" |
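Since name is required and restricted to a fixed set of values, a caller may want to validate it before starting a download. The helper below is a hypothetical convenience, not part of the dataset script; it only assembles keyword arguments (path, name, split) that datasets.load_dataset accepts.

```python
# Configuration names taken from the Inputs table.
VALID_CONFIG_NAMES = ("xs", "s", "m", "l", "xl", "dev", "test")


def build_load_kwargs(script_path, name, split=None):
    """Return keyword arguments for datasets.load_dataset, validating name first."""
    if name not in VALID_CONFIG_NAMES:
        raise ValueError(
            f"Unknown GigaSpeech config {name!r}; "
            f"expected one of {', '.join(VALID_CONFIG_NAMES)}"
        )
    kwargs = {"path": script_path, "name": name}
    if split is not None:
        kwargs["split"] = split
    return kwargs


load_kwargs = build_load_kwargs("path/to/dataset.py", "xs", split="train")
print(load_kwargs)

# With the datasets library installed, the call would then be:
#   ds = datasets.load_dataset(**load_kwargs)
```

Depending on the installed datasets version, loading from a local script may need extra opt-in arguments or may not be supported; this sketch does not cover that.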
Outputs
| Name | Type | Description |
|---|---|---|
| audio | Audio | Audio data decoded from tar.gz archives |
| text | str | Transcription text for the audio segment |
| category | str | Content category (e.g., "Entertainment", "Science and Technology") |
| source | str | Audio source type: "audiobook", "podcast", or "youtube" |
Usage Examples
import datasets
# Load the XS subset of GigaSpeech
ds = datasets.load_dataset(
    "path/to/dataset.py",
    name="xs",
    split="train",
)
for example in ds:
    print(example["text"], example["audio"]["sampling_rate"])
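Beyond printing transcriptions, the source field can be used to inspect how a subset is composed. The sketch below assumes each example is a dict whose "source" value is a string, as listed in the Outputs table; count_by_source is a hypothetical helper, and the demo records are stand-ins for a loaded dataset.

```python
from collections import Counter


def count_by_source(examples):
    """Count examples per value of the "source" field."""
    return Counter(example["source"] for example in examples)


# Stand-in records; a real run would pass the loaded dataset instead.
demo_examples = [
    {"text": "HELLO WORLD", "source": "podcast"},
    {"text": "GOOD MORNING", "source": "youtube"},
    {"text": "CHAPTER ONE", "source": "podcast"},
]
source_counts = count_by_source(demo_examples)
print(source_counts.most_common())
```

The helper accepts any iterable of dict-like examples, so the same function should work on the ds object from the example above.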