Implementation:Speechbrain Speechbrain GigaSpeech Dataset

Knowledge Sources	SpeechBrain
Domains	Speech_Recognition, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool that provides a HuggingFace Datasets builder for the GigaSpeech dataset, as used in the SpeechBrain ASR recipes.

Description

This script implements a custom HuggingFace Datasets builder (Gigaspeech) for loading and streaming the GigaSpeech corpus. It defines one dataset configuration for each training subset size (xs, s, m, l, xl) and for the dev and test sets, downloads the audio archives and metadata CSVs from the HuggingFace Hub, and yields examples with audio, text (the transcription), category, and source fields. The builder supports streaming and sharded access to GigaSpeech data through the HuggingFace Datasets API.
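
The pairing of archived audio with CSV metadata described above can be illustrated with a minimal standard-library sketch. It is not the code in dataset.py: the function names, the "segment_id" column, and the file layout inside the archive are assumptions made for illustration only.

```python
import csv
import io
import tarfile


def iter_examples(archive_bytes, metadata_csv):
    """Yield one dict per audio file that has a matching metadata row."""
    # Index the metadata by a hypothetical "segment_id" column.
    rows = {
        row["segment_id"]: row
        for row in csv.DictReader(io.StringIO(metadata_csv))
    }
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumed naming scheme: "<dir>/<segment_id>.<ext>".
            segment_id = member.name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            row = rows.get(segment_id)
            if row is None:
                continue  # audio without metadata is skipped
            yield {
                "audio": {
                    "path": member.name,
                    "bytes": tar.extractfile(member).read(),
                },
                "text": row["text"],
                "category": row["category"],
                "source": row["source"],
            }


def make_demo_archive(files):
    """Build an in-memory tar.gz from a {name: bytes} mapping (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Demo data: two audio files, only one of which has a metadata row.
demo_archive = make_demo_archive(
    {"xs/seg_0001.wav": b"RIFF-demo", "xs/seg_0002.wav": b"RIFF-other"}
)
demo_csv = (
    "segment_id,text,category,source\n"
    "seg_0001,hello world,Science and Technology,podcast\n"
)
examples = list(iter_examples(demo_archive, demo_csv))
```

Reading the archive member by member keeps only one audio file in memory at a time, which is the property that makes this kind of layout suitable for streaming.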

Usage

Use this when loading the GigaSpeech dataset via HuggingFace Datasets for ASR training with SpeechBrain transducer recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/GigaSpeech/ASR/transducer/dataset.py

Signature

class Gigaspeech(datasets.GeneratorBasedBuilder):
    """HuggingFace Datasets builder for GigaSpeech."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        GigaspeechConfig(name=subset) for subset in _SUBSETS + ("dev", "test")
    ]

Import

import datasets

# Load via HuggingFace Datasets using the dataset script
ds = datasets.load_dataset("path/to/dataset.py", name="xs")

I/O Contract

Inputs

Name	Type	Required	Description
name	str	Yes	Subset configuration name: one of "xs", "s", "m", "l", "xl", "dev", or "test"
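
As a rough sketch of how the accepted names relate to the BUILDER_CONFIGS expression in the signature, the snippet below rebuilds the name tuple and checks a candidate against it. The value of _SUBSETS is an assumption inferred from the table above (the page does not show its definition), and check_config_name is a hypothetical helper, not part of the script.

```python
# Assumed value; the page only shows `_SUBSETS + ("dev", "test")`.
_SUBSETS = ("xs", "s", "m", "l", "xl")
CONFIG_NAMES = _SUBSETS + ("dev", "test")


def check_config_name(name):
    """Return the name unchanged if it is a known configuration, else raise."""
    if name not in CONFIG_NAMES:
        raise ValueError(
            f"Unknown configuration {name!r}; expected one of {CONFIG_NAMES}"
        )
    return name


selected = check_config_name("xs")
```

Validating the name before calling load_dataset gives an immediate, readable error instead of a failure partway through builder setup.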

Outputs

Name	Type	Description
audio	Audio	Audio data decoded from tar.gz archives
text	str	Transcription text for the audio segment
category	str	Content category (e.g., "Entertainment", "Science and Technology")
source	str	Audio source type: "audiobook", "podcast", or "youtube"
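
A small sanity check over yielded examples might look like the following. It assumes the four fields listed above and treats category and source as plain strings, as the table states; the helper names and the demo records are hypothetical.

```python
from collections import Counter

EXPECTED_FIELDS = ("audio", "text", "category", "source")


def missing_fields(example):
    """List the expected output fields that are absent from an example."""
    return [field for field in EXPECTED_FIELDS if field not in example]


def count_by_source(examples):
    """Count examples per audio source."""
    return Counter(example["source"] for example in examples)


# Hand-written demo records, not real GigaSpeech data.
demo_examples = [
    {"audio": None, "text": "a", "category": "Science and Technology", "source": "podcast"},
    {"audio": None, "text": "b", "category": "Entertainment", "source": "youtube"},
    {"audio": None, "text": "c", "category": "Entertainment", "source": "podcast"},
]
source_counts = count_by_source(demo_examples)
incomplete = missing_fields({"audio": None, "text": "d"})
```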

Usage Examples

import datasets

# Load the XS subset of GigaSpeech
ds = datasets.load_dataset(
    "path/to/dataset.py",
    name="xs",
    split="train",
)

for example in ds:
    print(example["text"], example["audio"]["sampling_rate"])
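
The loop above visits every example in the split. For a quick look at the data it can be enough to stop after a few items, which works on any iterable dataset, including one obtained with the streaming=True argument of datasets.load_dataset. The helper below is a hypothetical sketch and is demonstrated on a stand-in generator rather than on GigaSpeech itself.

```python
from itertools import islice


def preview(dataset, n=3):
    """Return the first n examples without iterating over the rest."""
    return list(islice(dataset, n))


def fake_stream():
    """Stand-in for a streamed dataset (demo only)."""
    for index in range(1000):
        yield {"text": f"utterance {index}", "source": "podcast"}


first_two = preview(fake_stream(), n=2)
# With a real dataset: preview(ds, n=2), where ds comes from load_dataset.
```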

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
