Implementation:Speechbrain Speechbrain GigaSpeech Dataset

From Leeroopedia


Knowledge Sources
Domains: Speech_Recognition, Data_Preparation
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool providing a HuggingFace Datasets builder for the GigaSpeech dataset used in SpeechBrain ASR recipes.

Description

This script implements a custom HuggingFace Datasets builder (Gigaspeech) for loading and streaming the GigaSpeech corpus. It defines one dataset configuration for each training subset size (xs, s, m, l, xl) and for the dev and test evaluation sets, downloads the audio archives and metadata CSVs from the HuggingFace Hub, and yields examples with audio, text (the transcription), category, and source fields. The builder supports streaming and sharded access to GigaSpeech data within the HuggingFace ecosystem.
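
The pairing of audio archives with metadata CSVs described above can be illustrated with a minimal, standard-library-only sketch. It is not the builder's actual code: the helper names (make_demo_archive, generate_examples), the CSV column names (segment_id, text, category, source), and the file names are all assumptions chosen for illustration, and the archive is built in memory rather than downloaded from the Hub.

```python
import csv
import io
import tarfile


def make_demo_archive(files):
    """Build an in-memory tar.gz from a {filename: bytes} mapping (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def generate_examples(archive_bytes, metadata_csv):
    """Yield (key, example) pairs by joining archive members with CSV rows."""
    # Index metadata rows by segment id. The column names are hypothetical.
    reader = csv.DictReader(io.StringIO(metadata_csv))
    rows = {row["segment_id"]: row for row in reader}

    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assume the file stem doubles as the segment id.
            segment_id = member.name.rsplit(".", 1)[0]
            row = rows.get(segment_id)
            if row is None:
                continue  # skip audio that has no metadata row
            audio_bytes = tar.extractfile(member).read()
            yield segment_id, {
                "audio": {"path": member.name, "bytes": audio_bytes},
                "text": row["text"],
                "category": row["category"],
                "source": row["source"],
            }


# Placeholder payloads stand in for real audio data.
demo_archive = make_demo_archive(
    {"seg_0001.wav": b"\x00\x01", "seg_0002.wav": b"\x02\x03"}
)
demo_metadata = (
    "segment_id,text,category,source\n"
    "seg_0001,HELLO WORLD,Science and Technology,podcast\n"
    "seg_0002,GOOD MORNING,Entertainment,youtube\n"
)
examples = dict(generate_examples(demo_archive, demo_metadata))
```

Yielding (key, example) pairs mirrors the shape a generator-based builder is expected to produce, which is why the sketch keys each example by its segment id.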

Usage

Use this when loading the GigaSpeech dataset via HuggingFace Datasets for ASR training with SpeechBrain transducer recipes.

Code Reference

Source Location

Signature

class Gigaspeech(datasets.GeneratorBasedBuilder):
    """HuggingFace Datasets builder for GigaSpeech."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        GigaspeechConfig(name=subset) for subset in _SUBSETS + ("dev", "test")
    ]

Import

import datasets

# Load via HuggingFace Datasets using the dataset script
ds = datasets.load_dataset("path/to/dataset.py", name="xs")

I/O Contract

Inputs

Name   Type   Required   Description
name   str    Yes        Subset configuration name: one of "xs", "s", "m", "l", "xl", "dev", or "test"
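
Since only the seven configuration names listed above are accepted, it can be convenient to validate a user-supplied name before starting a potentially large download. The following is a small sketch; check_config_name and the constant names are hypothetical helpers, not part of the builder or of the datasets library.

```python
# Configuration names taken from the Inputs table above.
TRAIN_SUBSETS = ("xs", "s", "m", "l", "xl")
EVAL_SETS = ("dev", "test")
VALID_CONFIG_NAMES = TRAIN_SUBSETS + EVAL_SETS


def check_config_name(name):
    """Return the normalized config name, or raise ValueError if unknown."""
    normalized = name.strip().lower()
    if normalized not in VALID_CONFIG_NAMES:
        raise ValueError(
            f"Unknown GigaSpeech config {name!r}; "
            f"expected one of {', '.join(VALID_CONFIG_NAMES)}"
        )
    return normalized


subset = check_config_name(" XS ")
```

The returned value could then be passed as the name argument when loading the dataset.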

Outputs

Name       Type    Description
audio      Audio   Audio data decoded from tar.gz archives
text       str     Transcription text for the audio segment
category   str     Content category (e.g., "Entertainment", "Science and Technology")
source     str     Audio source type: "audiobook", "podcast", or "youtube"
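
As a quick sanity check on the output fields above, one might tally examples by their source value while confirming that every expected field is present. This is a sketch over plain dictionaries; summarize_sources and the demo records are illustrative assumptions, and the audio values are placeholders rather than decoded audio.

```python
from collections import Counter

# Field names taken from the Outputs table above.
EXPECTED_FIELDS = ("audio", "text", "category", "source")


def summarize_sources(examples):
    """Count examples per source, raising if an expected field is missing."""
    counts = Counter()
    for example in examples:
        missing = [field for field in EXPECTED_FIELDS if field not in example]
        if missing:
            raise KeyError(f"example is missing fields: {missing}")
        counts[example["source"]] += 1
    return counts


# Hand-written stand-ins for examples yielded by the builder.
demo_examples = [
    {"audio": None, "text": "A", "category": "Entertainment", "source": "podcast"},
    {"audio": None, "text": "B", "category": "Entertainment", "source": "youtube"},
    {"audio": None, "text": "C", "category": "Science and Technology", "source": "podcast"},
]
source_counts = summarize_sources(demo_examples)
```

Any iterable of example dictionaries works here, so the same function can be applied to a loaded split.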

Usage Examples

import datasets

# Load the XS subset of GigaSpeech
ds = datasets.load_dataset(
    "path/to/dataset.py",
    name="xs",
    split="train",
)

for example in ds:
    print(example["text"], example["audio"]["sampling_rate"])
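
For the larger subsets, iterating over a whole split just to inspect a few examples is wasteful. If the dataset is loaded in streaming mode it behaves as an iterable rather than an indexable sequence, so a bounded peek is a natural fit. The sketch below uses a stand-in generator instead of a real dataset; peek and fake_stream are hypothetical names, and the records are placeholders.

```python
from itertools import islice


def peek(examples, n=3):
    """Return (text, sampling_rate) for the first n examples of any iterable."""
    return [
        (example["text"], example["audio"]["sampling_rate"])
        for example in islice(examples, n)
    ]


def fake_stream():
    """Stand-in for a streamed split; yields placeholder examples lazily."""
    for index in range(1000):
        yield {
            "text": f"UTTERANCE {index}",
            "audio": {"sampling_rate": 16000, "array": []},
        }


first_two = peek(fake_stream(), n=2)
```

Because islice stops after n items, only the first n examples are ever produced by the underlying iterable.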
