Implementation:Speechbrain Speechbrain LibriSpeech LM Dataset

From Leeroopedia


Knowledge Sources
Domains: Language_Modeling, Data_Preparation
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for building a HuggingFace Datasets-compatible LibriSpeech language modeling dataset.

Description

This module implements a custom HuggingFace Datasets builder for the LibriSpeech language modeling corpus. It extends the official HuggingFace implementation so that both the train-960 transcripts and the external LM corpus (librispeech-lm-norm.txt) can be used for language model training.

The LibrispeechLmConfig class exposes a configurable lm_corpus_path, which defaults to the OpenSLR download URL. The LibrispeechLm builder class handles downloading, extraction, and generation of text examples. For training splits, it concatenates the external LM corpus with the train transcript files. The generator strips utterance IDs from transcript lines, drops empty lines, and skips sentences longer than 1000 characters to prevent out-of-memory errors. Each yielded example consists of a single "text" field.
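The filtering steps described above can be sketched in plain Python. This is an illustrative sketch, not the SpeechBrain source: the function name `iter_lm_text`, the `is_transcript` flag, and the assumption that the utterance ID is the first whitespace-delimited token of a transcript line are hypothetical. Only the 1000-character threshold and the single "text" field come from the description.

```python
from typing import Dict, Iterable, Iterator, Tuple

MAX_CHARS = 1000  # threshold taken from the description above


def iter_lm_text(
    lines: Iterable[str], is_transcript: bool = False
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, {"text": ...}) pairs from raw corpus or transcript lines."""
    key = 0
    for line in lines:
        text = line.strip()
        if is_transcript:
            # Assumption: the utterance ID is the first token, e.g.
            # "1089-134686-0000 HE HOPED ..."; drop it and keep the rest.
            parts = text.split(maxsplit=1)
            text = parts[1] if len(parts) > 1 else ""
        if not text:
            continue  # filter empty lines
        if len(text) > MAX_CHARS:
            continue  # skip very long sentences
        yield key, {"text": text}
        key += 1


sample = [
    "1089-134686-0000 HE HOPED THERE WOULD BE STEW",
    "",
    "1089-134686-0001 " + "X" * 1001,
]
examples = list(iter_lm_text(sample, is_transcript=True))
```

In this sketch the blank line and the over-long line are dropped, so only the first utterance survives, with its ID removed.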

Usage

Use this module as a HuggingFace Datasets builder together with the LibriSpeech LM training recipe. It can be loaded through the datasets library, optionally with a custom LM corpus path in the configuration.

Code Reference

Source Location

Signature

class LibrispeechLmConfig(datasets.BuilderConfig):
    """Builder config for LibriSpeech LM."""

    def __init__(self, **kwargs):
        self.lm_corpus_path = kwargs.pop("lm_corpus_path", None)
        super().__init__(**kwargs)

    def __post_init__(self):
        if self.lm_corpus_path is None:
            self.lm_corpus_path = _DL_URL
        ...

class LibrispeechLm(datasets.GeneratorBasedBuilder):
    """Librispeech language modeling dataset."""

    VERSION = datasets.Version("0.1.0")
    BUILDER_CONFIG_CLASS = LibrispeechLmConfig

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        ...

    def _generate_examples(self, archive_path):
        """Yields examples."""
        ...
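The default-path behaviour of the config can be illustrated without the datasets library. The sketch below is a hypothetical stand-in, not the actual LibrispeechLmConfig: the class name `LmConfigSketch` and the placeholder URL are assumptions made for illustration, and the real default is whatever the module defines as `_DL_URL`.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder for illustration only; the real default is the module's _DL_URL.
PLACEHOLDER_DL_URL = "https://example.org/librispeech-lm-norm.txt.gz"


@dataclass
class LmConfigSketch:
    """Hypothetical stand-in for a builder config with an optional corpus path."""

    name: str = "default"
    lm_corpus_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Fall back to the download URL only when the caller gave no path.
        if self.lm_corpus_path is None:
            self.lm_corpus_path = PLACEHOLDER_DL_URL


default_cfg = LmConfigSketch()
local_cfg = LmConfigSketch(lm_corpus_path="/data/librispeech-lm-norm.txt.gz")
```

Resolving the default in `__post_init__` keeps `None` as the "not set" marker, so a caller-supplied local path is never overwritten.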

Import

import datasets

# Load using HuggingFace datasets with the custom script
dataset = datasets.load_dataset("./dataset.py", data_files={"train": [...], "test": [...]})

I/O Contract

Inputs

Name | Type | Required | Description
lm_corpus_path | str | No | Path or URL to the LM corpus file (default: OpenSLR librispeech-lm-norm.txt.gz)
data_files | dict | Yes | Mapping of split names to transcript file paths
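How the two inputs combine per split can be sketched as follows. This is a minimal illustration under stated assumptions, not the builder's actual `_split_generators` logic: the helper name `files_per_split` is hypothetical, and placing the LM corpus before the transcripts is an assumed ordering. The source only states that the training split concatenates the external corpus with the train transcript files.

```python
from typing import Dict, List


def files_per_split(
    data_files: Dict[str, List[str]], lm_corpus_path: str
) -> Dict[str, List[str]]:
    """Return the list of files to read for each split.

    Assumption: only the "train" split receives the external LM corpus,
    and it is placed ahead of the transcript files.
    """
    plan: Dict[str, List[str]] = {}
    for split, transcripts in data_files.items():
        if split == "train":
            plan[split] = [lm_corpus_path] + list(transcripts)
        else:
            plan[split] = list(transcripts)
    return plan


plan = files_per_split(
    {"train": ["train-960-transcripts.txt"], "test": ["test-transcripts.txt"]},
    lm_corpus_path="librispeech-lm-norm.txt",
)
```

Under these assumptions the test split reads only its own transcripts, so evaluation text is never mixed with the external corpus.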

Outputs

Name | Type | Description
text | str | A single line of text from the LM corpus or transcript (utterance IDs removed)

Usage Examples

import datasets

# Load with default LM corpus (auto-downloaded from OpenSLR)
lm_dataset = datasets.load_dataset(
    "./dataset.py",
    data_files={
        "train": ["train-960-transcripts.txt"],
        "test": ["test-transcripts.txt"],
    },
)

# Iterate over examples
for example in lm_dataset["train"]:
    text = example["text"]
    # Process text for LM training...
