Implementation:Speechbrain Speechbrain LibriSpeech LM Dataset

Knowledge Sources	SpeechBrain
Domains	Language_Modeling, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for building a HuggingFace Datasets-compatible dataset from the LibriSpeech language modeling corpus.

Description

This module implements a custom HuggingFace Datasets builder for the LibriSpeech language modeling corpus. It extends the official HuggingFace implementation so that a language model can be trained on both the train-960 transcripts and the external LM corpus (librispeech-lm-norm.txt). The LibrispeechLmConfig class exposes a configurable lm_corpus_path, which falls back to the OpenSLR download URL when it is not set. The LibrispeechLm builder class handles downloading, extraction, and generation of text examples. For the training split, it concatenates the external LM corpus with the train transcript files. The generator strips utterance IDs from transcript lines, drops empty lines, and skips very long sentences (over 1000 characters) to prevent out-of-memory errors. Each yielded example consists of a single "text" field.
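
The filtering rules described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the builder's actual code: the function name iter_examples, the UTT_ID regular expression (which assumes IDs of the form speaker-chapter-utterance, such as 1089-134686-0000), and the use of an integer counter as the example key are all hypothetical. Only the three rules and the 1000-character threshold come from the description.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed ID layout: "speaker-chapter-utterance " at the start of a transcript line.
UTT_ID = re.compile(r"^\d+-\d+-\d+\s+")
MAX_CHARS = 1000  # threshold named in the description


def iter_examples(lines: Iterable[str]) -> Iterator[Tuple[int, dict]]:
    """Yield (key, {"text": ...}) pairs from raw corpus or transcript lines."""
    key = 0
    for line in lines:
        # Strip a leading utterance ID (if any) and surrounding whitespace.
        text = UTT_ID.sub("", line).strip()
        # Drop empty lines and sentences longer than the threshold.
        if not text or len(text) > MAX_CHARS:
            continue
        yield key, {"text": text}
        key += 1


sample = [
    "1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER\n",
    "\n",
    "A PLAIN LM CORPUS LINE\n",
    "X" * 1001 + "\n",
]
examples = list(iter_examples(sample))
```

Lines from the external LM corpus carry no ID, so the substitution leaves them unchanged and the same function can serve both kinds of input.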

Usage

Use this module as a HuggingFace Datasets builder together with the LibriSpeech LM training recipe. It can be loaded through the datasets library, with a custom configuration for the LM corpus path where needed.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/LibriSpeech/LM/dataset.py

Signature

class LibrispeechLmConfig(datasets.BuilderConfig):
    """Builder config for LibriSpeech LM."""

    def __init__(self, **kwargs):
        self.lm_corpus_path = kwargs.pop("lm_corpus_path", None)
        super().__init__(**kwargs)

    def __post_init__(self):
        if self.lm_corpus_path is None:
            self.lm_corpus_path = _DL_URL
        ...

class LibrispeechLm(datasets.GeneratorBasedBuilder):
    """Librispeech language modeling dataset."""

    VERSION = datasets.Version("0.1.0")
    BUILDER_CONFIG_CLASS = LibrispeechLmConfig

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        ...

    def _generate_examples(self, archive_path):
        """Yields examples."""
        ...

Import

import datasets

# Load using HuggingFace datasets with the custom script
dataset = datasets.load_dataset("./dataset.py", data_files={"train": [...], "test": [...]})

I/O Contract

Inputs

Name	Type	Required	Description
lm_corpus_path	str	No	Path or URL to the LM corpus file (default: OpenSLR librispeech-lm-norm.txt.gz)
data_files	dict	Yes	Mapping of split names to transcript file paths
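
How the two inputs combine per split can be sketched as follows. This is a hedged illustration, not the builder's code: the helper name files_for_split and the parameter lm_corpus_file (standing for the local, already downloaded and extracted copy of the LM corpus) are hypothetical, and placing the LM corpus before the transcripts is an assumption about ordering. The only behavior taken from the description is that training splits read the external LM corpus in addition to their transcript files.

```python
from typing import Dict, List


def files_for_split(
    split: str,
    data_files: Dict[str, List[str]],
    lm_corpus_file: str,
) -> List[str]:
    """Return the text files read for one split (hypothetical helper)."""
    transcripts = list(data_files.get(split, []))
    if split == "train":
        # Only the training split also reads the external LM corpus.
        return [lm_corpus_file] + transcripts
    return transcripts


data_files = {
    "train": ["train-960-transcripts.txt"],
    "test": ["test-transcripts.txt"],
}
train_files = files_for_split("train", data_files, "librispeech-lm-norm.txt")
test_files = files_for_split("test", data_files, "librispeech-lm-norm.txt")
```

Under this sketch, evaluation splits see only their own transcripts, so held-out text is never mixed with the external corpus.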

Outputs

Name	Type	Description
text	str	A single line of text from the LM corpus or transcript (utterance IDs removed)

Usage Examples

import datasets

# Load with default LM corpus (auto-downloaded from OpenSLR)
lm_dataset = datasets.load_dataset(
    "./dataset.py",
    data_files={
        "train": ["train-960-transcripts.txt"],
        "test": ["test-transcripts.txt"],
    },
)

# Iterate over examples
for example in lm_dataset["train"]:
    text = example["text"]
    # Process text for LM training...

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
