Implementation:Speechbrain Speechbrain LibriSpeech LM Dataset
| Knowledge Sources | |
|---|---|
| Domains | Language_Modeling, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete HuggingFace Datasets builder, provided by SpeechBrain, that produces a text dataset for LibriSpeech language modeling.
Description
This module implements a custom HuggingFace Datasets builder for the LibriSpeech language modeling corpus. It extends the official HuggingFace implementation so that both the train-960 transcripts and the external LM corpus (librispeech-lm-norm.txt) can be used for language model training.
The LibrispeechLmConfig class exposes a configurable lm_corpus_path, which defaults to the OpenSLR download URL. The LibrispeechLm builder class handles downloading, extraction, and generation of text examples. For the training split, it concatenates the external LM corpus with the train transcript files. The generator removes utterance IDs from transcript lines, drops empty lines, and skips very long sentences (over 1000 characters) to prevent out-of-memory errors. Each yielded example consists of a single "text" field.
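The filtering steps described above can be sketched in plain Python. This is a minimal illustration, not the recipe's code: the function name `clean_lines`, the constant names, and the exact utterance-ID pattern are assumptions chosen to match the usual LibriSpeech ID shape (speaker-chapter-utterance).

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumed pattern for LibriSpeech utterance IDs such as "1089-134686-0000 ".
UTT_ID = re.compile(r"^\d+-\d+-\d+\s+")
# Length threshold taken from the description (sentences over 1000 characters).
MAX_CHARS = 1000


def clean_lines(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, {"text": ...}) pairs from raw corpus or transcript lines."""
    key = 0
    for line in lines:
        # Strip a leading utterance ID (if any) and surrounding whitespace.
        text = UTT_ID.sub("", line).strip()
        if not text:
            continue  # drop empty lines
        if len(text) > MAX_CHARS:
            continue  # skip very long sentences
        yield key, {"text": text}
        key += 1


raw = [
    "1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER\n",
    "\n",
    "A PLAIN LINE FROM THE EXTERNAL LM CORPUS\n",
    "X" * 1500 + "\n",
]
examples = list(clean_lines(raw))
```

Lines from the external LM corpus carry no utterance ID, so the substitution leaves them unchanged apart from whitespace trimming.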
Usage
Use this module as a HuggingFace Datasets builder together with the LibriSpeech LM training recipe. It is loaded through the datasets library by pointing load_dataset at the script, and the LM corpus path can be overridden through the builder configuration.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/LibriSpeech/LM/dataset.py
Signature
class LibrispeechLmConfig(datasets.BuilderConfig):
    """Builder config for LibriSpeech LM."""

    def __init__(self, **kwargs):
        self.lm_corpus_path = kwargs.pop("lm_corpus_path", None)
        super().__init__(**kwargs)

    def __post_init__(self):
        if self.lm_corpus_path is None:
            self.lm_corpus_path = _DL_URL
        ...


class LibrispeechLm(datasets.GeneratorBasedBuilder):
    """Librispeech language modeling dataset."""

    VERSION = datasets.Version("0.1.0")
    BUILDER_CONFIG_CLASS = LibrispeechLmConfig

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        ...

    def _generate_examples(self, archive_path):
        """Yields examples."""
        ...
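The body of `_split_generators` is elided in the signature, but the description states that the training split combines the external LM corpus with the train transcripts. A rough sketch of that path assembly is given here. The helper `build_split_paths` is hypothetical, and it assumes that non-training splits use only their own transcript files, which the source does not spell out.

```python
from typing import Dict, List


def build_split_paths(
    lm_corpus_path: str, data_files: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return the list of text files that each split would read from."""
    split_paths = {}
    for split, transcripts in data_files.items():
        if split == "train":
            # Training reads the external LM corpus first, then the transcripts.
            split_paths[split] = [lm_corpus_path] + list(transcripts)
        else:
            # Assumption: other splits read only their own transcript files.
            split_paths[split] = list(transcripts)
    return split_paths


paths = build_split_paths(
    "librispeech-lm-norm.txt",
    {"train": ["train-960-transcripts.txt"], "test": ["test-transcripts.txt"]},
)
```

In the real builder, each resulting list would be handed to `_generate_examples` as its `archive_path` argument.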
Import
import datasets
# Load using HuggingFace datasets with the custom script
dataset = datasets.load_dataset("./dataset.py", data_files={"train": [...], "test": [...]})
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lm_corpus_path | str | No | Path or URL to the LM corpus file (default: OpenSLR librispeech-lm-norm.txt.gz) |
| data_files | dict | Yes | Mapping of split names to transcript file paths |
Outputs
| Name | Type | Description |
|---|---|---|
| text | str | A single line of text from the LM corpus or transcript (utterance IDs removed) |
Usage Examples
import datasets

# Load with default LM corpus (auto-downloaded from OpenSLR)
lm_dataset = datasets.load_dataset(
    "./dataset.py",
    data_files={
        "train": ["train-960-transcripts.txt"],
        "test": ["test-transcripts.txt"],
    },
)

# Iterate over examples
for example in lm_dataset["train"]:
    text = example["text"]
    # Process text for LM training...
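As one possible downstream step, the "text" field of each example can be consumed like any iterable of dictionaries. The sketch below counts whitespace-separated words over a small in-memory sample that stands in for `lm_dataset["train"]`. The helper `count_words` and the sample sentences are illustrative only and are not part of the recipe.

```python
from collections import Counter
from typing import Dict, Iterable


def count_words(examples: Iterable[Dict[str, str]]) -> Counter:
    """Count whitespace-separated tokens across the "text" field of each example."""
    counts = Counter()
    for example in examples:
        counts.update(example["text"].split())
    return counts


# Stand-in for lm_dataset["train"]; real examples have the same {"text": ...} shape.
sample = [{"text": "THE CAT SAT"}, {"text": "THE DOG SAT DOWN"}]
word_counts = count_words(sample)
```

A frequency table like this is a common starting point for choosing a vocabulary or checking corpus coverage before tokenizer or LM training.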