Implementation:Speechbrain Speechbrain Prepare LibriSpeech LM

From Leeroopedia


Knowledge Sources
Domains Speech_Recognition, Data_Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the LibriSpeech dataset for automatic speech recognition and language modeling.

Description

This script creates CSV data manifest files for the LibriSpeech dataset, a large corpus of read English speech derived from LibriVox audiobooks. It processes audio files and transcription text, generates per-split CSV files with audio paths, durations, speaker IDs, and transcriptions, and supports merging multiple splits. It can also create a grapheme-to-phoneme lexicon. The script handles downloading data from OpenSLR if not already present and uses parallel processing for efficient preparation.
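As a rough illustration of the manifest-building step described above, the sketch below walks one split, reads its `*.trans.txt` transcript files, and writes one CSV row per utterance. It is not SpeechBrain's implementation: the helper name `build_manifest` is hypothetical, the column names (`ID`, `wav`, `spk_id`, `wrd`) are assumptions, and the duration column is omitted because computing it requires reading the audio.

```python
import csv
from pathlib import Path


def build_manifest(split_dir, csv_path):
    """Write a minimal CSV manifest for one LibriSpeech-style split.

    Hypothetical helper, not part of SpeechBrain. Assumes the layout
    <split>/<speaker>/<chapter>/<speaker>-<chapter>.trans.txt, where each
    transcript line is "<utterance_id> <TEXT>".
    """
    split_dir = Path(split_dir)
    rows = []
    for trans_file in sorted(split_dir.glob("*/*/*.trans.txt")):
        with open(trans_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # The first token is the utterance ID, the rest is the text.
                utt_id, _, text = line.partition(" ")
                # The speaker ID is the first field of "<spk>-<chapter>-<n>".
                spk_id = utt_id.split("-")[0]
                wav = trans_file.parent / f"{utt_id}.flac"
                rows.append([utt_id, str(wav), spk_id, text])

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "wav", "spk_id", "wrd"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```

The function returns the number of utterances written, which gives a quick sanity check against the expected size of the split.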

Usage

Use this when preparing the LibriSpeech dataset for ASR or language model training with SpeechBrain recipes.

Code Reference

Source Location

Signature

def prepare_librispeech(
    data_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    select_n_sentences=None,
    merge_lst=[],
    merge_name=None,
    create_lexicon=False,
    skip_prep=False,
):

Import

from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | str | Yes | Path to the folder where the original LibriSpeech dataset is stored
save_folder | str | Yes | Directory where the CSV files will be stored
tr_splits | list | No | List of train splits (e.g. ['train-clean-100', 'train-clean-360', 'train-other-500'])
dev_splits | list | No | List of dev splits (e.g. ['dev-clean', 'dev-other'])
te_splits | list | No | List of test splits (e.g. ['test-clean', 'test-other'])
select_n_sentences | int | No | If set, only use this many sentences (default: None)
merge_lst | list | No | List of splits to merge into a single CSV file
merge_name | str | No | Name of the merged CSV file
create_lexicon | bool | No | If True, output grapheme-to-phoneme mapping CSV files (default: False)
skip_prep | bool | No | If True, skip data preparation (default: False)

Outputs

Name | Type | Description
{split}.csv | CSV | Per-split manifest files with audio paths, durations, speaker IDs, and transcriptions
{merge_name}.csv | CSV | Optional merged CSV combining multiple splits
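To make the merged output more concrete, here is a minimal sketch of how several per-split manifests could be concatenated into one file. The helper `merge_manifests` is hypothetical and only approximates what `merge_lst` and `merge_name` are described as doing; it assumes all input files share the same header row.

```python
import csv


def merge_manifests(csv_paths, merged_path):
    """Concatenate CSV manifests that share the same header.

    Hypothetical helper, not SpeechBrain's code: it keeps the header of the
    first file and appends the data rows of every file in the given order.
    """
    header = None
    rows = []
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"Header mismatch in {path}")
            rows.extend(reader)

    if header is None:
        raise ValueError("No manifests were given")

    with open(merged_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Checking the headers before merging guards against silently mixing manifests that were produced with different column layouts.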

Usage Examples

from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech

prepare_librispeech(
    data_folder="/path/to/LibriSpeech",
    save_folder="/path/to/output",
    tr_splits=["train-clean-100", "train-clean-360", "train-other-500"],
    dev_splits=["dev-clean", "dev-other"],
    te_splits=["test-clean", "test-other"],
)
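After preparation, it can be useful to check a generated manifest before training. The sketch below counts utterances and sums their durations using only the standard library. The helper `summarize_manifest` is hypothetical, and the column name `duration` (in seconds) is an assumption; adjust it to match the header of the CSV actually produced.

```python
import csv


def summarize_manifest(csv_path, duration_column="duration"):
    """Return (number of utterances, total hours) for a CSV manifest.

    Hypothetical helper. Assumes the manifest has a header row and a
    duration column expressed in seconds.
    """
    n_utts = 0
    total_seconds = 0.0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            n_utts += 1
            total_seconds += float(row[duration_column])
    return n_utts, total_seconds / 3600.0
```

For example, `summarize_manifest("/path/to/output/train-clean-100.csv")` would report the size of that split, assuming the file exists at that location.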
