Implementation:Speechbrain Speechbrain Prepare LibriSpeech LM

Knowledge Sources	SpeechBrain
Domains	Speech_Recognition, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the LibriSpeech dataset for automatic speech recognition and language modeling.

Description

This script creates CSV data manifest files for the LibriSpeech dataset, a large corpus of read English speech derived from LibriVox audiobooks. It processes audio files and transcription text, generates per-split CSV files with audio paths, durations, speaker IDs, and transcriptions, and supports merging multiple splits. It can also create a grapheme-to-phoneme lexicon. The script handles downloading data from OpenSLR if not already present and uses parallel processing for efficient preparation.
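
As a rough illustration of the per-utterance processing described above, the sketch below walks one LibriSpeech split folder, reads each `*.trans.txt` transcript file, and pairs every utterance ID with its `.flac` path, speaker ID, and text. It relies only on the standard LibriSpeech layout (`<split>/<speaker>/<chapter>/`). The helper name `collect_utterances` is hypothetical and is not SpeechBrain's implementation; durations are left out because computing them needs an audio library.

```python
from pathlib import Path


def collect_utterances(split_dir):
    """Return (utt_id, wav_path, spk_id, text) tuples for one split.

    Hypothetical helper for illustration; not part of SpeechBrain.
    """
    rows = []
    # Each chapter folder holds one "<speaker>-<chapter>.trans.txt" file.
    for trans_file in sorted(Path(split_dir).glob("*/*/*.trans.txt")):
        for line in trans_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Line format: "<utt_id> <transcript text>"
            utt_id, _, text = line.strip().partition(" ")
            # The utterance ID is "<speaker>-<chapter>-<index>".
            spk_id = utt_id.split("-")[0]
            wav_path = trans_file.parent / f"{utt_id}.flac"
            rows.append((utt_id, str(wav_path), spk_id, text))
    return rows
```

Calling `collect_utterances("/path/to/LibriSpeech/dev-clean")` would yield one tuple per transcribed utterance, which is roughly the information a manifest row carries apart from the duration.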

Usage

Use this when preparing the LibriSpeech dataset for ASR or language model training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/LibriSpeech/LM/librispeech_prepare.py

Signature

def prepare_librispeech(
    data_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    select_n_sentences=None,
    merge_lst=[],
    merge_name=None,
    create_lexicon=False,
    skip_prep=False,
):

Import

from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the original LibriSpeech dataset is stored
save_folder	str	Yes	Directory where CSV files will be stored
tr_splits	list	No	List of train splits (e.g. ['train-clean-100', 'train-clean-360', 'train-other-500'])
dev_splits	list	No	List of dev splits (e.g. ['dev-clean', 'dev-other'])
te_splits	list	No	List of test splits (e.g. ['test-clean', 'test-other'])
select_n_sentences	int	No	If set, only use this many sentences (default: None)
merge_lst	list	No	List of splits to merge into a single CSV file
merge_name	str	No	Name of the merged CSV file
create_lexicon	bool	No	If True, output grapheme-to-phoneme mapping CSV files (default: False)
skip_prep	bool	No	If True, skip data preparation (default: False)

Outputs

Name	Type	Description
{split}.csv	CSV	Per-split manifest files with audio paths, durations, speaker IDs, and transcriptions
{merge_name}.csv	CSV	Optionally merged CSV combining multiple splits
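
The merged output can be pictured as a concatenation of per-split manifests that share one header. The sketch below shows that idea with the standard `csv` module. The function name `merge_manifests` is hypothetical, the sketch treats `merge_name` as the full output file name, and it assumes every manifest has an identical header row; it is not SpeechBrain's own merging code.

```python
import csv
from pathlib import Path


def merge_manifests(save_folder, split_names, merge_name):
    """Concatenate <split>.csv files into save_folder/merge_name.

    Hypothetical helper; assumes all manifests share the same header.
    """
    save_folder = Path(save_folder)
    header = None
    merged_rows = []
    for split in split_names:
        csv_path = save_folder / f"{split}.csv"
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            split_header = next(reader)
            if header is None:
                header = split_header
            elif split_header != header:
                raise ValueError(f"Header mismatch in {csv_path.name}")
            # Keep only the data rows; the header is written once below.
            merged_rows.extend(reader)
    if header is None:
        raise ValueError("split_names must not be empty")
    out_path = save_folder / merge_name
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(merged_rows)
    return out_path
```

For instance, `merge_manifests("/path/to/output", ["train-clean-100", "train-clean-360"], "train.csv")` would write a single `train.csv` next to the per-split files.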

Usage Examples

from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech

prepare_librispeech(
    data_folder="/path/to/LibriSpeech",
    save_folder="/path/to/output",
    tr_splits=["train-clean-100", "train-clean-360", "train-other-500"],
    dev_splits=["dev-clean", "dev-other"],
    te_splits=["test-clean", "test-other"],
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
