Implementation:Speechbrain Speechbrain Prepare LibriSpeech LM
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the SpeechBrain library, for preparing the LibriSpeech dataset for automatic speech recognition and language modeling.
Description
This script creates CSV data manifest files for the LibriSpeech dataset, a large corpus of read English speech derived from LibriVox audiobooks. It processes audio files and transcription text, generates per-split CSV files with audio paths, durations, speaker IDs, and transcriptions, and supports merging multiple splits. It can also create a grapheme-to-phoneme lexicon. The script handles downloading data from OpenSLR if not already present and uses parallel processing for efficient preparation.
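The merge step described above can be pictured as concatenating per-split manifests that share one header. The following is a minimal sketch of that idea, not the script's own code: `merge_manifests` is a hypothetical helper, and it assumes every manifest has identical columns.

```python
import csv
import io


def merge_manifests(manifests):
    """Concatenate CSV manifests that share the same header.

    `manifests` is an iterable of open text streams (hypothetical input
    shape); the merged CSV is returned as a single string.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for stream in manifests:
        reader = csv.reader(stream)
        current_header = next(reader)
        if header is None:
            # The first manifest defines the columns of the merged file.
            header = current_header
            writer.writerow(header)
        elif current_header != header:
            raise ValueError("manifests have different columns")
        # Copy the data rows; the header is written only once.
        writer.writerows(reader)
    return out.getvalue()
```

In practice the streams would come from `open("train-clean-100.csv", newline="")` and similar calls on the per-split files in `save_folder`.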
Usage
Use this when preparing the LibriSpeech dataset for ASR or language model training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/LibriSpeech/LM/librispeech_prepare.py
Signature
def prepare_librispeech(
    data_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    select_n_sentences=None,
    merge_lst=[],
    merge_name=None,
    create_lexicon=False,
    skip_prep=False,
):
Import
from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the original LibriSpeech dataset is stored |
| save_folder | str | Yes | Directory where CSV files will be stored |
| tr_splits | list | No | List of train splits (e.g. ['train-clean-100', 'train-clean-360', 'train-other-500']) |
| dev_splits | list | No | List of dev splits (e.g. ['dev-clean', 'dev-other']) |
| te_splits | list | No | List of test splits (e.g. ['test-clean', 'test-other']) |
| select_n_sentences | int | No | If set, only use this many sentences (default: None) |
| merge_lst | list | No | List of splits to merge into a single CSV file |
| merge_name | str | No | Name of the merged CSV file |
| create_lexicon | bool | No | If True, output grapheme-to-phoneme mapping CSV files (default: False) |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
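
Because the split arguments are plain strings, a typo only surfaces once the script looks for a folder that does not exist. A small guard can catch this earlier. This is a sketch under assumptions: `KNOWN_SPLITS` and `check_splits` are hypothetical names, not part of SpeechBrain, and the set lists the standard LibriSpeech split directory names.

```python
# Standard LibriSpeech split directory names (assumed complete for this sketch).
KNOWN_SPLITS = {
    "train-clean-100",
    "train-clean-360",
    "train-other-500",
    "dev-clean",
    "dev-other",
    "test-clean",
    "test-other",
}


def check_splits(*split_lists):
    """Return the sorted split names that are not standard LibriSpeech splits."""
    unknown = set()
    for splits in split_lists:
        unknown.update(name for name in splits if name not in KNOWN_SPLITS)
    return sorted(unknown)
```

Calling `check_splits(tr_splits, dev_splits, te_splits)` before `prepare_librispeech` and aborting when the result is non-empty keeps a misspelled split from reaching the preparation step.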
Outputs
| Name | Type | Description |
|---|---|---|
| {split}.csv | CSV | Per-split manifest files with audio paths, durations, speaker IDs, and transcriptions |
| {merge_name}.csv | CSV | Optionally merged CSV combining multiple splits |
Usage Examples
from recipes.LibriSpeech.LM.librispeech_prepare import prepare_librispeech

# Write one CSV manifest per listed split into save_folder.
prepare_librispeech(
    data_folder="/path/to/LibriSpeech",
    save_folder="/path/to/output",
    tr_splits=["train-clean-100", "train-clean-360", "train-other-500"],
    dev_splits=["dev-clean", "dev-other"],
    te_splits=["test-clean", "test-other"],
)