Implementation:Speechbrain Speechbrain Prepare Libriheavy

Knowledge Sources	SpeechBrain
Domains	Speech_Recognition, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the Libriheavy dataset for automatic speech recognition (ASR).

Description

This script creates CSV data manifest files for the Libriheavy dataset, a large-scale ASR corpus built on Libri-Light audio whose transcripts retain punctuation and casing. It reads compressed JSONL manifest files (JSONL.gz), filters segments by duration and word-count thresholds, and generates one CSV file per split containing audio paths, durations, start times, speaker IDs, and transcriptions. The script supports three training subsets: small (about 500 hours), medium (about 5,000 hours), and large (about 50,000 hours).
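The read, filter, and write steps described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's code: the flat record keys (`id`, `duration`, `start`, `wav`, `spk_id`, `text`), the CSV header, and the threshold values are all assumptions made for the example, and the real manifests may nest these fields differently.

```python
import csv
import gzip
import io
import json


def read_jsonl_gz(path_or_file):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path_or_file, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def keep_segment(seg, min_duration=1.0, max_duration=30.0, min_words=1):
    """Return True if a segment passes the duration and word-count filters.

    The default thresholds are placeholders, not the recipe's values.
    """
    n_words = len(seg["text"].split())
    return min_duration <= seg["duration"] <= max_duration and n_words >= min_words


def write_csv(segments, out_file):
    """Write the kept segments as CSV rows under a fixed (assumed) header."""
    fields = ["id", "duration", "start", "wav", "spk_id", "text"]
    writer = csv.writer(out_file)
    writer.writerow(["ID", "duration", "start", "wav", "spk_id", "text"])
    for seg in segments:
        writer.writerow([seg[name] for name in fields])


if __name__ == "__main__":
    # Build a tiny in-memory manifest so the sketch runs without any dataset.
    records = [
        {"id": "a", "duration": 5.0, "start": 0.0, "wav": "x.flac",
         "spk_id": "1", "text": "hello world"},
        {"id": "b", "duration": 0.2, "start": 5.0, "wav": "x.flac",
         "spk_id": "1", "text": "too short"},
    ]
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    manifest = io.BytesIO(gzip.compress(payload))

    kept = [seg for seg in read_jsonl_gz(manifest) if keep_segment(seg)]
    out = io.StringIO()
    write_csv(kept, out)
    print(out.getvalue())
```

Streaming the manifest line by line keeps memory use flat, which matters for the larger subsets where a manifest may hold millions of segments.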

Usage

Use this when preparing the Libriheavy dataset for ASR training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/Libriheavy/libriheavy_prepare.py

Signature

def prepare_libriheavy(
    data_folder,
    manifest_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    skip_prep=False,
    data_placeholder="data_root",
):

Import

from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the original Libri-Light dataset is stored
manifest_folder	str	Yes	Path to the folder where Libriheavy JSONL.gz manifest files are stored
save_folder	str	Yes	Directory where CSV files will be stored
tr_splits	list	No	Train splits to prepare (e.g. ['small'], ['medium'], or ['large'])
dev_splits	list	No	Dev splits to prepare
te_splits	list	No	Test splits to prepare (e.g. ['test_clean', 'test_others'])
skip_prep	bool	No	If True, skip data preparation (default: False)
data_placeholder	str	No	Placeholder string for the data root path in CSVs (default: "data_root")
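To illustrate what the `data_placeholder` parameter is for, the sketch below swaps a placeholder token in a path column for a concrete dataset root at load time. It assumes the placeholder appears verbatim as a prefix of a column named `wav`; the exact form the script writes, the column name, and the helper `resolve_placeholder` are hypothetical and should be checked against a generated CSV.

```python
import csv
import io


def resolve_placeholder(rows, column, placeholder, data_root):
    """Return copies of the rows with the first `placeholder` in `column` replaced."""
    resolved = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        row[column] = row[column].replace(placeholder, data_root, 1)
        resolved.append(row)
    return resolved


# Hypothetical CSV content; real column names and path layout may differ.
example_csv = io.StringIO(
    "ID,wav\n"
    "utt1,data_root/small/100/book/utt1.flac\n"
)
rows = list(csv.DictReader(example_csv))
resolved = resolve_placeholder(rows, "wav", "data_root", "/datasets/libri-light")
print(resolved[0]["wav"])  # /datasets/libri-light/small/100/book/utt1.flac
```

Keeping a placeholder in the manifests rather than an absolute path means the same CSV files can be reused when the audio is moved to another machine or mount point.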

Outputs

Name	Type	Description
{split}.csv	CSV	Per-split manifest files with audio paths, durations, start times, speaker IDs, and transcriptions
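A quick sanity check on a generated split is to sum its durations and compare the total with the expected subset size. The sketch below assumes the CSV has a column named `duration` holding seconds; both the column name and the helper `total_hours` are assumptions for illustration.

```python
import csv
import io


def total_hours(csv_file, duration_column="duration"):
    """Sum the duration column (assumed to be in seconds) and return hours."""
    total_seconds = sum(
        float(row[duration_column]) for row in csv.DictReader(csv_file)
    )
    return total_seconds / 3600.0


# In-memory stand-in for a generated split file.
sample = io.StringIO("ID,duration\nutt1,1800\nutt2,1800\n")
print(total_hours(sample))  # 1.0
```

For a real split, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object to `total_hours`.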

Usage Examples

from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy

prepare_libriheavy(
    data_folder="/path/to/libri-light",
    manifest_folder="/path/to/libriheavy",
    save_folder="/path/to/output",
    tr_splits=["small"],
    dev_splits=["dev"],
    te_splits=["test_clean", "test_others"],
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
