Implementation:Speechbrain Speechbrain Prepare Libriheavy

From Leeroopedia


Knowledge Sources
Domains Speech_Recognition, Data_Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the SpeechBrain library, for preparing the Libriheavy dataset for automatic speech recognition (ASR).

Description

This script creates CSV data manifest files for the Libriheavy dataset, a large-scale ASR corpus built on Libri-Light audio whose transcriptions preserve punctuation and casing. It reads compressed JSONL manifest files, filters segments by duration and word-count thresholds, and generates one CSV file per split containing audio paths, durations, start times, speaker IDs, and transcriptions. The script supports three training subsets: small (about 500 hours), medium (about 5,000 hours), and large (about 50,000 hours).
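The read, filter, and write flow described above can be sketched with the standard library alone. This is a minimal illustration, not the SpeechBrain implementation: the record keys (`id`, `start`, `duration`, `speaker`, `text`, `audio_path`), the CSV column names, the threshold defaults, and the function name `manifest_to_csv` are all hypothetical and chosen only for the example.

```python
import csv
import gzip
import json


def manifest_to_csv(manifest_path, csv_path, min_dur=1.0, max_dur=30.0, min_words=1):
    """Filter a gzipped JSONL manifest and write the kept segments to a CSV.

    All record keys, column names, and threshold defaults here are
    hypothetical; the real Libriheavy manifest schema may differ.
    Returns the number of segments written.
    """
    kept = 0
    with gzip.open(manifest_path, "rt", encoding="utf-8") as fin, open(
        csv_path, "w", newline="", encoding="utf-8"
    ) as fout:
        writer = csv.writer(fout)
        writer.writerow(["ID", "duration", "start", "wav", "spk_id", "wrd"])
        for line in fin:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the manifest
            seg = json.loads(line)
            # Drop segments outside the allowed duration range.
            if not (min_dur <= seg["duration"] <= max_dur):
                continue
            # Drop segments whose transcription has too few words.
            if len(seg["text"].split()) < min_words:
                continue
            writer.writerow(
                [
                    seg["id"],
                    seg["duration"],
                    seg["start"],
                    seg["audio_path"],
                    seg["speaker"],
                    seg["text"],
                ]
            )
            kept += 1
    return kept
```

Streaming the manifest line by line keeps memory use flat, which matters for the larger subsets, where a manifest can hold millions of segments.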

Usage

Use this when preparing the Libriheavy dataset for ASR training with SpeechBrain recipes.

Code Reference

Source Location

Signature

def prepare_libriheavy(
    data_folder,
    manifest_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    skip_prep=False,
    data_placeholder="data_root",
):

Import

from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data_folder | str | Yes | Path to the folder where the original Libri-Light dataset is stored |
| manifest_folder | str | Yes | Path to the folder where the Libriheavy JSONL.gz manifest files are stored |
| save_folder | str | Yes | Directory where the CSV files will be stored |
| tr_splits | list | No | Train splits to prepare (e.g. ['small'], ['medium'], or ['large']) |
| dev_splits | list | No | Dev splits to prepare |
| te_splits | list | No | Test splits to prepare (e.g. ['test_clean', 'test_others']) |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
| data_placeholder | str | No | Placeholder string for the data root path in CSVs (default: "data_root") |
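To illustrate what `data_placeholder` is for, the sketch below swaps a leading placeholder token in a stored path for a real directory. It assumes the placeholder appears as the first path component, written either as `data_root/...` or `$data_root/...`; the exact form the script writes is not stated here, and `resolve_placeholder` is a hypothetical helper, not part of SpeechBrain. In practice, SpeechBrain's CSV loading accepts a `replacements` mapping that serves this purpose.

```python
def resolve_placeholder(path, data_root, placeholder="data_root"):
    """Replace a leading placeholder token in a manifest path with data_root.

    Assumption: the placeholder is the first path component, optionally
    prefixed with "$". Paths without the placeholder are returned unchanged.
    """
    for token in ("$" + placeholder, placeholder):
        prefix = token + "/"
        if path.startswith(prefix):
            # Join with a forward slash so the result is platform-independent.
            return data_root.rstrip("/") + "/" + path[len(prefix):]
    return path
```

Storing a placeholder instead of an absolute path keeps the generated CSVs portable: the same files can be reused after the audio is moved to another machine or mount point.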

Outputs

| Name | Type | Description |
|------|------|-------------|
| {split}.csv | CSV | Per-split manifest files with audio paths, durations, start times, speaker IDs, and transcriptions |
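A quick sanity check on a generated manifest is to total its duration and count its speakers. The sketch below does this with the standard `csv` module. The column names `duration` and `spk_id` are assumptions (pass the real header names if they differ), and `summarize_manifest` is a hypothetical helper.

```python
import csv


def summarize_manifest(csv_path, duration_col="duration", speaker_col="spk_id"):
    """Return segment count, total hours, and speaker count for a manifest CSV.

    The default column names are assumptions about the CSV header.
    """
    total_seconds = 0.0
    speakers = set()
    segments = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total_seconds += float(row[duration_col])
            speakers.add(row[speaker_col])
            segments += 1
    return {
        "segments": segments,
        "hours": total_seconds / 3600.0,
        "speakers": len(speakers),
    }
```

Comparing the reported hours against the nominal size of the chosen subset gives a rough indication of how much audio the duration and word-count filters removed.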

Usage Examples

from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy

prepare_libriheavy(
    data_folder="/path/to/libri-light",
    manifest_folder="/path/to/libriheavy",
    save_folder="/path/to/output",
    tr_splits=["small"],
    dev_splits=["dev"],
    te_splits=["test_clean", "test_others"],
)
