Implementation:Speechbrain Speechbrain Prepare Libriheavy
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the Libriheavy dataset for automatic speech recognition (ASR).
Description
This script creates CSV data manifest files for the Libriheavy dataset, a large-scale ASR corpus built on Libri-Light audio with heavy transcription annotations. It reads compressed JSONL manifest files, filters segments by duration and word-count thresholds, and writes one CSV file per split containing audio paths, durations, start times, speaker IDs, and transcriptions. The script supports three training subsets: small (approximately 500 hours), medium (approximately 5,000 hours), and large (approximately 50,000 hours).
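The read, filter, and write steps described above can be sketched with the standard library alone. This is a minimal illustration, not the script's actual implementation: the record keys (`id`, `duration`, `start`, `audio`, `speaker`, `text`), the CSV column names, the threshold values, and the `$`-prefixed placeholder syntax are all assumptions made for the example, and the real Libriheavy manifest schema is likely more nested.

```python
import csv
import gzip
import json


def filter_segments(manifest_path, min_duration=1.0, max_duration=30.0, min_words=1):
    """Yield records from a .jsonl.gz manifest that pass the thresholds.

    Assumes one JSON object per line with hypothetical keys "duration"
    (seconds) and "text"; the threshold defaults are illustrative only.
    """
    with gzip.open(manifest_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_words = len(record["text"].split())
            if min_duration <= record["duration"] <= max_duration and n_words >= min_words:
                yield record


def write_csv(records, csv_path, data_placeholder="data_root"):
    """Write records to a CSV manifest and return the number of rows written.

    The audio path is stored relative to a placeholder (assumed here to be
    written as "$data_root") so the CSV does not hard-code the dataset location.
    """
    fields = ["ID", "duration", "start", "wav", "spk_id", "text"]
    n_rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in records:
            writer.writerow(
                {
                    "ID": r["id"],
                    "duration": r["duration"],
                    "start": r["start"],
                    "wav": "$" + data_placeholder + "/" + r["audio"],
                    "spk_id": r["speaker"],
                    "text": r["text"],
                }
            )
            n_rows += 1
    return n_rows
```

Because `filter_segments` is a generator, records are streamed one at a time, which matters for the larger subsets where the manifest may not fit comfortably in memory.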
Usage
Use this when preparing the Libriheavy dataset for ASR training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/Libriheavy/libriheavy_prepare.py
Signature
def prepare_libriheavy(
    data_folder,
    manifest_folder,
    save_folder,
    tr_splits=[],
    dev_splits=[],
    te_splits=[],
    skip_prep=False,
    data_placeholder="data_root",
):
Import
from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the original Libri-Light dataset is stored |
| manifest_folder | str | Yes | Path to the folder where Libriheavy JSONL.gz manifest files are stored |
| save_folder | str | Yes | Directory where CSV files will be stored |
| tr_splits | list | No | Train splits to prepare (e.g. ['small'], ['medium'], or ['large']) |
| dev_splits | list | No | Dev splits to prepare |
| te_splits | list | No | Test splits to prepare (e.g. ['test_clean', 'test_others']) |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
| data_placeholder | str | No | Placeholder string for the data root path in CSVs (default: "data_root") |
Outputs
| Name | Type | Description |
|---|---|---|
| {split}.csv | CSV | Per-split manifest files with audio paths, durations, start times, speaker IDs, and transcriptions |
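To consume a generated manifest, the placeholder in the audio path has to be replaced with the real dataset location. The sketch below shows one way to do that with the standard library. It assumes the placeholder appears as `$data_root` and that the audio path lives in a column named `wav`; both are assumptions for illustration, so check the header of the generated CSV before relying on them.

```python
import csv
from string import Template


def load_manifest(csv_path, data_folder, data_placeholder="data_root", path_column="wav"):
    """Read a manifest CSV and resolve the placeholder in the audio path column.

    `path_column` is a hypothetical column name; other columns are returned
    unchanged so that transcriptions are never modified.
    """
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # safe_substitute leaves unknown placeholders untouched instead of raising
            row[path_column] = Template(row[path_column]).safe_substitute(
                {data_placeholder: data_folder}
            )
            rows.append(row)
    return rows
```

Keeping the placeholder in the CSV and resolving it at load time means the same manifests can be reused when the dataset is moved to a different disk or machine.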
Usage Examples
from recipes.Libriheavy.libriheavy_prepare import prepare_libriheavy

prepare_libriheavy(
    data_folder="/path/to/libri-light",
    manifest_folder="/path/to/libriheavy",
    save_folder="/path/to/output",
    tr_splits=["small"],
    dev_splits=["dev"],
    te_splits=["test_clean", "test_others"],
)
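After a call like the one above, it can be useful to confirm that one CSV was written per requested split. The helpers below are hypothetical and not part of SpeechBrain; they only assume the `{split}.csv` naming described in the Outputs table.

```python
import os


def expected_csv_paths(save_folder, tr_splits=(), dev_splits=(), te_splits=()):
    """Return the CSV path expected for each requested split, in order."""
    splits = list(tr_splits) + list(dev_splits) + list(te_splits)
    return [os.path.join(save_folder, split + ".csv") for split in splits]


def missing_csvs(save_folder, **splits):
    """Return the expected CSV paths that do not exist on disk."""
    return [p for p in expected_csv_paths(save_folder, **splits) if not os.path.isfile(p)]
```

An empty list from `missing_csvs("/path/to/output", tr_splits=["small"], dev_splits=["dev"])` would indicate that both manifests are present.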