Implementation:Speechbrain Speechbrain Prepare PeoplesSpeech

Knowledge Sources	SpeechBrain
Domains	Speech_Recognition, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the People's Speech dataset for automatic speech recognition (ASR).

Description

This script creates CSV data manifest files for the People's Speech dataset, a large-scale English ASR corpus from MLCommons. Unlike typical SpeechBrain data preparation, it relies exclusively on HuggingFace Datasets for data access: audio is read directly from the dataset shards rather than being extracted. The generated CSV files contain transcriptions and durations, are intended primarily for debugging and monitoring, and are not strictly required to run the training recipe. The script supports multiple data subsets (clean, clean_sa, dirty, dirty_sa), which can be combined by passing several names in the subsets list.
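To make the manifest idea concrete, the following is a minimal sketch of writing one CSV row per utterance. It is not the script's actual code: the record fields (id, duration_ms, text), the output column names (ID, duration, wrd), and the helper write_manifest are all assumptions chosen for illustration, and the records are hard-coded stand-ins for rows that would come from the HuggingFace shards.

```python
import csv
import io

# Hypothetical in-memory records standing in for rows read from the
# HuggingFace shards; the field names are assumptions, not confirmed.
records = [
    {"id": "utt_0001", "duration_ms": 2500, "text": "hello world"},
    {"id": "utt_0002", "duration_ms": 4000, "text": "people's speech"},
]


def write_manifest(rows, handle):
    """Write one CSV row per utterance: ID, duration in seconds, transcript."""
    writer = csv.writer(handle)
    writer.writerow(["ID", "duration", "wrd"])  # assumed column names
    for row in rows:
        # Convert the assumed millisecond field to seconds.
        writer.writerow([row["id"], row["duration_ms"] / 1000.0, row["text"]])


# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
write_manifest(records, buffer)
manifest_text = buffer.getvalue()
```

Because the manifest only carries metadata, a sketch like this never touches the audio itself, which is consistent with the description above of audio being read from the shards at training time.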

Usage

Use this when preparing the People's Speech dataset for ASR training with SpeechBrain recipes.

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/PeoplesSpeech/ASR/transformer/peoples_speech_prepare.py

Signature

def prepare_peoples_speech(
    hf_download_folder: str,
    save_folder: str,
    subsets: list,
    skip_prep: bool = False,
) -> None:

Import

from recipes.PeoplesSpeech.ASR.transformer.peoples_speech_prepare import prepare_peoples_speech

I/O Contract

Inputs

Name	Type	Required	Description
hf_download_folder	str	Yes	Path where HuggingFace stored the dataset (should match HF_HUB_CACHE env variable)
save_folder	str	Yes	Path to the folder where CSV files will be saved
subsets	list	Yes	Target subsets to process (e.g. ['clean'], ['clean', 'dirty'])
skip_prep	bool	No	If True, skip data preparation (default: False)
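Since hf_download_folder is expected to match the HF_HUB_CACHE environment variable, a small pre-flight check can catch a mismatch before a long run. The helper below, cache_matches_env, is hypothetical and not part of SpeechBrain or huggingface_hub; it is a sketch that simply compares the two paths after normalization.

```python
import os


def cache_matches_env(hf_download_folder, env=None):
    """Hypothetical check: does the folder equal the HF_HUB_CACHE setting?

    `env` defaults to the process environment; passing a plain dict makes
    the check easy to exercise without modifying real variables.
    """
    env = os.environ if env is None else env
    configured = env.get("HF_HUB_CACHE")
    if configured is None:
        # Nothing to compare against, so report a mismatch.
        return False

    def normalize(path):
        # Expand "~" and drop trailing slashes or redundant separators.
        return os.path.normpath(os.path.expanduser(path))

    return normalize(hf_download_folder) == normalize(configured)
```

For example, cache_matches_env("/path/to/hf_cache") could be called before prepare_peoples_speech, with a warning logged when it returns False.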

Outputs

Name	Type	Description
train.csv	CSV	Training split manifest with audio IDs, durations, and transcriptions
dev.csv	CSV	Development split manifest
test.csv	CSV	Test split manifest
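As an illustration of the monitoring use these manifests are meant for, the sketch below sums the durations in a manifest to report total hours of audio. The column name duration, the assumption that it is expressed in seconds, and the helper total_hours are illustrative guesses rather than confirmed details of the generated files; the sample data is made up.

```python
import csv
import io


def total_hours(handle, duration_column="duration"):
    """Sum an assumed per-utterance duration column (seconds) into hours."""
    reader = csv.DictReader(handle)
    return sum(float(row[duration_column]) for row in reader) / 3600.0


# Made-up sample standing in for a generated manifest such as train.csv.
sample = io.StringIO(
    "ID,duration,wrd\n"
    "utt_0001,1800.0,hello\n"
    "utt_0002,5400.0,world\n"
)
hours = total_hours(sample)
```

With a real manifest, the same function could be applied to an opened file handle, for instance open("/path/to/output/train.csv", newline="").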

Usage Examples

from recipes.PeoplesSpeech.ASR.transformer.peoples_speech_prepare import prepare_peoples_speech

prepare_peoples_speech(
    hf_download_folder="/path/to/hf_cache",
    save_folder="/path/to/output",
    subsets=["clean"],
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
