Implementation:Speechbrain Speechbrain Prepare PeoplesSpeech
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the People's Speech dataset for automatic speech recognition (ASR).
Description
This script creates CSV data manifest files for the People's Speech dataset, a large-scale English ASR corpus from MLCommons. Unlike typical SpeechBrain data preparation, this script relies exclusively on HuggingFace Datasets for data access: audio is read directly from the dataset shards rather than extracted to individual files. The generated CSV files contain transcriptions and durations, primarily for debugging and monitoring, and are not strictly required to run the training recipe. The script supports multiple data subsets (clean, clean_sa, dirty, dirty_sa), which can be combined.
Usage
Use this when preparing the People's Speech dataset for ASR training with SpeechBrain recipes.
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/PeoplesSpeech/ASR/transformer/peoples_speech_prepare.py
Signature
def prepare_peoples_speech(
    hf_download_folder: str,
    save_folder: str,
    subsets: list,
    skip_prep: bool = False,
) -> None:
Import
from recipes.PeoplesSpeech.ASR.transformer.peoples_speech_prepare import prepare_peoples_speech
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_download_folder | str | Yes | Path where HuggingFace stored the dataset (should match HF_HUB_CACHE env variable) |
| save_folder | str | Yes | Path to the folder where CSV files will be saved |
| subsets | list | Yes | Target subsets to process (e.g. ['clean'], ['clean', 'dirty']) |
| skip_prep | bool | No | If True, skip data preparation (default: False) |
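Since `hf_download_folder` is expected to match the `HF_HUB_CACHE` environment variable, a small pre-flight check may help catch a mismatch before preparation starts. The following is a minimal sketch and not part of the SpeechBrain recipe; `check_hf_cache_matches` is a hypothetical helper name introduced here for illustration.

```python
import os


def check_hf_cache_matches(hf_download_folder: str) -> bool:
    """Return True if HF_HUB_CACHE is set and points at hf_download_folder.

    Hypothetical helper for illustration; not part of the SpeechBrain recipe.
    """
    cache = os.environ.get("HF_HUB_CACHE")
    if cache is None:
        # The variable is unset, so there is nothing to compare against.
        return False
    # Compare normalized absolute paths so trailing slashes do not matter.
    return os.path.abspath(cache) == os.path.abspath(hf_download_folder)
```

A caller could run this check before invoking `prepare_peoples_speech` and log a warning when it returns `False`.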
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV | Training split manifest with audio IDs, durations, and transcriptions |
| dev.csv | CSV | Development split manifest |
| test.csv | CSV | Test split manifest |
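Because the manifests are mainly intended for debugging and monitoring, one plausible use is a quick check of how much audio a split contains. The sketch below assumes the CSV has a header row with a `duration` column expressed in seconds; that column name is an assumption and should be checked against the actual file. `total_duration_hours` is a hypothetical helper, not a SpeechBrain function.

```python
import csv


def total_duration_hours(csv_path: str, duration_column: str = "duration") -> float:
    """Sum a per-utterance duration column (in seconds) and return hours.

    The default column name "duration" is an assumption; verify it against
    the header of the generated manifest.
    """
    total_seconds = 0.0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total_seconds += float(row[duration_column])
    return total_seconds / 3600.0
```

For example, `total_duration_hours("/path/to/output/train.csv")` would report the total hours listed in the training manifest, provided the column assumption holds.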
Usage Examples
from recipes.PeoplesSpeech.ASR.transformer.peoples_speech_prepare import prepare_peoples_speech
prepare_peoples_speech(
    hf_download_folder="/path/to/hf_cache",
    save_folder="/path/to/output",
    subsets=["clean"],
)