Implementation:Speechbrain Speechbrain Prepare Switchboard
| Knowledge Sources | |
|---|---|
| Domains | ASR, Data_Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the SpeechBrain library, for preparing the Switchboard-1 Release 2 corpus (LDC97S62) for ASR training.
Description
This script prepares data from the Switchboard-1 Release 2 corpus for use in SpeechBrain ASR recipes. It parses the Switchboard transcription format, segments conversations into utterances, and generates CSV manifests for train/dev/test splits. Optionally, the Fisher corpus transcripts (LDC2004T19 and LDC2005T19) can be merged for tokenizer and language model training. The test set is derived from the eval2000/Hub5 data (LDC2002S09/LDC2002T43).
Usage
Use this script to prepare Switchboard data before running any Switchboard ASR, language model, or tokenizer training recipe. The script supports configurable train/dev splits and optional Fisher corpus integration.
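To make the idea of a configurable train/dev split concrete, the following is a minimal sketch of partitioning a list of items by ratio. The helper `split_by_ratio` is hypothetical and is not taken from `switchboard_prepare.py`; the actual script may order, shuffle, or round its splits differently.

```python
def split_by_ratio(items, ratios):
    """Partition `items` into consecutive chunks proportional to `ratios`.

    Hypothetical helper for illustration only; not part of SpeechBrain.
    """
    if not ratios or any(r < 0 for r in ratios) or sum(ratios) <= 0:
        raise ValueError("ratios must be non-negative and sum to a positive value")
    total = sum(ratios)
    chunks, start, cumulative = [], 0, 0
    for ratio in ratios:
        cumulative += ratio
        # Round the cumulative boundary so the chunks always cover every item.
        end = round(len(items) * cumulative / total)
        chunks.append(items[start:end])
        start = end
    return chunks


# Example: ten placeholder conversation IDs split 90/10 into train and dev.
train_ids, dev_ids = split_by_ratio([f"conv{i}" for i in range(10)], [90, 10])
```

Working with cumulative boundaries, rather than rounding each chunk size separately, avoids dropping or duplicating items when the ratios do not divide the list evenly.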
Code Reference
Source Location
- Repository: SpeechBrain
- File: recipes/Switchboard/ASR/seq2seq/switchboard_prepare.py
Signature
def prepare_switchboard(
    data_folder,
    save_folder,
    splits=None,
    split_ratio=None,
    merge_lst=None,
    merge_name=None,
    skip_prep=False,
    add_fisher_corpus=False,
    max_utt=300,
):
Import
from switchboard_prepare import prepare_switchboard
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the Switchboard (and optionally Fisher) datasets are stored |
| save_folder | str | Yes | The directory to store the output CSV files |
| splits | list | No | List of data splits to generate (default: ["train", "dev"]) |
| split_ratio | list | No | Ratios for splitting the data into the specified splits |
| merge_lst | list | No | List of CSV files to merge together |
| merge_name | str | No | Name for the merged output CSV file |
| skip_prep | bool | No | If True, skips data preparation (default: False) |
| add_fisher_corpus | bool | No | If True, adds Fisher corpus transcripts to the CSVs (default: False) |
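| max_utt | int | No | Maximum number of times an identical transcription may appear in a dataset; excess utterances (for example, frequent "uh-huh" backchannels) are removed (default: 300) |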
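In the upstream script's docstring, as we understand it, `max_utt` is described as a cap on how many times an identical transcription may appear in a dataset, which keeps very frequent backchannels such as "uh-huh" from dominating training. The following is a minimal sketch of that filtering idea. The function name `cap_repeated_transcripts` and the `(utt_id, words)` tuple layout are assumptions made for illustration, not the script's actual implementation.

```python
from collections import Counter


def cap_repeated_transcripts(utterances, max_utt=300):
    """Keep at most `max_utt` utterances per identical transcription.

    `utterances` is an iterable of (utt_id, words) pairs. This is a
    hypothetical helper, not the function used by switchboard_prepare.py.
    """
    seen = Counter()
    kept = []
    for utt_id, words in utterances:
        seen[words] += 1
        # Drop the utterance once its transcription has hit the cap.
        if seen[words] <= max_utt:
            kept.append((utt_id, words))
    return kept


# Example with a small cap so the effect is visible.
sample = [("a", "uh-huh"), ("b", "uh-huh"), ("c", "hello"), ("d", "uh-huh")]
filtered = cap_repeated_transcripts(sample, max_utt=2)
```

With `max_utt=2`, the third occurrence of "uh-huh" is dropped while the first two occurrences and the unique "hello" utterance are kept in their original order.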
Outputs
| Name | Type | Description |
|---|---|---|
| train.csv | CSV file | Training set manifest with utterance-level segmentation |
| dev.csv | CSV file | Development set manifest |
| test.csv | CSV file | Test set manifest from eval2000/Hub5 data |
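Once the manifests exist, a quick inspection can confirm that they were written as expected. The sketch below uses only the Python standard library and makes no assumption about which columns the script emits; `load_manifest` is a hypothetical helper name, not a SpeechBrain function.

```python
import csv


def load_manifest(csv_path):
    """Return (column_names, rows) for a CSV manifest; each row is a dict.

    Hypothetical helper for illustration; it reads whatever header the
    file contains rather than assuming specific column names.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return reader.fieldnames, rows
```

Calling `load_manifest("/path/to/output/train.csv")` and printing the returned column names and the number of rows gives a simple sanity check before starting a training run.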
Usage Examples
from switchboard_prepare import prepare_switchboard

# Basic preparation
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
)

# With Fisher corpus for LM training
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
    add_fisher_corpus=True,
    merge_lst=["train.csv", "fisher.csv"],
    merge_name="train_fisher.csv",
)