Implementation:Speechbrain Speechbrain Prepare Switchboard

Knowledge Sources	SpeechBrain
Domains	ASR, Data_Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the Switchboard-1 Release 2 corpus (LDC97S62) for ASR training.

Description

This script prepares data from the Switchboard-1 Release 2 corpus for use in SpeechBrain ASR recipes. It parses the Switchboard transcription format, segments conversations into utterances, and generates CSV manifests for train/dev/test splits. Optionally, the Fisher corpus transcripts (LDC2004T19 and LDC2005T19) can be merged for tokenizer and language model training. The test set is derived from the eval2000/Hub5 data (LDC2002S09/LDC2002T43).
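The transcript-parsing step described above can be illustrated with a minimal sketch. It assumes each transcript line has the layout `<utterance id> <start> <stop> <words...>` separated by whitespace. The `Utterance` type, the `parse_transcript_line` helper, and the sample line are hypothetical and are not taken from `switchboard_prepare.py`, which may parse and normalize transcripts differently.

```python
from typing import NamedTuple, Optional


class Utterance(NamedTuple):
    """One segmented utterance (hypothetical container for illustration)."""

    utt_id: str
    start: float  # start time in seconds
    stop: float   # end time in seconds
    words: str


def parse_transcript_line(line: str) -> Optional[Utterance]:
    """Parse one '<utt_id> <start> <stop> <words...>' line (assumed layout).

    Returns None when the line carries no transcription text.
    """
    parts = line.strip().split(maxsplit=3)
    if len(parts) < 4:
        return None
    utt_id, start, stop, words = parts
    return Utterance(utt_id, float(start), float(stop), words)


# Illustrative input line; the real corpus files may differ in detail.
example = "sw2001A-ms98-a-0002 0.977625 11.561375 hi um yeah"
utt = parse_transcript_line(example)
```

Keeping the start and stop times alongside the text is what allows each conversation to be cut into utterance-level rows in a manifest.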

Usage

Use this script to prepare Switchboard data before running any Switchboard ASR, language model, or tokenizer training recipe. The script supports configurable train/dev splits and optional Fisher corpus integration.
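As a rough illustration of what a configurable train/dev split involves, the sketch below partitions a list of conversation IDs according to a list of ratios. The `split_by_ratio` helper, the fixed seed, and the `[90, 10]` ratio are assumptions chosen for the example; the actual splitting logic in the script may differ.

```python
import random


def split_by_ratio(items, ratios, seed=1234):
    """Shuffle items deterministically and cut them into len(ratios) parts."""
    total = sum(ratios)
    if not ratios or total <= 0:
        raise ValueError("ratios must be a non-empty list with a positive sum")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    parts = []
    start = 0
    for index, ratio in enumerate(ratios):
        if index == len(ratios) - 1:
            end = len(shuffled)  # last part takes whatever remains
        else:
            end = start + round(len(shuffled) * ratio / total)
        parts.append(shuffled[start:end])
        start = end
    return parts


# Hypothetical conversation IDs, split 90% / 10%.
conversations = [f"conv_{i:03d}" for i in range(100)]
train, dev = split_by_ratio(conversations, [90, 10])
```

Splitting at the conversation level rather than the utterance level is one way to keep utterances from the same call out of both sets.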

Code Reference

Source Location

Repository: SpeechBrain
File: recipes/Switchboard/ASR/seq2seq/switchboard_prepare.py

Signature

def prepare_switchboard(
    data_folder,
    save_folder,
    splits=None,
    split_ratio=None,
    merge_lst=None,
    merge_name=None,
    skip_prep=False,
    add_fisher_corpus=False,
    max_utt=300,
):

Import

from switchboard_prepare import prepare_switchboard

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	str	Yes	Path to the folder where the Switchboard (and optionally Fisher) datasets are stored
save_folder	str	Yes	The directory to store the output CSV files
splits	list	No	List of data splits to generate (default: ["train", "dev"])
split_ratio	list	No	Ratios for splitting the data into the specified splits
merge_lst	list	No	List of CSV files to merge together
merge_name	str	No	Name for the merged output CSV file
skip_prep	bool	No	If True, skips data preparation (default: False)
add_fisher_corpus	bool	No	If True, adds Fisher corpus transcripts to the CSVs (default: False)
max_utt	int	No	Maximum number of utterances that may share the same transcription within a data set; excess repeats (e.g. "uh-huh") are removed (default: 300)

Outputs

Name	Type	Description
train.csv	CSV file	Training set manifest with utterance-level segmentation
dev.csv	CSV file	Development set manifest
test.csv	CSV file	Test set manifest from eval2000/Hub5 data
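Once the manifests exist, a quick sanity check is to count rows and sum durations. The sketch below uses only the standard library. The `summarize_manifest` helper is hypothetical, and the column names in the sample (`ID`, `duration`, `wav`, `words`) are assumptions; inspect the header of the generated files for the real column set.

```python
import csv
import io


def summarize_manifest(handle):
    """Return (row_count, total_duration) for a CSV manifest.

    total_duration is None when the file has no 'duration' column.
    """
    reader = csv.DictReader(handle)
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    if "duration" not in fieldnames:
        return len(rows), None
    total = sum(float(row["duration"]) for row in rows)
    return len(rows), total


# Hypothetical manifest content; the real column set may differ.
sample = io.StringIO(
    "ID,duration,wav,words\n"
    "utt_0001,2.5,/path/to/a.wav,hello there\n"
    "utt_0002,1.5,/path/to/b.wav,yeah\n"
)
n_rows, total_duration = summarize_manifest(sample)
```

For a real file, pass an open handle instead, for example `open("/path/to/output/train.csv", newline="")`.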

Usage Examples

from switchboard_prepare import prepare_switchboard

# Basic preparation
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
)

# With Fisher corpus for LM training
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
    add_fisher_corpus=True,
    merge_lst=["train.csv", "fisher.csv"],
    merge_name="train_fisher.csv",
)

Related Pages

Principle:Speechbrain_Speechbrain_Dataset_Specific_Data_Preparation
