Implementation:Speechbrain Speechbrain Prepare Switchboard



Knowledge Sources
Domains ASR, Data_Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the SpeechBrain library, for preparing the Switchboard-1 Release 2 corpus (LDC97S62) for ASR training.

Description

This script prepares data from the Switchboard-1 Release 2 corpus for use in SpeechBrain ASR recipes. It parses the Switchboard transcription format, segments conversations into utterances, and generates CSV manifests for train/dev/test splits. Optionally, the Fisher corpus transcripts (LDC2004T19 and LDC2005T19) can be merged for tokenizer and language model training. The test set is derived from the eval2000/Hub5 data (LDC2002S09/LDC2002T43).

Usage

Use this script to prepare Switchboard data before running any Switchboard ASR, language model, or tokenizer training recipe. The script supports configurable train/dev splits and optional Fisher corpus integration.
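The configurable train/dev split can be pictured with a small sketch. It is not the script's own code: `split_by_ratio` is a hypothetical helper, the conversation IDs are made up, and the sketch assumes that `split_ratio` holds percentages that line up with `splits` and sum to 100. Whether the real script shuffles, or splits by conversation rather than by utterance, is not stated on this page.

```python
import random


def split_by_ratio(items, splits, split_ratio, seed=1234):
    """Hypothetical helper: shuffle items and cut them into named splits by percentage."""
    if len(splits) != len(split_ratio):
        raise ValueError("splits and split_ratio must have the same length")
    if sum(split_ratio) != 100:
        raise ValueError("split_ratio must sum to 100")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    result = {}
    start = 0
    for i, (name, pct) in enumerate(zip(splits, split_ratio)):
        if i == len(splits) - 1:
            # The last split takes whatever remains, so rounding never drops items.
            end = len(shuffled)
        else:
            end = start + round(len(shuffled) * pct / 100)
        result[name] = shuffled[start:end]
        start = end
    return result


# 20 made-up conversation IDs, split 90/10 into train and dev.
conversations = [f"conv_{n:02d}" for n in range(20)]
parts = split_by_ratio(conversations, ["train", "dev"], [90, 10])
print({name: len(ids) for name, ids in parts.items()})  # {'train': 18, 'dev': 2}
```

Fixing the seed is a design choice of this sketch: it keeps the dev set stable across runs, so results stay comparable between experiments.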

Code Reference

Source Location

Signature

def prepare_switchboard(
    data_folder,
    save_folder,
    splits=None,
    split_ratio=None,
    merge_lst=None,
    merge_name=None,
    skip_prep=False,
    add_fisher_corpus=False,
    max_utt=300,
):

Import

from switchboard_prepare import prepare_switchboard

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | str | Yes | Path to the folder where the Switchboard (and optionally Fisher) datasets are stored |
| save_folder | str | Yes | Directory in which the output CSV files are written |
| splits | list | No | Data splits to generate (default: ["train", "dev"]) |
| split_ratio | list | No | Percentage of the data allocated to each entry in splits, in the same order (e.g. [90, 10]) |
| merge_lst | list | No | List of CSV files to merge together |
| merge_name | str | No | Name for the merged output CSV file |
| skip_prep | bool | No | If True, skips data preparation (default: False) |
| add_fisher_corpus | bool | No | If True, adds Fisher corpus transcripts to the CSVs (default: False) |
| max_utt | int | No | Maximum number of utterances kept per identical transcription; excess repeats of very frequent utterances such as "uh-huh" are dropped (default: 300) |
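As an illustration of `max_utt`, the sketch below assumes the parameter caps how many utterances with the same transcription are kept, in the spirit of the duplicate-removal step in the Kaldi Switchboard recipe. `cap_repeated_transcripts` and the `(utt_id, words)` tuple layout are hypothetical and are not taken from the SpeechBrain script.

```python
from collections import defaultdict


def cap_repeated_transcripts(utterances, max_utt=300):
    """Keep at most max_utt utterances per identical transcription.

    utterances: iterable of (utt_id, words) tuples (hypothetical layout).
    Returns the kept utterances in their original order.
    """
    seen = defaultdict(int)  # transcription -> number of times seen so far
    kept = []
    for utt_id, words in utterances:
        seen[words] += 1
        if seen[words] <= max_utt:
            kept.append((utt_id, words))
    return kept


# Tiny made-up example with a cap of 2 instead of the default 300.
sample = [
    ("u1", "uh-huh"),
    ("u2", "yeah"),
    ("u3", "uh-huh"),
    ("u4", "uh-huh"),  # third "uh-huh": dropped
    ("u5", "i think so"),
]
print(cap_repeated_transcripts(sample, max_utt=2))
```

A cap like this stops short backchannel utterances from dominating the training data while leaving rarer transcriptions untouched.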

Outputs

| Name | Type | Description |
|---|---|---|
| train.csv | CSV file | Training set manifest with utterance-level segmentation |
| dev.csv | CSV file | Development set manifest |
| test.csv | CSV file | Test set manifest from eval2000/Hub5 data |
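A quick sanity check on a generated manifest might look like the sketch below. The column names (`ID`, `duration`, `wav`, `words`) are assumptions based on common SpeechBrain CSV conventions and should be checked against the header of the actual output files. `summarize_manifest` is a hypothetical helper, and the rows are made up.

```python
import csv
import io


def summarize_manifest(csv_file):
    """Return (number of utterances, total duration in seconds) for a manifest.

    Assumes a header row containing a "duration" column, in seconds.
    """
    reader = csv.DictReader(csv_file)
    n_utts = 0
    total_seconds = 0.0
    for row in reader:
        n_utts += 1
        total_seconds += float(row["duration"])
    return n_utts, total_seconds


# In-memory stand-in for a manifest; the columns here are assumed, not confirmed.
sample = io.StringIO(
    "ID,duration,wav,words\n"
    "utt_a,2.5,/path/to/a.wav,HELLO THERE\n"
    "utt_b,1.5,/path/to/b.wav,UH-HUH\n"
)
print(summarize_manifest(sample))  # (2, 4.0)
```

For a real file, pass an open handle instead, for example `open("/path/to/output/train.csv", newline="")`.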

Usage Examples

from switchboard_prepare import prepare_switchboard

# Basic preparation
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
)

# With Fisher corpus for LM training
prepare_switchboard(
    data_folder="/path/to/Switchboard",
    save_folder="/path/to/output",
    add_fisher_corpus=True,
    merge_lst=["train.csv", "fisher.csv"],
    merge_name="train_fisher.csv",
)
