Principle:SpeechBrain Dataset-Specific Data Preparation

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Speech_Processing, Corpus_Preparation
Last Updated 2026-02-09 00:00 GMT

Overview

Dataset-specific data preparation is the process of converting raw speech corpus distribution formats into a standardized tabular manifest representation that decouples corpus-specific layouts from downstream training logic.

Description

Speech and audio corpora are distributed in a wide variety of native formats: RTTM annotations for diarization, TSV metadata for crowdsourced datasets, XML transcripts for broadcast speech, Kaldi-style segment files, and many others. Each corpus has its own directory structure, file naming conventions, annotation schema, and audio encoding. Dataset-specific data preparation scripts bridge this gap by parsing corpus-native formats and producing SpeechBrain's standardized CSV or JSON manifest files. These manifests contain uniform columns -- typically ID, duration, wav (audio file path), and task-specific labels such as transcriptions, speaker identifiers, emotion tags, or sound class labels -- enabling any downstream training recipe to consume any prepared corpus without modification.

Usage

Use this principle whenever integrating a new speech or audio corpus into a SpeechBrain training pipeline. Before any model training can begin, a preparation script must be written or invoked that reads the raw corpus files and emits standardized CSV/JSON manifests for each data split (train, dev/valid, test). This is the mandatory first stage of every recipe.

Theoretical Basis

The core algorithmic pattern shared by all dataset preparation scripts follows a deterministic pipeline:

Algorithm: Dataset-Specific Data Preparation

Input:  raw corpus directory D, output directory O, split definitions S
Output: CSV/JSON manifest files {M_train, M_dev, M_test}

1. VALIDATE that D contains expected corpus structure
2. CHECK if output manifests already exist in O (idempotency guard)
   - If skip_prep is set, or all manifests already exist, RETURN early
3. For each split s in S:
   a. PARSE corpus-native annotation files for split s
      - RTTM files  -> extract speaker, start_time, duration per segment
      - TSV files   -> extract audio_path, transcript, metadata per row
      - XML files   -> extract utterance boundaries, transcripts per element
      - Kaldi files -> extract utterance-id, wav-path, transcript per line
      - TextGrid    -> extract tier intervals with labels
   b. For each utterance u discovered:
      i.   RESOLVE audio file path (absolute or relative to data_root)
      ii.  COMPUTE duration from audio metadata or annotation timestamps
      iii. NORMALIZE text transcription (unicode, casing, punctuation)
      iv.  FILTER invalid entries (missing audio, empty transcript, too short)
      v.   ASSIGN unique utterance ID
   c. WRITE manifest M_s as CSV with columns:
      ID, duration, wav, {task-specific label columns}
4. RETURN paths to generated manifests

Standard Manifest Format

All preparation scripts converge on the same output schema:

ID, duration, wav, spk_id, wrd
utterance_001, 3.42, /data/corpus/wavs/utt001.wav, speaker_A, "hello world"
utterance_002, 5.17, /data/corpus/wavs/utt002.wav, speaker_B, "good morning"

The exact label columns vary by task: ASR manifests include wrd (word-level transcription) and optionally char (character-level); diarization manifests include speaker and start/stop boundaries; classification manifests include class_label or emotion; and language identification manifests include language.
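
How the shared base columns combine with per-task label columns can be sketched with the standard library alone; write_manifest is an illustrative helper, not a SpeechBrain function, and the label values are made up to mirror the schema above.

```python
import csv
import io

# Shared base columns from the standard manifest schema.
BASE_COLUMNS = ["ID", "duration", "wav"]

def write_manifest(fileobj, rows, label_cols):
    """Write a manifest whose label columns vary by task."""
    writer = csv.DictWriter(fileobj, fieldnames=BASE_COLUMNS + label_cols)
    writer.writeheader()
    writer.writerows(rows)

# ASR manifest: speaker ID plus word-level transcription.
asr_buf = io.StringIO()
write_manifest(asr_buf, [{"ID": "utterance_001", "duration": 3.42,
                          "wav": "/data/corpus/wavs/utt001.wav",
                          "spk_id": "speaker_A", "wrd": "hello world"}],
               label_cols=["spk_id", "wrd"])

# Classification manifest: same base columns, different label column.
cls_buf = io.StringIO()
write_manifest(cls_buf, [{"ID": "utterance_002", "duration": 5.17,
                          "wav": "/data/corpus/wavs/utt002.wav",
                          "class_label": "dog_bark"}],
               label_cols=["class_label"])
```

Because only the trailing label columns differ, downstream recipes can read any such manifest through the same loader and simply request the columns their task needs.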

Key Design Properties

  • Idempotency -- preparation scripts detect existing outputs and skip redundant processing, ensuring safe re-runs
  • DDP Safety -- in distributed training, only rank-0 executes preparation while other processes wait, preventing race conditions on filesystem writes
  • Corpus Independence -- downstream training code depends only on the manifest schema, never on raw corpus formats
  • Text Normalization -- transcripts undergo unicode normalization, optional accent stripping, language-specific filtering, and whitespace cleanup to ensure tokenizer consistency
  • Duration-Based Filtering -- utterances outside acceptable duration bounds are excluded to prevent memory issues and improve batching efficiency
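
The DDP-safety property amounts to a rank-0-only guard around the preparation call, which can be sketched as follows. In real SpeechBrain recipes this role is played by speechbrain.utils.distributed.run_on_main backed by a torch.distributed barrier; the barrier callable below is a hypothetical stand-in for that synchronization point.

```python
import os

def run_on_main(func, *args, barrier=None, **kwargs):
    """Minimal sketch: run `func` on rank 0 only, then synchronize."""
    rank = int(os.environ.get("RANK", 0))  # DDP launchers export RANK
    if rank == 0:
        func(*args, **kwargs)  # only rank 0 writes manifests to disk
    if barrier is not None:
        barrier()  # every rank waits here until preparation has finished
```

Non-zero ranks skip the filesystem writes entirely and block at the barrier, so no process ever reads a half-written manifest.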
