Principle: SpeechBrain Sound Classification Training
| Knowledge Sources | |
|---|---|
| Domains | Audio_Classification, Environmental_Sound, Keyword_Spotting, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Sound classification training maps fixed-length or variable-length audio segments to discrete class labels using learned feature representations and a classification head, typically leveraging pretrained audio embeddings or spectral features as input to CNN or Transformer backbones.
Description
Environmental sound classification and keyword spotting are closed-set classification tasks where the goal is to assign an audio clip to one of a predefined set of categories. Unlike ASR, which produces variable-length token sequences, sound classification produces a single class label per input segment. The training pipeline typically follows a two-stage pattern: first, a feature extractor converts raw audio into a compact representation (log-mel spectrograms, MFCC features, or embeddings from pretrained models), and second, a classification network maps these features to class probabilities. Modern approaches increasingly rely on pretrained audio encoders -- models trained on large-scale audio data through self-supervised or supervised pretraining -- which provide rich, transferable representations that dramatically reduce the amount of task-specific labeled data required. The classification head is then trained (and the encoder optionally fine-tuned) using standard cross-entropy loss with softmax output.
Usage
Use this principle when the task requires assigning audio clips to a fixed set of categories, such as environmental sound recognition (siren, dog bark, rain), keyword spotting (yes, no, stop, go), or acoustic scene classification (office, street, park). This approach applies whenever the output is a single discrete label rather than a sequence of tokens. It is also appropriate for audio tagging tasks where multiple labels may apply, by substituting sigmoid activations and binary cross-entropy loss for softmax and categorical cross-entropy.
Theoretical Basis
Feature Extraction
The first stage transforms raw audio waveforms into compact spectral or learned representations:
Option A -- Spectral features:
X_stft = STFT(waveform, n_fft=400, hop_length=160) -- 25 ms window, 10 ms hop at 16 kHz
X_mel = MelFilterbank(|X_stft|^2, n_mels=80)
X_feat = log(X_mel + epsilon)
Result: (T_frames, n_mels) spectrogram
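Option A can be sketched end-to-end in NumPy. This is a minimal, illustrative implementation (helper names like `log_mel` and `mel_filterbank` are ours, not a SpeechBrain API); the parameters match the pseudocode above and correspond to 16 kHz audio:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel(waveform, sr=16000, n_fft=400, hop=160, n_mels=80, eps=1e-10):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto mel filters, and compress with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2      # (T, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T      # (T, n_mels)
    return np.log(mel + eps)
```

One second of 16 kHz audio yields 98 frames of 80 log-mel bins, i.e. the `(T_frames, n_mels)` shape stated above. In practice this whole pipeline is a single call in libraries such as torchaudio.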
Option B -- Pretrained embeddings:
X_feat = PretrainedEncoder(waveform)
Result: (T_frames, d_embed) contextualized embeddings
Examples: wav2vec2, HuBERT, ECAPA-TDNN embeddings
Classification Architecture
The classification network maps variable-length feature sequences to fixed-dimensional class logits:
1. Backbone encoder:
h = Backbone(X_feat)
- CNN: stacked convolutional blocks with batch norm, ReLU, and pooling
- Transformer: stacked self-attention layers with positional encoding
- Pretrained: frozen or fine-tuned encoder layers
2. Temporal pooling (aggregate over time):
h_pool = Pooling(h)
- Statistics pooling: concatenate mean and standard deviation over time
- Attention pooling: weighted sum using learned attention weights
- Global average pooling: simple mean over time dimension
3. Classification head:
logits = Linear(h_pool) -- project to num_classes dimensions
p(class | x) = Softmax(logits)
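Steps 2 and 3 (statistics pooling followed by a linear head and softmax) can be sketched as follows. The random-weight forward pass is purely illustrative, with hypothetical names (`stats_pooling`, `classify`) rather than any library API:

```python
import numpy as np

def stats_pooling(h):
    # Statistics pooling: concatenate mean and std over time, (T, d) -> (2d,)
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h, W, b):
    # h: (T, d) frame features; W: (C, 2d) head weights; b: (C,) bias
    logits = W @ stats_pooling(h) + b
    return softmax(logits)       # p(class | x)

# Toy forward pass: 50 frames of 8-dim features, 4 classes, random weights.
rng = np.random.default_rng(0)
T, d, C = 50, 8, 4
h = rng.standard_normal((T, d))
W = rng.standard_normal((C, 2 * d)) * 0.1
b = np.zeros(C)
p = classify(h, W, b)
```

Statistics pooling doubles the feature dimension (mean plus standard deviation), which is why `W` has shape `(C, 2d)`; swapping in global average pooling would halve it.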
Training Objective
The standard training objective is categorical cross-entropy:
L_CE = -sum_{c=1}^{C} y_c * log(p_c)
where:
C = number of classes
y_c = 1 if the true class is c, 0 otherwise (one-hot encoding)
p_c = Softmax(logits)_c = predicted probability for class c
For improved generalization, label smoothing is often applied:
y_smooth_c = (1 - epsilon) * y_c + epsilon / C
where epsilon is a small smoothing constant (typically 0.1)
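The smoothed objective above reduces to plain cross-entropy at epsilon = 0. A minimal sketch (the function name is ours; frameworks expose this as an option on their cross-entropy loss):

```python
import numpy as np

def smoothed_cross_entropy(p, true_class, num_classes, eps=0.1):
    # Build y_smooth_c = (1 - eps) * y_c + eps / C, then L = -sum_c y_smooth_c * log(p_c)
    y = np.full(num_classes, eps / num_classes)
    y[true_class] += 1.0 - eps
    return -np.sum(y * np.log(p))

p = np.array([0.7, 0.1, 0.1, 0.1])     # predicted class probabilities
plain = smoothed_cross_entropy(p, true_class=0, num_classes=4, eps=0.0)
smooth = smoothed_cross_entropy(p, true_class=0, num_classes=4, eps=0.1)
```

With eps = 0 the loss is exactly -log(p_true); with eps > 0 the loss also penalizes overconfident probabilities on the wrong classes, which is the regularizing effect.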
Data Augmentation
Sound classification benefits from augmentation strategies tailored to audio:
Common augmentation techniques:
- SpecAugment: random time and frequency masking on spectrograms
- Time shifting: circular shift of waveform by random offset
- Noise injection: additive noise at random SNR levels
- Speed perturbation: resampling at slightly different rates (0.9x-1.1x)
- Mixup: linear interpolation of two training examples and their labels
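Two of the techniques above, SpecAugment-style masking and mixup, are simple enough to sketch directly. This is an illustrative NumPy version (function names and mask widths are our choices, not a specific library's defaults):

```python
import numpy as np

def spec_augment(spec, rng, max_t=10, max_f=8, n_masks=2):
    # Zero out random time and frequency stripes of a (T, n_mels) spectrogram.
    out = spec.copy()
    T, F = out.shape
    for _ in range(n_masks):
        t0 = rng.integers(0, T - max_t)
        out[t0 : t0 + rng.integers(1, max_t + 1), :] = 0.0   # time mask
        f0 = rng.integers(0, F - max_f)
        out[:, f0 : f0 + rng.integers(1, max_f + 1)] = 0.0   # frequency mask
    return out

def mixup(x1, y1, x2, y2, lam):
    # Linear interpolation of two examples and their one-hot labels.
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
spec = np.ones((100, 80))
aug = spec_augment(spec, rng)
x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]), lam=0.3)
```

Note that mixup produces soft labels, so the cross-entropy target is no longer one-hot; the smoothed-label formulation above already accommodates this.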
Evaluation
Classification performance is measured using:
Accuracy = (number of correct predictions) / (total number of predictions)
For imbalanced datasets, additional metrics are used:
- Macro F1-score: average F1 across all classes (equal weight per class)
- Confusion matrix: per-class error analysis
- k-fold cross-validation: standard for ESC-50 (5 folds)
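Accuracy and macro F1 can be computed from predictions directly; a small self-contained sketch (scikit-learn's `f1_score(average="macro")` does the same):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes):
    # Average per-class F1 with equal weight per class, regardless of frequency.
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc = accuracy(y_true, y_pred)            # 4 of 6 correct
mf1 = macro_f1(y_true, y_pred, num_classes=3)
```

On an imbalanced dataset, macro F1 drops sharply when a rare class is systematically missed, whereas accuracy can remain high; reporting both (plus the confusion matrix) is why these metrics are listed together.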