Principle: SpeechBrain Sound Classification Training
| Knowledge Sources | |
|---|---|
| Domains | Audio_Classification, Environmental_Sound, Keyword_Spotting, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Sound classification training maps fixed-length or variable-length audio segments to discrete class labels using learned feature representations and a classification head, typically leveraging pretrained audio embeddings or spectral features as input to CNN or Transformer backbones.
Description
Environmental sound classification and keyword spotting are closed-set classification tasks where the goal is to assign an audio clip to one of a predefined set of categories. Unlike ASR, which produces variable-length token sequences, sound classification produces a single class label per input segment. The training pipeline typically follows a two-stage pattern: first, a feature extractor converts raw audio into a compact representation (log-mel spectrograms, MFCC features, or embeddings from pretrained models), and second, a classification network maps these features to class probabilities. Modern approaches increasingly rely on pretrained audio encoders -- models trained on large-scale audio data through self-supervised or supervised pretraining -- which provide rich, transferable representations that dramatically reduce the amount of task-specific labeled data required. The classification head is then trained (and the encoder optionally fine-tuned) using standard cross-entropy loss with softmax output.
Usage
Use this principle when the task requires assigning audio clips to a fixed set of categories, such as environmental sound recognition (siren, dog bark, rain), keyword spotting (yes, no, stop, go), or acoustic scene classification (office, street, park). This approach applies whenever the output is a single discrete label rather than a sequence of tokens. It is also appropriate for audio tagging tasks where multiple labels may apply, by substituting sigmoid activations and binary cross-entropy loss for softmax and categorical cross-entropy.
Theoretical Basis
Feature Extraction
The first stage transforms raw audio waveforms into compact spectral or learned representations:
Option A -- Spectral features:
X_stft = STFT(waveform, n_fft=400, hop_length=160) -- 25 ms window, 10 ms hop at 16 kHz
X_mel = MelFilterbank(|X_stft|^2, n_mels=80)
X_feat = log(X_mel + epsilon)
Result: (T_frames, n_mels) spectrogram
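Option A can be sketched end-to-end in NumPy. This is a minimal, illustrative implementation (helper names like `log_mel` and `mel_filterbank` are ours, not a SpeechBrain API); the parameters match the pseudocode above and correspond to 16 kHz audio:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel(waveform, sr=16000, n_fft=400, hop=160, n_mels=80, eps=1e-10):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto mel filters, and compress with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2      # (T, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T      # (T, n_mels)
    return np.log(mel + eps)
```

One second of 16 kHz audio yields 98 frames of 80 log-mel bins, i.e. the `(T_frames, n_mels)` shape stated above. In practice this whole pipeline is a single call in libraries such as torchaudio.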
Option B -- Pretrained embeddings:
X_feat = PretrainedEncoder(waveform)
Result: (T_frames, d_embed) contextualized embeddings
Examples: wav2vec2, HuBERT, ECAPA-TDNN embeddings
Classification Architecture
The classification network maps variable-length feature sequences to fixed-dimensional class logits:
1. Backbone encoder:
h = Backbone(X_feat)
- CNN: stacked convolutional blocks with batch norm, ReLU, and pooling
- Transformer: stacked self-attention layers with positional encoding
- Pretrained: frozen or fine-tuned encoder layers
2. Temporal pooling (aggregate over time):
h_pool = Pooling(h)
- Statistics pooling: concatenate mean and standard deviation over time
- Attention pooling: weighted sum using learned attention weights
- Global average pooling: simple mean over time dimension
3. Classification head:
logits = Linear(h_pool) -- project to num_classes dimensions
p(class | x) = Softmax(logits)
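Steps 2 and 3 (statistics pooling followed by a linear head and softmax) can be sketched as follows. The random-weight forward pass is purely illustrative, with hypothetical names (`stats_pooling`, `classify`) rather than any library API:

```python
import numpy as np

def stats_pooling(h):
    # Statistics pooling: concatenate mean and std over time, (T, d) -> (2d,)
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h, W, b):
    # h: (T, d) frame features; W: (C, 2d) head weights; b: (C,) bias
    logits = W @ stats_pooling(h) + b
    return softmax(logits)       # p(class | x)

# Toy forward pass: 50 frames of 8-dim features, 4 classes, random weights.
rng = np.random.default_rng(0)
T, d, C = 50, 8, 4
h = rng.standard_normal((T, d))
W = rng.standard_normal((C, 2 * d)) * 0.1
b = np.zeros(C)
p = classify(h, W, b)
```

Statistics pooling doubles the feature dimension (mean plus standard deviation), which is why `W` has shape `(C, 2d)`; swapping in global average pooling would halve it.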
Training Objective
The standard training objective is categorical cross-entropy:
L_CE = -sum_{c=1}^{C} y_c * log(p_c)
where:
C = number of classes
y_c = 1 if the true class is c, 0 otherwise (one-hot encoding)
p_c = Softmax(logits)_c = predicted probability for class c
For improved generalization, label smoothing is often applied:
y_smooth_c = (1 - epsilon) * y_c + epsilon / C
where epsilon is a small smoothing constant (typically 0.1)
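The smoothed objective above reduces to plain cross-entropy at epsilon = 0. A minimal sketch (the function name is ours; frameworks expose this as an option on their cross-entropy loss):

```python
import numpy as np

def smoothed_cross_entropy(p, true_class, num_classes, eps=0.1):
    # Build y_smooth_c = (1 - eps) * y_c + eps / C, then L = -sum_c y_smooth_c * log(p_c)
    y = np.full(num_classes, eps / num_classes)
    y[true_class] += 1.0 - eps
    return -np.sum(y * np.log(p))

p = np.array([0.7, 0.1, 0.1, 0.1])     # predicted class probabilities
plain = smoothed_cross_entropy(p, true_class=0, num_classes=4, eps=0.0)
smooth = smoothed_cross_entropy(p, true_class=0, num_classes=4, eps=0.1)
```

With eps = 0 the loss is exactly -log(p_true); with eps > 0 the loss also penalizes overconfident probabilities on the wrong classes, which is the regularizing effect.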
Data Augmentation
Sound classification benefits from augmentation strategies tailored to audio:
Common augmentation techniques:
- SpecAugment: random time and frequency masking on spectrograms
- Time shifting: circular shift of waveform by random offset
- Noise injection: additive noise at random SNR levels
- Speed perturbation: resampling at slightly different rates (0.9x-1.1x)
- Mixup: linear interpolation of two training examples and their labels
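Two of the techniques above, SpecAugment-style masking and mixup, are simple enough to sketch directly. This is an illustrative NumPy version (function names and mask widths are our choices, not a specific library's defaults):

```python
import numpy as np

def spec_augment(spec, rng, max_t=10, max_f=8, n_masks=2):
    # Zero out random time and frequency stripes of a (T, n_mels) spectrogram.
    out = spec.copy()
    T, F = out.shape
    for _ in range(n_masks):
        t0 = rng.integers(0, T - max_t)
        out[t0 : t0 + rng.integers(1, max_t + 1), :] = 0.0   # time mask
        f0 = rng.integers(0, F - max_f)
        out[:, f0 : f0 + rng.integers(1, max_f + 1)] = 0.0   # frequency mask
    return out

def mixup(x1, y1, x2, y2, lam):
    # Linear interpolation of two examples and their one-hot labels.
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
spec = np.ones((100, 80))
aug = spec_augment(spec, rng)
x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]), lam=0.3)
```

Note that mixup produces soft labels, so the cross-entropy target is no longer one-hot; the smoothed-label formulation above already accommodates this.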
Evaluation
Classification performance is measured using:
Accuracy = (number of correct predictions) / (total number of predictions)
For imbalanced datasets, additional metrics are used:
- Macro F1-score: average F1 across all classes (equal weight per class)
- Confusion matrix: per-class error analysis
- k-fold cross-validation: standard for ESC-50 (5 folds)
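Accuracy and macro F1 can be computed from predictions directly; a small self-contained sketch (scikit-learn's `f1_score(average="macro")` does the same):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes):
    # Average per-class F1 with equal weight per class, regardless of frequency.
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc = accuracy(y_true, y_pred)            # 4 of 6 correct
mf1 = macro_f1(y_true, y_pred, num_classes=3)
```

On an imbalanced dataset, macro F1 drops sharply when a rare class is systematically missed, whereas accuracy can remain high; reporting both (plus the confusion matrix) is why these metrics are listed together.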