Principle:Speechbrain Speechbrain Audio Interpretability

Knowledge Sources	Listen to Interpret (L2I), Posthoc Interpretation via Quantization (PIQ), SpeechBrain
Domains	Audio_Classification, Explainability
Last Updated	2026-02-09 00:00 GMT

Overview

Post-hoc interpretation of audio classifier decisions identifies which time-frequency regions of an input spectrogram contribute most to a predicted class label.

Description

Audio interpretability methods produce human-readable explanations for black-box audio classifiers by generating saliency maps over the input spectrogram. Given a trained classifier and an audio input, these methods estimate a time-frequency mask or activation map that highlights the spectral regions responsible for the classification decision. The resulting interpretation can be rendered as audio (by masking the original spectrogram and inverting the STFT) or visualized as a heatmap. This principle encompasses several complementary approaches: Activation Map Thresholding (AMT), Listen to Interpret (L2I), Listenable Maps for Audio Classifiers (L-MAC), Posthoc Interpretation via Quantization (PIQ), and Non-negative Matrix Factorization (NMF).

Usage

Apply these methods when you need to understand why an audio classifier assigned a particular label, when debugging misclassifications, or when building trust in deployed classification systems. These approaches are suitable for any audio classification task where the input is a spectrogram and the classifier produces per-class logits.

Theoretical Basis

Activation Map Thresholding (AMT)

AMT is a by-design interpretation method that extracts attention-like activation maps directly from the internal representations of the classifier. For Vision Transformer (ViT) architectures, AMT uses the CLS token attention weights from the final transformer layer. For FocalNet architectures, it uses the modulation gate outputs. These activations are upsampled to the original spectrogram resolution and thresholded using a quantile parameter:

Given activation map A of shape (T', F'):
1. Upsample A to match spectrogram dimensions (T, F)
2. Compute threshold tau = quantile(A, q)       # e.g., q = 0.8
3. Binary mask: M[t,f] = 1 if A[t,f] >= tau, else 0
4. Interpretation: X_int = M * X_logpower

AMT requires no additional training because it reuses existing classifier internals.
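The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not SpeechBrain's implementation: the function name `amt_mask`, the nearest-neighbour upsampling, and the toy array sizes are all assumptions chosen for brevity.

```python
import numpy as np

def amt_mask(activation, spec_shape, q=0.8):
    """Upsample an activation map to spec_shape and threshold it at quantile q.

    Nearest-neighbour upsampling is an assumption made for this sketch.
    """
    T, F = spec_shape
    Tp, Fp = activation.shape
    # Map every target index back to the source cell that covers it.
    t_idx = np.arange(T) * Tp // T
    f_idx = np.arange(F) * Fp // F
    upsampled = activation[np.ix_(t_idx, f_idx)]
    # Step 2: quantile threshold; step 3: binary mask.
    tau = np.quantile(upsampled, q)
    return (upsampled >= tau).astype(activation.dtype)

rng = np.random.default_rng(0)
A = rng.random((4, 5))              # toy activation map of shape (T', F')
X_logpower = rng.random((16, 20))   # toy log-power spectrogram of shape (T, F)
M = amt_mask(A, X_logpower.shape, q=0.8)
X_int = M * X_logpower              # step 4: masked interpretation
```

Raising `q` keeps fewer time-frequency bins, which gives a sparser but potentially less complete interpretation.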

Listen to Interpret (L2I)

L2I trains a separate interpreter network (psi) that takes the classifier's hidden representations and produces a time-frequency mask. The interpreter is trained to maximize input fidelity: when the masked spectrogram is fed back through the classifier, the predicted class should remain the same. The training objective balances fidelity with sparsity:

L_L2I = L_fidelity(f(M * X), f(X)) + lambda * ||M||_1

where f is the classifier, M is the predicted mask, and X is the input spectrogram. The L1 penalty encourages sparse masks that highlight only the most relevant regions.
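As a rough illustration of this objective, the sketch below evaluates the loss for a given mask. It assumes that the fidelity term is the cross-entropy between the classifier's output on the masked input and the class it predicted on the full input; the source does not specify the exact form. The linear `classifier` and all array sizes are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def l2i_loss(classifier, X, M, lam=0.01):
    """Fidelity term plus an L1 sparsity penalty on the mask M."""
    target = int(np.argmax(classifier(X)))          # class predicted on the full input
    masked_logits = classifier(M * X)
    # Assumed fidelity term: cross-entropy against the original prediction.
    fidelity = -log_softmax(masked_logits)[target]
    sparsity = np.abs(M).sum()                      # ||M||_1
    return fidelity + lam * sparsity

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8 * 6))                     # hypothetical linear classifier weights

def classifier(X):
    return W @ X.ravel()

X = rng.random((8, 6))
full_mask = np.ones_like(X)
loss_full = l2i_loss(classifier, X, full_mask, lam=0.01)
loss_no_penalty = l2i_loss(classifier, X, full_mask, lam=0.0)
```

With an all-ones mask the fidelity term is at its best for this input, so the gap between the two losses is exactly the sparsity penalty; a trained interpreter trades some fidelity for a smaller mask.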

L-MAC (Listenable Maps for Audio Classifiers)

L-MAC extends L2I by adding a faithfulness objective that directly penalizes the classifier probability drop between the original and masked inputs:

Faithfulness = p(y_pred | X) - p(y_pred | M * X)

Under this definition, a lower score means the masked input retains more of the predicted-class probability, which indicates that the interpretation preserves the information critical to the classifier's decision.
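The score defined above can be computed per sample from two sets of logits. The sketch below is only an illustration of that formula; the function names and the toy logits are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def faithfulness(logits_orig, logits_masked):
    """p(y_pred | X) - p(y_pred | M * X) for each sample in the batch."""
    p_orig = softmax(logits_orig)
    p_masked = softmax(logits_masked)
    y_pred = p_orig.argmax(axis=-1)     # class predicted on the original input
    rows = np.arange(len(y_pred))
    return p_orig[rows, y_pred] - p_masked[rows, y_pred]

# Sample 0: the mask leaves the logits unchanged.
# Sample 1: the mask lowers the logit of the predicted class.
logits_orig = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 1.0]])
logits_masked = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
ff = faithfulness(logits_orig, logits_masked)
```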

PIQ (Posthoc Interpretation via Quantization)

PIQ adds a vector quantization bottleneck to the interpreter decoder. The classifier hidden representations are quantized before being decoded into a mask. This quantization acts as an information bottleneck that forces the interpreter to retain only the most class-relevant features, producing cleaner and more compact saliency maps.

h = classifier_encoder(X)           # hidden representations
h_q = VectorQuantize(h)             # quantized bottleneck
M = psi_decoder(h_q)                # interpretation mask
X_int = sigmoid(M) * X_logpower     # masked interpretation
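The quantization step can be illustrated with a nearest-neighbour codebook lookup. This is a sketch under stated assumptions: the codebook size, the Euclidean distance, and the linear stand-in for `psi_decoder` (`W_dec`) are hypothetical, and training details such as codebook updates are omitted.

```python
import numpy as np

def vector_quantize(h, codebook):
    """Replace each row of h with its nearest codebook vector (Euclidean distance)."""
    # Pairwise squared distances, shape (N, K).
    d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d2.argmin(axis=1)
    return codebook[codes], codes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))           # 10 hidden vectors of dimension 4
codebook = rng.normal(size=(3, 4))     # K = 3 code vectors (hypothetical size)
h_q, codes = vector_quantize(h, codebook)

W_dec = rng.normal(size=(4, 6))        # hypothetical linear stand-in for psi_decoder
M = h_q @ W_dec                        # one mask row per time step, 6 frequency bins
X_logpower = rng.random((10, 6))
X_int = sigmoid(M) * X_logpower        # masked interpretation
```

Because every hidden vector collapses to one of `K` codes, the decoder sees at most `K` distinct inputs, which is the bottleneck effect described above.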

NMF-Based Interpretation

The NMF approach decomposes the magnitude spectrogram into non-negative components and selects those most relevant to the predicted class. Unlike the neural approaches, NMF provides interpretations grounded in a well-understood matrix factorization framework.
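A minimal sketch of the decomposition step is shown below, using the standard multiplicative updates for the Frobenius objective. The source does not say how class-relevant components are chosen, so the `selected` indices here are purely hypothetical placeholders for that step.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (F x T) as W (F x k) @ H (k x T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((12, 3)) @ rng.random((3, 20))   # toy rank-3 magnitude spectrogram
W, H = nmf(V, k=3)

selected = [0, 2]                               # hypothetical "relevant" components
V_int = W[:, selected] @ H[selected, :]         # interpretation from those components
```

Since all factors are non-negative, `V_int` never exceeds the full reconstruction `W @ H` in any bin, so the interpretation is an additive part of the approximated spectrogram.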

Evaluation Metrics

All methods are evaluated using shared metrics:

Input Fidelity (IF): Whether the classifier's top prediction is preserved when classifying the interpretation.
Faithfulness (FF): The difference in classifier probability between the original and masked inputs.
Average Drop (AD): The relative probability drop for the predicted class.
Average Increase (AI): Fraction of samples where the masked input increases the predicted class probability.
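Sparseness (SPS): How concentrated the saliency map is; higher values mean that fewer time-frequency bins carry most of the saliency.
Complexity (COMP): The entropy of the saliency map; lower values mean that the saliency is spread over fewer bins.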
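Some of these metrics can be sketched directly from the descriptions above. The block below is an assumption-laden illustration: AD and AI are reported as fractions rather than percentages, COMP is taken as the entropy of the saliency map normalized to sum to one, and SPS is left out because the source does not give its formula. The function name `interpretation_metrics` is hypothetical.

```python
import numpy as np

def interpretation_metrics(p_orig, p_masked, saliency, eps=1e-12):
    """p_orig, p_masked: (N, C) class probabilities; saliency: (N, T, F), non-negative."""
    y_pred = p_orig.argmax(axis=1)
    rows = np.arange(len(y_pred))
    p = p_orig[rows, y_pred]         # predicted-class probability on the original input
    p_m = p_masked[rows, y_pred]     # same class, masked input
    # Normalize each map to a distribution before taking its entropy.
    s = saliency.reshape(len(saliency), -1)
    s = s / (s.sum(axis=1, keepdims=True) + eps)
    return {
        "IF": float(np.mean(p_masked.argmax(axis=1) == y_pred)),
        "AD": float(np.mean(np.maximum(0.0, p - p_m) / p)),
        "AI": float(np.mean(p_m > p)),
        "COMP": float(np.mean(-(s * np.log(s + eps)).sum(axis=1))),
    }

p_orig = np.array([[0.8, 0.2], [0.6, 0.4]])
p_masked = np.array([[0.4, 0.6], [0.9, 0.1]])
saliency = np.ones((2, 2, 2))        # uniform maps, i.e. maximum entropy
m = interpretation_metrics(p_orig, p_masked, saliency)
```

In this toy batch the first sample loses its top prediction under masking while the second gains confidence, so IF and AI are both one half.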
