Principle:Openai Whisper Mel Spectrogram Computation
Overview
Mel spectrogram computation converts a time-domain audio waveform into a time-frequency representation suitable for speech recognition. The log-mel spectrogram is a standard input feature for modern automatic speech recognition (ASR) systems, including Whisper. It approximates human auditory perception through mel-scale frequency warping and compresses the dynamic range through logarithmic scaling.
Theoretical Background
Short-Time Fourier Transform (STFT)
The first step converts the time-domain waveform into a time-frequency representation:
- Window the audio signal into overlapping frames using a window function (e.g., Hann window)
- Compute the FFT for each frame to obtain the frequency spectrum
- Take the magnitude squared to obtain the power spectrum
The STFT parameters determine the time-frequency resolution trade-off:
| Parameter | Whisper Value | Effect |
|---|---|---|
| N_FFT (window size) | 400 (25ms at 16kHz) | Determines frequency resolution |
| HOP_LENGTH (stride) | 160 (10ms at 16kHz) | Determines time resolution (frame rate = 100 Hz) |
| Window function | Hann | Reduces spectral leakage |
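The three steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than Whisper's own code: the function name `power_spectrogram`, the periodic Hann window, and the reflect padding that centers each frame are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000
N_FFT = 400       # 25 ms window at 16 kHz
HOP_LENGTH = 160  # 10 ms stride at 16 kHz


def power_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Return |STFT|^2 with shape (N_FFT // 2 + 1, n_frames)."""
    # Periodic Hann window (assumption: chosen for this sketch).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    # Reflect-pad so that frame t is centered on sample t * HOP_LENGTH.
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP_LENGTH
    # Index matrix of shape (n_frames, N_FFT) that slices out each frame.
    idx = np.arange(N_FFT)[None, :] + HOP_LENGTH * np.arange(n_frames)[:, None]
    frames = padded[idx] * window
    spectrum = np.fft.rfft(frames, axis=1)  # (n_frames, N_FFT // 2 + 1)
    return (np.abs(spectrum) ** 2).T


# One second of a 1 kHz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
power = power_spectrogram(np.sin(2.0 * np.pi * 1000.0 * t))
peak_bin = int(power[:, 50].argmax())  # strongest frequency bin in a middle frame
```

With N_FFT = 400 at 16 kHz, each frequency bin is 16000 / 400 = 40 Hz wide, so a 1 kHz tone is expected to peak in bin 25. A larger N_FFT would narrow the bins at the cost of a longer analysis window, which is the time-frequency trade-off the table describes.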
Mel Filterbank Projection
The power spectrum is projected onto the mel scale, which approximates human auditory frequency perception. Several variants of the scale exist; a widely used one (the HTK-style formula) is:
mel(f) = 2595 * log10(1 + f / 700)
This mapping has two important properties:
- Low frequencies are spread out — more filters in the region where human hearing has finer resolution
- High frequencies are compressed — fewer filters where human hearing is less discriminative
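Both properties can be checked numerically. The sketch below implements the formula above and its inverse (the helper names `hz_to_mel` and `mel_to_hz` are chosen for this example), then places points at equal mel spacing and measures the gaps between them in Hz.

```python
import numpy as np


def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


# Ten points equally spaced in mel between 0 Hz and 8 kHz
# (the Nyquist frequency at a 16 kHz sample rate).
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
hz_points = mel_to_hz(mel_points)
gaps_hz = np.diff(hz_points)  # spacing in Hz between neighboring points
```

The entries of `gaps_hz` grow monotonically: points that are evenly spaced in mel sit close together at low frequencies and far apart at high frequencies.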
A bank of triangular filters is constructed on the mel scale. Each filter integrates the power spectrum over a range of frequencies, producing a single value per filter per frame. Whisper uses either 80 or 128 mel filters depending on the model variant.
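A minimal sketch of such a filterbank is shown below. It assumes the HTK-style formula given earlier and leaves the triangles unnormalized (peak value at most 1); the function name `mel_filterbank` and its defaults are chosen for illustration. Implementations differ in scale variant and normalization, so this is not expected to reproduce the exact filter weights used by Whisper.

```python
import numpy as np


def mel_filterbank(n_mels: int = 80, n_fft: int = 400,
                   sample_rate: int = 16_000) -> np.ndarray:
    """Triangular filters of shape (n_mels, n_fft // 2 + 1)."""

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    nyquist = sample_rate / 2
    # Center frequency of every FFT bin, from 0 Hz to Nyquist.
    fft_freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced in mel, define n_mels triangles.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), n_mels + 2))

    filters = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lower, center, upper = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        # Each triangle is the lower of the two slopes, floored at zero.
        filters[i] = np.maximum(0.0, np.minimum(rising, falling))
    return filters


filters = mel_filterbank()
# Project a (201, n_frames) power spectrogram onto the mel axis.
power = np.ones((201, 5))
mel_spec = filters @ power  # one value per filter per frame: (80, 5)
```

The matrix product is what "integrates the power spectrum over a range of frequencies" means in practice: each row of `filters` weights the FFT bins under one triangle and sums them.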
Log Compression
The mel-filtered power values span a very large dynamic range. Logarithmic compression is applied to:
- Compress the dynamic range — making quiet and loud sounds more comparable
- Approximate human loudness perception — which is approximately logarithmic
- Stabilize training — by reducing the variance of input features
The specific log compression in Whisper is:
- Apply log10 to the mel spectrogram
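- Clamp the minimum value to max_value - 8.0, where max_value is the largest log10 value in the spectrogram (8 log10 units of power correspond to an 80 dB dynamic range)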
- Normalize to approximately [-1, 1] range: (log_spec + 4.0) / 4.0
The full computation can be summarized as:
log_spec = log10(mel_filters @ |STFT|^2)
log_spec = clamp(log_spec, min=max(log_spec) - 8.0)
normalized = (log_spec + 4.0) / 4.0
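The compression steps can be written as a short NumPy function. This is a sketch under stated assumptions: the name `log_compress` is illustrative, and the small floor of 1e-10 applied before the logarithm is added here only so that log10 never receives zero.

```python
import numpy as np


def log_compress(mel_spec: np.ndarray) -> np.ndarray:
    """Apply log10, clamp to an 80 dB range, and rescale."""
    # Floor at a tiny positive value so log10 never sees zero
    # (assumption: 1e-10 is an illustrative floor).
    log_spec = np.log10(np.maximum(mel_spec, 1e-10))
    # Keep only the top 8 log10 units (80 dB in power) below the peak.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0


mel_spec = np.array([[1.0, 1e-3], [1e-12, 1e-6]])
features = log_compress(mel_spec)
# log10 gives [[0, -3], [-10, -6]]; clamping at 0 - 8 raises -10 to -8;
# rescaling yields [[1.0, 0.25], [-1.0, -0.5]].
```

In this example the peak log value is 0, so the output spans exactly [-1, 1]. For other inputs the range shifts with the peak, which is why the normalization is only approximate.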
Output Representation
The resulting log-mel spectrogram is a 2D tensor of shape (n_mels, n_frames) where:
- n_mels — number of mel frequency bins (80 or 128)
- n_frames — number of time frames (depends on audio length; 3000 for 30 seconds)
Each column represents the frequency content of a 25ms audio window, with windows spaced 10ms apart.
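The frame counts and durations quoted above follow from simple arithmetic on the STFT parameters. The sketch below assumes one frame is emitted per hop; the exact count in a given implementation can differ by one depending on its padding convention.

```python
SAMPLE_RATE = 16_000
N_FFT = 400
HOP_LENGTH = 160


def n_frames(duration_s: float) -> int:
    """Frames produced when one frame is emitted every HOP_LENGTH samples."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


frames_30s = n_frames(30.0)               # 480000 // 160
frame_rate_hz = SAMPLE_RATE / HOP_LENGTH  # frames per second
window_ms = 1000 * N_FFT / SAMPLE_RATE    # duration of one analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE  # spacing between windows
```

A 30-second clip therefore yields a tensor of shape (80, 3000) or (128, 3000), depending on the number of mel bins.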
Key Concepts
- STFT — decomposes a waveform into overlapping windowed frequency spectra
- Mel scale — perceptually motivated frequency warping that emphasizes speech-relevant frequencies
- Triangular filterbank — set of overlapping triangular filters on the mel scale that integrate power spectrum energy
- Log compression — logarithmic scaling to compress dynamic range and approximate human loudness perception
- Dynamic range clamping — limits the minimum value to 80dB below the maximum to suppress noise floor
- Feature normalization — shifts and scales the log spectrogram to an approximately [-1, 1] range for neural network input
References
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. https://arxiv.org/abs/2212.04356
- Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A Scale for the Measurement of the Psychological Magnitude Pitch. Journal of the Acoustical Society of America.
Metadata
Speech_Recognition Signal_Processing Implementation:Openai_Whisper_Log_Mel_Spectrogram 2025-06-25 00:00 GMT