Principle:Openai Whisper Mel Spectrogram Computation

Overview

Mel Spectrogram Computation is the process of converting a time-domain audio waveform into a time-frequency representation suitable for speech recognition. The log-mel spectrogram is a standard input feature for many modern automatic speech recognition (ASR) systems, including Whisper. It approximates human auditory perception through mel-scale frequency warping and compresses the dynamic range through logarithmic scaling.

Theoretical Background

Short-Time Fourier Transform (STFT)

The first step converts the time-domain waveform into a time-frequency representation:

Window the audio signal into overlapping frames using a window function (e.g., Hann window)
Compute the FFT for each frame to obtain the frequency spectrum
Take the magnitude squared to obtain the power spectrum
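
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Whisper's implementation: the function names hann and power_spectrogram are hypothetical, the DFT is computed naively rather than with an FFT, and no padding is applied at the signal edges.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window: tapers each frame toward zero at its edges.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def power_spectrogram(signal, n_fft=400, hop_length=160):
    """Return a list of frames, each holding n_fft // 2 + 1 power values."""
    window = hann(n_fft)
    n_bins = n_fft // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        # Step 1: cut out one frame and apply the window.
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Step 2: DFT of the frame (naive O(N^2) form; an FFT yields the same values).
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_bins)
        ]
        # Step 3: squared magnitude gives the power spectrum.
        frames.append([abs(c) ** 2 for c in spectrum])
    return frames


# Usage: a 1 kHz tone sampled at 16 kHz; bins are 16000 / 400 = 40 Hz apart,
# so the energy should concentrate around bin 1000 / 40 = 25.
sample_rate = 16000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(720)]
spec = power_spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

In practice an optimized STFT routine from a numerical library would replace the inner loops; the sketch only makes the windowing, transform, and squaring steps explicit.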

The STFT parameters determine the time-frequency resolution trade-off:

Parameter	Whisper Value	Effect
N_FFT (window size)	400 (25ms at 16kHz)	Determines frequency resolution
HOP_LENGTH (stride)	160 (10ms at 16kHz)	Determines time resolution (frame rate = 100 Hz)
Window function	Hann	Reduces spectral leakage
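
The values in the table follow from simple arithmetic on the sample rate. The short sketch below reproduces them; the constant names are illustrative, and the 16 kHz sample rate is taken from the table's "at 16kHz" annotations.

```python
SAMPLE_RATE = 16000  # Hz
N_FFT = 400          # samples per analysis window
HOP_LENGTH = 160     # samples between consecutive windows

window_ms = 1000.0 * N_FFT / SAMPLE_RATE    # 25.0 ms per window
hop_ms = 1000.0 * HOP_LENGTH / SAMPLE_RATE  # 10.0 ms between windows
frame_rate_hz = SAMPLE_RATE / HOP_LENGTH    # 100.0 frames per second
n_freq_bins = N_FFT // 2 + 1                # 201 non-negative frequency bins
bin_spacing_hz = SAMPLE_RATE / N_FFT        # 40.0 Hz between adjacent bins
overlap_fraction = 1.0 - HOP_LENGTH / N_FFT # consecutive windows share about 60% of their samples
```

A longer window would narrow the bin spacing (finer frequency resolution) while smearing events in time; a shorter hop would raise the frame rate at the cost of more frames to process.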

Mel Filterbank Projection

The power spectrum is projected onto the mel scale, which approximates human auditory frequency perception. A widely used formula for the mel scale (the HTK convention, with f in Hz) is:

mel(f) = 2595 * log10(1 + f / 700)

This mapping has two important properties:

Low frequencies are spread out — more filters in the region where human hearing has finer resolution
High frequencies are compressed — fewer filters where human hearing is less discriminative
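
Both properties can be checked numerically. The sketch below implements the formula above and its inverse, then places points at equal mel intervals between 0 Hz and 8000 Hz (the Nyquist frequency at 16 kHz). The helper names hz_to_mel and mel_to_hz are illustrative.

```python
import math


def hz_to_mel(f_hz):
    # Forward mapping: mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse mapping, obtained by solving the formula above for f.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Ten points equally spaced in mel, converted back to Hz.
n_points = 10
mel_max = hz_to_mel(8000.0)
edges_hz = [mel_to_hz(mel_max * i / (n_points - 1)) for i in range(n_points)]

# Gaps between neighbouring points, in Hz: they grow with frequency,
# so equal mel steps are narrow at low frequencies and wide at high ones.
gaps_hz = [b - a for a, b in zip(edges_hz, edges_hz[1:])]
```

Because the gaps widen monotonically, filters placed at equal mel intervals are dense at low frequencies and sparse at high frequencies.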

A bank of overlapping triangular filters is constructed with center frequencies equally spaced on the mel scale. Each filter computes a weighted sum of the power spectrum over its frequency range, producing a single value per filter per frame. Whisper uses either 80 or 128 mel filters depending on the model variant.
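
A minimal sketch of such a filterbank follows. It assumes the mel formula given earlier and unit-height triangles; the function names are hypothetical, and the filter weights a real Whisper model loads may differ in mel variant and normalization, so this illustrates the construction rather than reproducing those weights.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=80, n_fft=400, sample_rate=16000):
    """Return n_mels rows, each with one weight per STFT frequency bin."""
    n_bins = n_fft // 2 + 1
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced in mel:
    # each filter uses three consecutive edges (left, center, right).
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # 0 at the left edge, 1 at the center
            falling = (hi - f) / (hi - mid)   # 1 at the center, 0 at the right edge
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(bank, power_frame):
    # One weighted sum per filter: the matrix-vector product mel_filters @ power.
    return [sum(w * p for w, p in zip(row, power_frame)) for row in bank]


# Usage: project a flat power spectrum (201 bins) onto 80 mel bands.
bank = mel_filterbank()
mel_frame = apply_filterbank(bank, [1.0] * 201)
```

With a flat input spectrum the higher mel bands come out larger simply because their triangles cover more bins; area-normalized filterbanks exist to compensate for this, which the sketch omits.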

Log Compression

The mel-filtered power values span a very large dynamic range. Logarithmic compression is applied to:

Compress the dynamic range — making quiet and loud sounds more comparable
Approximate human loudness perception — which is approximately logarithmic
Stabilize training — by reducing the variance of input features

The specific log compression in Whisper is:

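Apply log10 to the mel spectrogram (flooring the input at a small positive value so that log10(0) never occurs)
Clamp the minimum value to max_value - 8.0 (an 80 dB dynamic range, since each unit of log10 power corresponds to 10 dB)
Shift and scale with (log_spec + 4.0) / 4.0, which maps the 8-unit span to a width of 2; the result lies exactly in [-1, 1] when max_value is 0 and is offset from that range otherwise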

The full computation can be summarized as:

log_spec = clamp(log10(mel_filters @ |STFT|^2), min=max-8)

normalized = (log_spec + 4.0) / 4.0
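
The two formulas above can be traced on a tiny example. The sketch below is plain Python over nested lists; the function name log_compress and the 1e-10 floor (used here to keep log10 defined for zero-valued inputs) are assumptions of this illustration.

```python
import math


def log_compress(mel_spec, floor=1e-10):
    """Apply log10, clamp to 8 log units below the peak, then shift and scale."""
    # Floor the power values so that log10 never receives zero.
    log_spec = [[math.log10(max(v, floor)) for v in row] for row in mel_spec]
    # Clamp everything more than 8 log10 units (80 dB) below the global maximum.
    peak = max(max(row) for row in log_spec)
    clamped = [[max(v, peak - 8.0) for v in row] for row in log_spec]
    # Shift and scale: (log_spec + 4.0) / 4.0
    return [[(v + 4.0) / 4.0 for v in row] for row in clamped]


# Usage: a 2 x 2 mel spectrogram whose peak power is 1.0 (log10 = 0).
mel_spec = [[1.0, 1e-3], [1e-12, 0.0]]
normalized = log_compress(mel_spec)
# log10 values: 0, -3, -10, -10; after clamping at 0 - 8: 0, -3, -8, -8;
# after normalization: approximately [[1.0, 0.25], [-1.0, -1.0]]
```

The example has its peak at log10 = 0, which is the case where the output spans exactly [-1, 1]; a louder or quieter peak shifts the whole range up or down by peak / 4.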

Output Representation

The resulting log-mel spectrogram is a 2D tensor of shape (n_mels, n_frames) where:

n_mels — number of mel frequency bins (80 or 128)
n_frames — number of time frames (depends on audio length; 3000 for 30 seconds)

Each column represents the frequency content of a 25ms audio window, with windows spaced 10ms apart.
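
The output shape follows directly from the hop length. A small sketch, assuming one frame per hop of 160 samples at 16 kHz (the helper names are illustrative, and edge handling in a real STFT may add or drop a frame):

```python
SAMPLE_RATE = 16000  # Hz
HOP_LENGTH = 160     # samples between frames


def n_frames(duration_s):
    # One frame per hop across the whole signal.
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


def frame_start_ms(frame_index):
    # Time offset of a frame relative to the first frame, in milliseconds.
    return 1000.0 * frame_index * HOP_LENGTH / SAMPLE_RATE


shape_80 = (80, n_frames(30))    # (80, 3000) for a 30-second input
shape_128 = (128, n_frames(30))  # (128, 3000) with 128 mel bins
last_offset_ms = frame_start_ms(n_frames(30) - 1)  # 29990.0
```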

Key Concepts

STFT — decomposes a waveform into overlapping windowed frequency spectra
Mel scale — perceptually motivated frequency warping that emphasizes speech-relevant frequencies
Triangular filterbank — set of overlapping triangular filters on the mel scale that integrate power spectrum energy
Log compression — logarithmic scaling to compress dynamic range and approximate human loudness perception
Dynamic range clamping — limits the minimum value to 80dB below the maximum to suppress noise floor
Feature normalization — shifts and scales the log spectrogram to an approximately [-1, 1] range for neural network input

References

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. https://arxiv.org/abs/2212.04356
Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A Scale for the Measurement of the Psychological Magnitude Pitch. Journal of the Acoustical Society of America.

Metadata

Speech_Recognition Signal_Processing Implementation:Openai_Whisper_Log_Mel_Spectrogram 2025-06-25 00:00 GMT
