
Principle:Openai Whisper Mel Spectrogram Computation

From Leeroopedia

Overview

Mel Spectrogram Computation is the process of converting a time-domain audio waveform into a time-frequency representation suitable for speech recognition. The log-mel spectrogram is the standard audio feature for modern automatic speech recognition (ASR) systems, including Whisper. It approximates human auditory perception through mel-scale frequency warping and compresses the dynamic range through logarithmic scaling.

Theoretical Background

Short-Time Fourier Transform (STFT)

The first step converts the time-domain waveform into a time-frequency representation:

  1. Window the audio signal into overlapping frames using a window function (e.g., Hann window)
  2. Compute the FFT for each frame to obtain the frequency spectrum
  3. Take the magnitude squared to obtain the power spectrum
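The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not Whisper's own code: the function name `stft_power` is hypothetical, and the centered, reflect-padded framing and the periodic Hann window are assumptions chosen to mirror the defaults of common STFT routines.

```python
import numpy as np

def stft_power(audio, n_fft=400, hop_length=160):
    """Return the power spectrum, shape (n_fft // 2 + 1, n_frames)."""
    # Periodic Hann window (assumed convention for this sketch).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    # Reflect-pad so that frame t is centered on sample t * hop_length.
    padded = np.pad(audio, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    # Step 1: slice overlapping frames and apply the window.
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = padded[idx] * window
    # Step 2: FFT of each frame (real input, so only non-negative frequencies).
    spectrum = np.fft.rfft(frames, axis=1)
    # Step 3: squared magnitude gives the power spectrum.
    return (np.abs(spectrum) ** 2).T
```

For one second of a 1 kHz sine sampled at 16 kHz, the energy in a mid-signal frame concentrates in the FFT bin nearest 1 kHz, which is a quick sanity check for the framing and windowing.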

The STFT parameters determine the time-frequency resolution trade-off:

Parameter            | Whisper value          | Effect
N_FFT (window size)  | 400 (25 ms at 16 kHz)  | Determines frequency resolution
HOP_LENGTH (stride)  | 160 (10 ms at 16 kHz)  | Determines time resolution (frame rate = 100 Hz)
Window function      | Hann                   | Reduces spectral leakage
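The durations and rates quoted for these parameters follow from simple arithmetic. The sketch below reproduces them; the constant names are illustrative, and the bin count and bin spacing are derived here from a real-input FFT rather than taken from the text above.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between successive window starts

window_ms = 1000 * N_FFT / SAMPLE_RATE      # 25.0 ms per window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE    # 10.0 ms between frames
frame_rate_hz = SAMPLE_RATE / HOP_LENGTH    # 100.0 frames per second
n_freq_bins = N_FFT // 2 + 1                # 201 bins from 0 Hz to Nyquist
bin_spacing_hz = SAMPLE_RATE / N_FFT        # 40.0 Hz between adjacent bins
```

A longer window would narrow the bin spacing (finer frequency resolution) at the cost of smearing events over more time, which is the trade-off these parameters control.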

Mel Filterbank Projection

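The power spectrum is projected onto the mel scale, which approximates human auditory frequency perception. Several conventions for the scale exist, so filterbanks built by different libraries can differ slightly; a widely used closed-form definition (the HTK convention) is: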

mel(f) = 2595 * log10(1 + f / 700)

This mapping has two important properties:

  • Low frequencies are spread out — more filters in the region where human hearing has finer resolution
  • High frequencies are compressed — fewer filters where human hearing is less discriminative
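Both properties can be checked numerically. The sketch below implements the formula above together with its algebraic inverse (the helper names `hz_to_mel` and `mel_to_hz` are illustrative), then places points at equal mel intervals and looks at how far apart they land in Hz.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Ten points equally spaced in mel between 0 Hz and 8000 Hz (Nyquist at 16 kHz).
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
hz_points = mel_to_hz(mel_points)

# Spacing between neighbouring points, in Hz: it widens as frequency rises.
hz_steps = np.diff(hz_points)
```

Because equal steps in mel become progressively larger steps in Hz, filters placed uniformly on the mel axis are dense at low frequencies and sparse at high ones.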

A bank of triangular filters is constructed on the mel scale. Each filter integrates the power spectrum over a range of frequencies, producing a single value per filter per frame. Whisper uses either 80 or 128 mel filters depending on the model variant.
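As a rough illustration of the construction, the sketch below builds unit-peak triangular filters whose edges are equally spaced on the mel axis. It is an assumption-laden teaching example rather than Whisper's filterbank: the function name is hypothetical, the triangles are not area-normalized, and a production filterbank may use a different mel convention and normalization.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def triangular_mel_filterbank(n_mels=80, n_fft=400, sample_rate=16_000):
    """Illustrative filterbank, shape (n_mels, n_fft // 2 + 1)."""
    nyquist = sample_rate / 2
    # Center frequency of every FFT bin, from 0 Hz up to Nyquist.
    fft_freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)
    # n_mels + 2 edges equally spaced in mel; filter i spans edges i .. i + 2.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), n_mels + 2)
    edges = mel_to_hz(mel_edges)

    filters = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (center - left)     # 0 at left, 1 at center
        falling = (right - fft_freqs) / (right - center)  # 1 at center, 0 at right
        # The triangle is the lower of the two ramps, floored at zero.
        filters[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return filters
```

Multiplying this matrix by a power spectrum of shape (n_fft // 2 + 1, n_frames) yields one value per filter per frame, which is the projection described above.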

Log Compression

The mel-filtered power values span a very large dynamic range. Logarithmic compression is applied to:

  • Compress the dynamic range — making quiet and loud sounds more comparable
  • Approximate human loudness perception — which is approximately logarithmic
  • Stabilize training — by reducing the variance of input features

The specific log compression in Whisper is:

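  1. Apply log10 to the mel spectrogram, after flooring the power at a small positive value (1e-10) so that silent bins do not produce negative infinity
  2. Clamp values from below at max_value - 8.0, where max_value is the largest entry of the log spectrogram; because the log is taken of power, 8.0 log10 units correspond to an 80 dB dynamic range
  3. Normalize to approximately [-1, 1] range: (log_spec + 4.0) / 4.0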

The full computation can be summarized as:

log_spec = clamp(log10(mel_filters @ |STFT|^2), min=max-8)

normalized = (log_spec + 4.0) / 4.0
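The summary above translates almost line for line into NumPy. In this sketch the function name `log_compress` is illustrative, and the 1e-10 floor applied before the logarithm is included to guard against log10(0).

```python
import numpy as np

def log_compress(mel_power):
    """Map mel-filtered power, shape (n_mels, n_frames), to normalized log-mel features."""
    # Floor the power before the log so silent bins do not produce -inf.
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    # Keep at most 8 log10 units (80 dB of power) below the loudest value.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and scale toward roughly [-1, 1].
    return (log_spec + 4.0) / 4.0
```

For example, power values of 1.0, 1e-4, and 0.0 have log10 values of 0, -4, and (after the floor) -10. The clamp raises the last to -8, and normalization maps the three to 1.0, 0.0, and -1.0.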

Output Representation

The resulting log-mel spectrogram is a 2D tensor of shape (n_mels, n_frames) where:

  • n_mels — number of mel frequency bins (80 or 128)
  • n_frames — number of time frames (depends on audio length; 3000 for 30 seconds)

Each column represents the frequency content of a 25ms audio window, with windows spaced 10ms apart.
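The frame count follows from the hop length. The helper below (a hypothetical name, assuming one frame per 160-sample hop at 16 kHz) computes the expected output shape for a clip of a given duration.

```python
def log_mel_shape(duration_s, n_mels=80, sample_rate=16_000, hop_length=160):
    """Expected (n_mels, n_frames) for a clip, counting one frame per hop."""
    n_samples = int(duration_s * sample_rate)
    return (n_mels, n_samples // hop_length)

shape_80 = log_mel_shape(30)                # (80, 3000)
shape_128 = log_mel_shape(30, n_mels=128)   # (128, 3000)
```

Only the first dimension changes between the 80-filter and 128-filter variants; the time axis depends solely on the audio length and hop.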

Key Concepts

  • STFT — decomposes a waveform into overlapping windowed frequency spectra
  • Mel scale — perceptually motivated frequency warping that emphasizes speech-relevant frequencies
  • Triangular filterbank — set of overlapping triangular filters on the mel scale that integrate power spectrum energy
  • Log compression — logarithmic scaling to compress dynamic range and approximate human loudness perception
  • Dynamic range clamping — limits the minimum value to 80dB below the maximum to suppress noise floor
  • Feature normalization — shifts and scales the log spectrogram to an approximately [-1, 1] range for neural network input

References

  • Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. https://arxiv.org/abs/2212.04356
  • Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A Scale for the Measurement of the Psychological Magnitude Pitch. Journal of the Acoustical Society of America.

Metadata

Speech_Recognition Signal_Processing Implementation:Openai_Whisper_Log_Mel_Spectrogram 2025-06-25 00:00 GMT
