Implementation:Openai Whisper Log Mel Spectrogram

From Leeroopedia

Overview

whisper.log_mel_spectrogram() computes a log-scaled mel spectrogram from an audio waveform or file path. This is the primary feature extraction function that converts raw audio into the input representation expected by Whisper's encoder.

Source

Signature

def log_mel_spectrogram(
    audio: Union[str, np.ndarray, torch.Tensor],
    n_mels: int = 80,
    padding: int = 0,
    device: Optional[Union[str, torch.device]] = None,
) -> torch.Tensor:

Import

from whisper.audio import log_mel_spectrogram
# or
import whisper  # re-exported as whisper.log_mel_spectrogram

Parameters

Parameter | Type | Default | Description
audio | Union[str, np.ndarray, torch.Tensor] | (required) | Audio file path (string), numpy waveform array, or torch tensor
n_mels | int | 80 | Number of mel frequency bins. Use 80 for most models; use 128 for large-v3 and turbo
padding | int | 0 | Number of zero samples to pad on the right side of the audio before STFT computation
device | Optional[Union[str, torch.device]] | None | If given, the audio is moved to this device before the STFT, so both the computation and the output tensor live there; None keeps the input's device

Internal Constants

Constant | Value | Description
N_FFT | 400 | FFT window size (25 ms at 16 kHz)
HOP_LENGTH | 160 | Hop length between frames (10 ms at 16 kHz)
Window | Hann | Window function applied to each frame
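The timing figures in the table follow directly from the constants. A minimal sketch of that arithmetic, assuming a 16 kHz sample rate; the constants are redefined locally here so the snippet runs without Whisper installed:

```python
SAMPLE_RATE = 16_000  # Hz; assumed from the "at 16kHz" notes in the table
N_FFT = 400           # samples per FFT window
HOP_LENGTH = 160      # samples between consecutive frames

window_ms = 1000 * N_FFT / SAMPLE_RATE         # window duration in milliseconds
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # frame step in milliseconds
frames_per_second = SAMPLE_RATE // HOP_LENGTH  # frame rate of the spectrogram
n_freq_bins = N_FFT // 2 + 1                   # one-sided FFT bins per frame

print(window_ms, hop_ms, frames_per_second, n_freq_bins)  # 25.0 10.0 100 201
```

At 100 frames per second, a 30-second clip yields 3000 frames, which matches the shape shown at the end of the Example section.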

Inputs and Outputs

Inputs

  • An audio file path (string), a numpy array of float32 waveform data, or a torch Tensor of waveform data. Waveforms passed directly are expected to be sampled at 16 kHz

Outputs

  • A torch.Tensor of shape (n_mels, n_frames) containing the log-mel spectrogram, where n_frames is the padded audio length divided by the hop length, rounded down: (num_samples + padding) // HOP_LENGTH
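The output shape can be predicted before running the function. The helper below is hypothetical (it is not part of Whisper) and assumes a centered STFT whose final frame is discarded, so the frame count is the padded sample count floor-divided by the hop length:

```python
HOP_LENGTH = 160  # samples between frames, as listed under Internal Constants


def expected_mel_shape(num_samples, n_mels=80, padding=0):
    """Hypothetical helper: predict the (n_mels, n_frames) output shape."""
    n_frames = (num_samples + padding) // HOP_LENGTH
    return (n_mels, n_frames)


print(expected_mel_shape(480_000))                             # (80, 3000)
print(expected_mel_shape(480_000, n_mels=128, padding=480_000))  # (128, 6000)
```

The first call corresponds to 30 seconds of 16 kHz audio; the second adds another 30 seconds of zero padding and uses 128 mel bins.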

Behavior

  1. If audio is a string, calls load_audio() to decode the file
  2. If audio is a numpy array, converts to a torch.Tensor
  3. If padding > 0, pads the audio with zeros on the right using torch.nn.functional.pad()
  4. Computes the STFT using torch.stft() with a Hann window of size N_FFT and hop length HOP_LENGTH
  5. Drops the final STFT frame, then takes the squared magnitude of the complex output to get the power spectrum
  6. Loads the pre-computed mel filterbank matrix for the specified n_mels count
  7. Applies the mel filterbank via matrix multiplication: mel_spec = mel_filters @ magnitudes
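  8. Clamps the mel spectrogram to a minimum of 1e-10 (to avoid taking the log of zero), then applies log10 scaling
  9. Clamps the minimum to maximum_value - 8.0, where maximum_value is the largest entry of the log spectrogram (8 log10 units of power, i.e. an 80 dB dynamic range)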
  10. Normalizes the result: (log_spec + 4.0) / 4.0
  11. Returns the log-mel spectrogram tensor, on the specified device if one was given and otherwise on the input's device
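The numeric core of the steps above (padding, STFT, power, filterbank, log scaling, clamping, normalization) can be sketched in NumPy. This is an approximation for illustration, not Whisper's own code: log_mel_sketch is a hypothetical name, the filterbank is a random placeholder rather than the real mel weights, and the centered STFT with reflect padding and a periodic Hann window is an assumption about torch.stft defaults. The 1e-10 floor before the logarithm guards against log of zero.

```python
import numpy as np

N_FFT = 400
HOP_LENGTH = 160


def log_mel_sketch(audio, filters, padding=0):
    """NumPy approximation of the documented steps (not Whisper's own code)."""
    audio = np.asarray(audio, dtype=np.float64)
    if padding > 0:
        audio = np.pad(audio, (0, padding))            # zeros on the right
    # Centered STFT: assume reflect padding of N_FFT // 2 samples per side.
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    window = np.hanning(N_FFT + 1)[:-1]                # periodic Hann window
    n_frames = len(audio) // HOP_LENGTH                # final frame dropped
    frames = np.stack(
        [padded[i * HOP_LENGTH : i * HOP_LENGTH + N_FFT] for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames * window, axis=1)    # (n_frames, 201)
    power = np.abs(spectrum.T) ** 2                    # power spectrum, (201, n_frames)
    mel_spec = filters @ power                         # (n_mels, n_frames)
    log_spec = np.log10(np.maximum(mel_spec, 1e-10))   # floor, then log10
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # 80 dB range clamp
    return (log_spec + 4.0) / 4.0                      # normalization


rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)           # one second of noise at 16 kHz
filters = rng.random((80, N_FFT // 2 + 1))   # placeholder, NOT real mel filters
mel = log_mel_sketch(audio, filters)
print(mel.shape)  # (80, 100)
```

Because the filterbank here is random, the values are not comparable to real Whisper features; only the shapes and the order of operations carry over.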

Example

import whisper

# Compute from a file path directly
mel = whisper.log_mel_spectrogram("speech.mp3")
print(mel.shape)    # torch.Size([80, N]) where N depends on audio length

# Compute from a pre-loaded audio array
audio = whisper.load_audio("speech.mp3")
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)    # torch.Size([80, N])

# Use 128 mel bins for large-v3 or turbo models (device="cuda" requires a CUDA-capable GPU)
mel_128 = whisper.log_mel_spectrogram(audio, n_mels=128, device="cuda")
print(mel_128.shape)  # torch.Size([128, N])

# With padding for sliding window processing (480000 samples = 30 s at 16 kHz)
mel_padded = whisper.log_mel_spectrogram(audio, padding=480000)
print(mel_padded.shape)  # torch.Size([80, M]) where M = N + 3000 due to padding

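# Typical full preprocessing pipeline
model = whisper.load_model("base")   # any model; "base" expects 80 mel bins
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)   # pad or trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
print(mel.shape)    # torch.Size([80, 3000]), ready for the encoder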

Notes

  • The mel filterbank weights are loaded from a pre-computed asset file (assets/mel_filters.npz), not computed at runtime
  • The function accepts file paths for convenience, but for batch processing it is more efficient to load audio once and pass numpy arrays or tensors
  • The n_mels parameter must match the model's expected input: 80 for most models, 128 for large-v3 and turbo
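  • The normalization (log_spec + 4.0) / 4.0 is applied after the 8-unit clamp, so the output spans at most 2.0; it lies in [-1, 1] when the peak log10 mel power is 0, which is approximately the case for typical speech
  • The 80 dB dynamic range clamp prevents the noise floor from dominating the feature representation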
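The interaction between the clamp and the normalization can be checked with a few lines of arithmetic. The function below is a hypothetical illustration, not part of Whisper: given the peak log10 mel power, it returns the smallest and largest values the normalized spectrogram can take.

```python
from typing import Tuple


def normalized_bounds(peak_log10_power: float) -> Tuple[float, float]:
    """Hypothetical helper: output range after the clamp and normalization."""
    floor = peak_log10_power - 8.0   # clamp: nothing falls more than 8 below the peak
    low = (floor + 4.0) / 4.0        # normalization applied to the floor
    high = (peak_log10_power + 4.0) / 4.0  # normalization applied to the peak
    return (low, high)


print(normalized_bounds(0.0))   # (-1.0, 1.0)
print(normalized_bounds(2.0))   # (-0.5, 1.5)
```

The width of the range is always 2.0; only its position shifts with the peak level of the input.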

Metadata

Principle:Openai_Whisper_Mel_Spectrogram_Computation Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT
