Implementation:Openai Whisper Log Mel Spectrogram

Overview

whisper.log_mel_spectrogram() computes a log-scaled mel spectrogram from an audio waveform or file path. This is the primary feature extraction function that converts raw audio into the input representation expected by Whisper's encoder.

Source

File: whisper/audio.py:L110-157
Repository: https://github.com/openai/whisper

Signature

def log_mel_spectrogram(
    audio: Union[str, np.ndarray, torch.Tensor],
    n_mels: int = 80,
    padding: int = 0,
    device: Optional[Union[str, torch.device]] = None,
) -> torch.Tensor:

Import

from whisper.audio import log_mel_spectrogram
# or
import whisper  # re-exported as whisper.log_mel_spectrogram

Parameters

Parameter	Type	Default	Description
audio	Union[str, np.ndarray, torch.Tensor]	(required)	Audio file path (string), numpy waveform array, or torch tensor
n_mels	int	80	Number of mel frequency bins. Use 80 for most models; use 128 for large-v3 and turbo
padding	int	0	Number of zero samples to pad on the right side of the audio before STFT computation
device	Optional[Union[str, torch.device]]	None	Target device for the output tensor and computation

Internal Constants

Constant	Value	Description
N_FFT	400	FFT window size (25ms at 16kHz)
HOP_LENGTH	160	Hop length between frames (10ms at 16kHz)
Window	Hann	Window function applied to each frame
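
The millisecond figures in the table follow from the 16 kHz sample rate. The short arithmetic check below makes that explicit; SAMPLE_RATE is written out here as an assumption taken from the "at 16kHz" notes above, and the derived names are illustrative only.

```python
SAMPLE_RATE = 16000  # Hz, assumed from the "at 16kHz" notes in the table
N_FFT = 400
HOP_LENGTH = 160

window_ms = 1000 * N_FFT / SAMPLE_RATE        # duration of one analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # time step between adjacent frames
frames_per_second = SAMPLE_RATE // HOP_LENGTH
freq_bins = N_FFT // 2 + 1                    # one-sided FFT bins per frame

print(window_ms, hop_ms, frames_per_second, freq_bins)  # 25.0 10.0 100 201
```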

Inputs and Outputs

Inputs

An audio file path (string), a numpy array of float32 waveform data, or a torch Tensor of waveform data

Outputs

A torch.Tensor of shape (n_mels, n_frames) containing the log-mel spectrogram, where n_frames is the padded sample count divided by HOP_LENGTH: (num_samples + padding) // HOP_LENGTH. A 30-second clip at 16kHz (480,000 samples) therefore yields 3000 frames
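
As a rough guide to output sizes, the frame count can be estimated before calling the function. The helper below is hypothetical (it is not part of the whisper package) and assumes the frame count is the integer quotient of the padded sample count and HOP_LENGTH, which agrees with the 3000-frame shape shown for 30-second input in the Example section.

```python
HOP_LENGTH = 160

def expected_n_frames(num_samples: int, padding: int = 0) -> int:
    """Hypothetical helper: estimated frame count for a waveform plus right padding."""
    return (num_samples + padding) // HOP_LENGTH

print(expected_n_frames(480000))                 # 3000 (30 s at 16 kHz)
print(expected_n_frames(16000, padding=480000))  # 3100 (1 s of audio plus 30 s of padding)
```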

Behavior

If audio is a string, calls load_audio() to decode the file
If audio is a numpy array, converts to a torch.Tensor
If padding > 0, pads the audio with zeros on the right using torch.nn.functional.pad()
Computes the STFT using torch.stft() with a Hann window of size N_FFT and hop length HOP_LENGTH
Drops the final STFT frame and takes the magnitude squared of the remaining complex output to get the power spectrum
Loads the pre-computed mel filterbank matrix for the specified n_mels count
Applies the mel filterbank via matrix multiplication: mel_spec = mel_filters @ magnitudes
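Applies log10 scaling to the mel spectrogram, after clamping values to a floor of 1e-10 so the logarithm is always defined
Raises every value below log_spec.max() - 8.0 to that floor (8.0 log10 units of power, i.e. an 80dB dynamic range)
Normalizes the result: (log_spec + 4.0) / 4.0
Returns the log-mel spectrogram tensor. When device is given, the audio is moved there before the STFT, so both the computation and the result live on that device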
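
The steps above can be sketched end to end in plain NumPy. This is an illustration under stated assumptions, not the Whisper source: it uses NumPy instead of torch, assumes centered frames with reflect padding and a periodic Hann window (believed to match the torch.stft and torch.hann_window defaults), and substitutes a random matrix (toy_filters) for the real mel filterbank. The 1e-10 floor and the dropped final frame should be verified against whisper/audio.py.

```python
import numpy as np

N_FFT = 400
HOP_LENGTH = 160

def log_mel_sketch(audio: np.ndarray, filters: np.ndarray, padding: int = 0) -> np.ndarray:
    """Illustrative NumPy version of the listed steps; not the Whisper implementation."""
    if padding > 0:
        audio = np.pad(audio, (0, padding))  # zeros on the right side only

    # Centered frames with reflect padding (assumed to mirror torch.stft defaults)
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP_LENGTH
    frames = np.stack(
        [padded[i * HOP_LENGTH : i * HOP_LENGTH + N_FFT] for i in range(n_frames)]
    )

    # Periodic Hann window on every frame, then a one-sided FFT
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    stft = np.fft.rfft(frames * window, axis=-1).T  # (N_FFT // 2 + 1, n_frames)

    # Power spectrum, with the final frame dropped
    magnitudes = np.abs(stft[:, :-1]) ** 2

    # Project the linear-frequency bins onto the mel bins
    mel_spec = filters @ magnitudes

    # Log compression, 8.0 log10-unit floor below the peak, then normalization
    log_spec = np.log10(np.maximum(mel_spec, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0

rng = np.random.default_rng(0)
toy_audio = rng.standard_normal(16000).astype(np.float32)  # 1 s of noise at 16 kHz
toy_filters = rng.random((80, N_FFT // 2 + 1))  # stand-in, NOT the real mel filterbank
log_mel = log_mel_sketch(toy_audio, toy_filters)
print(log_mel.shape)  # (80, 100)
```

Because the filterbank is a placeholder, only the shapes and the value range of this sketch are meaningful; the numbers will not match real Whisper features.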

Example

import whisper

# Compute from a file path directly
mel = whisper.log_mel_spectrogram("speech.mp3")
print(mel.shape)    # torch.Size([80, N]) where N depends on audio length

# Compute from a pre-loaded audio array
audio = whisper.load_audio("speech.mp3")
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)    # torch.Size([80, N])

# Use 128 mel bins for large-v3 or turbo models (device="cuda" requires a CUDA-capable GPU)
mel_128 = whisper.log_mel_spectrogram(audio, n_mels=128, device="cuda")
print(mel_128.shape)  # torch.Size([128, N])

# With padding for sliding window processing
mel_padded = whisper.log_mel_spectrogram(audio, padding=480000)
print(mel_padded.shape)  # torch.Size([80, M]) where M > N due to padding

# Typical full preprocessing pipeline
model = whisper.load_model("base")
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)  # pad or cut to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
print(mel.shape)    # torch.Size([80, 3000]), ready for the encoder

Notes

The mel filterbank weights are loaded from a pre-computed asset file (assets/mel_filters.npz), not computed at runtime
The function accepts file paths for convenience, but for batch processing it is more efficient to load audio once and pass numpy arrays or tensors
The n_mels parameter must match the model's expected input: 80 for most models, 128 for large-v3 and turbo
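Because the clamp keeps every value within 8.0 log10 units of the peak, the normalization (log_spec + 4.0) / 4.0 produces an output that spans at most 2.0; it lands on [-1, 1] when the peak log10 power is 0, which is approximately the case for typical speech
The 80dB dynamic range clamp prevents the noise floor from dominating the feature representation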
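
A small worked example may make the last two notes concrete. The normalize function below is a hypothetical scalar restatement of the clamp and normalization steps, and the peak of 0.0 is an assumed value chosen so the output lands exactly on [-1, 1].

```python
def normalize(log_power: float, peak: float) -> float:
    """Scalar version of the clamp-then-normalize steps (illustrative only)."""
    clamped = max(log_power, peak - 8.0)  # nothing sits more than 8.0 below the peak
    return (clamped + 4.0) / 4.0

# log10 of power spans 8.0 units, and decibels for power are 10 * log10
dynamic_range_db = 10 * 8.0
print(dynamic_range_db)  # 80.0

peak = 0.0  # assumed peak log10 power
for log_power in (0.0, -3.0, -8.0, -10.0):
    print(log_power, normalize(log_power, peak))
# 0.0 -> 1.0, -3.0 -> 0.25, -8.0 -> -1.0, -10.0 -> -1.0 (clamped to the floor)
```

With a different peak the whole range shifts by peak / 4.0, but its width stays at most 2.0.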

Metadata

Principle:Openai_Whisper_Mel_Spectrogram_Computation Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT
