Implementation:Openai Whisper Log Mel Spectrogram
Overview
whisper.log_mel_spectrogram() computes a log-scaled mel spectrogram from an audio waveform or file path. This is the primary feature extraction function that converts raw audio into the input representation expected by Whisper's encoder.
Source
- File: whisper/audio.py:L110-157
- Repository: https://github.com/openai/whisper
Signature
def log_mel_spectrogram(
audio: Union[str, np.ndarray, torch.Tensor],
n_mels: int = 80,
padding: int = 0,
device: Optional[Union[str, torch.device]] = None,
) -> torch.Tensor:
Import
from whisper.audio import log_mel_spectrogram
# or
import whisper # re-exported as whisper.log_mel_spectrogram
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio | Union[str, np.ndarray, torch.Tensor] | (required) | Audio file path (string), numpy waveform array, or torch tensor |
| n_mels | int | 80 | Number of mel frequency bins. Use 80 for most models; use 128 for large-v3 and turbo |
| padding | int | 0 | Number of zero samples to pad on the right side of the audio before STFT computation |
| device | Optional[Union[str, torch.device]] | None | Device the audio is moved to before the STFT; the returned tensor stays on that device. None leaves the audio where it already is |
Internal Constants
| Constant | Value | Description |
|---|---|---|
| N_FFT | 400 | FFT window size (25ms at 16kHz) |
| HOP_LENGTH | 160 | Hop length between frames (10ms at 16kHz) |
| Window | Hann | Window function applied to each frame |
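The timing figures in the table follow from simple arithmetic on the constants. The short check below is only an illustration; `SAMPLE_RATE` is a name introduced here from the "at 16kHz" annotations, and the overlap and bin-count values are derived quantities rather than constants named in the source.

```python
# Constants as listed in the table above; SAMPLE_RATE comes from the
# "at 16kHz" annotations and is an assumed name for this sketch.
SAMPLE_RATE = 16_000  # samples per second
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between successive frame starts

window_ms = 1000 * N_FFT / SAMPLE_RATE        # duration of one window in ms
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # stride between frames in ms
frames_per_second = SAMPLE_RATE // HOP_LENGTH
overlap = 1 - HOP_LENGTH / N_FFT              # fraction of a window shared with the next one
n_freq_bins = N_FFT // 2 + 1                  # one-sided FFT bins per frame

print(window_ms, hop_ms, frames_per_second, n_freq_bins)
```

Each frame therefore covers 25 ms of audio, a new frame starts every 10 ms, and consecutive windows share most of their samples.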
Inputs and Outputs
Inputs
- An audio file path (string), a numpy array of float32 waveform data, or a torch Tensor of waveform data
Outputs
- A torch.Tensor of shape (n_mels, n_frames) containing the log-mel spectrogram, where n_frames is about (num_samples + padding) / HOP_LENGTH, for example 3000 frames for 30 seconds of 16kHz audio
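To make the output width concrete, the sketch below estimates the frame count from the sample count. `expected_n_frames` is a hypothetical helper, not part of Whisper, and it assumes a centered STFT whose final frame is discarded, which is why the count comes out as an exact floor division.

```python
HOP_LENGTH = 160  # from the Internal Constants table


def expected_n_frames(num_samples: int, padding: int = 0) -> int:
    """Hypothetical helper: frame count assuming a centered STFT with the last frame dropped."""
    return (num_samples + padding) // HOP_LENGTH


print(expected_n_frames(480_000))          # 30 s at 16 kHz -> 3000
print(expected_n_frames(16_000))           # 1 s -> 100
print(expected_n_frames(16_000, 480_000))  # 1 s plus 30 s of right padding -> 3100
```

The 30-second case matches the (80, 3000) shape produced after pad_or_trim in the example section.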
Behavior
- If audio is a string, calls load_audio() to decode the file
- If audio is a numpy array, converts to a torch.Tensor
- If padding > 0, pads the audio with zeros on the right using torch.nn.functional.pad()
- Computes the STFT using torch.stft() with a Hann window of size N_FFT and hop length HOP_LENGTH
- Drops the final STFT frame and takes the squared magnitude of the complex output to get the power spectrum
- Loads the pre-computed mel filterbank matrix for the specified n_mels count
- Applies the mel filterbank via matrix multiplication: mel_spec = mel_filters @ magnitudes
- Clamps the mel spectrogram to a minimum of 1e-10 (avoiding log10 of zero) and applies log10 scaling
- Raises every value below log_spec.max() - 8.0 up to that floor, limiting the dynamic range to 80dB
- Normalizes the result: (log_spec + 4.0) / 4.0
- Returns the log-mel spectrogram tensor on the specified device
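The steps above can be sketched in NumPy to show how the shapes and value ranges evolve. This is an illustrative approximation, not Whisper's code: the real function uses torch.stft() and the pre-computed mel weights, whereas `stand_in_filterbank` and `log_mel_sketch` are hypothetical names, the evenly spaced triangular filters are a placeholder for the mel filterbank, and the 1e-10 floor before log10 is an assumption about how zeros are handled.

```python
import numpy as np

N_FFT = 400
HOP_LENGTH = 160


def stand_in_filterbank(n_mels: int, n_freqs: int) -> np.ndarray:
    """Hypothetical placeholder: evenly spaced triangular filters, NOT Whisper's mel weights."""
    centers = np.linspace(0, n_freqs - 1, n_mels + 2)
    bins = np.arange(n_freqs)
    filters = np.zeros((n_mels, n_freqs))
    for i in range(n_mels):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        filters[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return filters


def log_mel_sketch(audio, n_mels: int = 80, padding: int = 0) -> np.ndarray:
    audio = np.asarray(audio, dtype=np.float32)
    if padding > 0:
        audio = np.pad(audio, (0, padding))            # zeros on the right only
    # Centered STFT: reflect-pad half a window on each side, then slide a periodic Hann window.
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    window = np.hanning(N_FFT + 1)[:-1]
    n_frames = 1 + (len(padded) - N_FFT) // HOP_LENGTH
    frames = np.stack(
        [padded[i * HOP_LENGTH : i * HOP_LENGTH + N_FFT] for i in range(n_frames)]
    )
    stft = np.fft.rfft(frames * window, axis=1).T      # shape (n_freqs, n_frames)
    magnitudes = np.abs(stft[:, :-1]) ** 2             # power spectrum, last frame dropped
    mel_spec = stand_in_filterbank(n_mels, magnitudes.shape[0]) @ magnitudes
    log_spec = np.log10(np.maximum(mel_spec, 1e-10))   # assumed floor to avoid log10(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # keep 8 log10 units below the peak
    return (log_spec + 4.0) / 4.0


# One second of a 440 Hz tone at 16 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(16_000) / 16_000)
print(log_mel_sketch(tone).shape)  # (80, 100)
```

Because the filterbank is a stand-in, the numeric values will not match Whisper's output; only the shapes and the clamp and normalization behavior carry over.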
Example
import whisper
# Compute from a file path directly
mel = whisper.log_mel_spectrogram("speech.mp3")
print(mel.shape) # torch.Size([80, N]) where N depends on audio length
# Compute from a pre-loaded audio array
audio = whisper.load_audio("speech.mp3")
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape) # torch.Size([80, N])
# Use 128 mel bins for large-v3 or turbo models
mel_128 = whisper.log_mel_spectrogram(audio, n_mels=128, device="cuda")
print(mel_128.shape) # torch.Size([128, N])
# With padding for sliding window processing
mel_padded = whisper.log_mel_spectrogram(audio, padding=480000)
print(mel_padded.shape) # torch.Size([80, M]) where M > N due to padding
# Typical full preprocessing pipeline
model = whisper.load_model("base") # any 80-mel model; provides model.device below
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape) # torch.Size([80, 3000]), ready for encoder
Notes
- The mel filterbank weights are loaded from a pre-computed asset file (assets/mel_filters.npz), not computed at runtime
- The function accepts file paths for convenience, but for batch processing it is more efficient to load audio once and pass numpy arrays or tensors
- The n_mels parameter must match the model's expected input: 80 for most models, 128 for large-v3 and turbo
- The normalization (log_spec + 4.0) / 4.0 is applied after the clamp, so the output always spans at most 2.0 units; for typical speech this lands in approximately [-1, 1]
- The 80dB dynamic range clamp prevents the noise floor from dominating the feature representation
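The interaction between the clamp and the normalization can be checked with a little arithmetic. `normalized_range` below is a hypothetical helper written for this note; it only assumes the two formulas listed under Behavior.

```python
from typing import Tuple


def normalized_range(peak_log10_power: float) -> Tuple[float, float]:
    """Hypothetical helper: output bounds after the 8.0 clamp and (x + 4.0) / 4.0 scaling."""
    floor = peak_log10_power - 8.0  # the clamp keeps values within 8 log10 units of the peak
    return ((floor + 4.0) / 4.0, (peak_log10_power + 4.0) / 4.0)


print(normalized_range(0.0))  # (-1.0, 1.0)
print(normalized_range(2.0))  # (-0.5, 1.5)
print(10 * 8.0)               # 80.0 dB: 8 log10 units on a power scale
```

The output interval is always 2.0 wide and coincides with [-1, 1] only when the loudest mel bin has a log10 power of 0; louder or quieter recordings shift the interval without changing its width.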
Metadata
Principle:Openai_Whisper_Mel_Spectrogram_Computation Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT