Implementation:Openai Whisper Pad Or Trim
Overview
whisper.pad_or_trim() standardizes the length of an audio waveform or mel spectrogram to a fixed size by either zero-padding or trimming. This ensures all inputs to the Whisper encoder have the expected temporal dimension.
Source
- File: whisper/audio.py:L65-88
- Repository: https://github.com/openai/whisper
Signature
def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
Where N_SAMPLES = 480000 (30 seconds at 16kHz).
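The default can be derived from the audio constants. The sketch below restates the values assumed to match whisper/audio.py (16 kHz sample rate, 30-second chunks, hop length of 160 samples) as plain Python instead of importing them, so it runs without the package installed:

```python
# Values restated from whisper/audio.py (assumed, not imported) so the
# arithmetic can be checked without installing the package.
SAMPLE_RATE = 16000   # samples per second
CHUNK_LENGTH = 30     # seconds per encoder window
HOP_LENGTH = 160      # samples between successive mel frames

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # samples in one 30-second window
N_FRAMES = N_SAMPLES // HOP_LENGTH      # mel frames in one 30-second window

print(N_SAMPLES, N_FRAMES)  # 480000 3000
```

This is why a waveform is padded to 480000 samples while a mel spectrogram is padded to 3000 frames.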
Import
from whisper.audio import pad_or_trim
# or
import whisper # re-exported as whisper.pad_or_trim
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| array | Union[np.ndarray, torch.Tensor] | (required) | Audio waveform or mel spectrogram to pad or trim |
| length | int | 480000 | Target length along the specified axis. Use N_SAMPLES (480000) for waveforms or N_FRAMES (3000) for mel spectrograms |
| axis | int | -1 | Dimension along which to pad or trim (keyword-only argument) |
Inputs and Outputs
Inputs
- An audio numpy array or torch Tensor of variable length (waveform or spectrogram)
Outputs
- An array or tensor of exactly the specified length along the given axis, with the same type as the input
Behavior
The function handles both numpy arrays and torch Tensors with type-appropriate operations:
If the input is longer than length (trimming)
- Selects the first length elements along the specified axis (index_select() for torch Tensors, take() for numpy arrays)
- Returns the trimmed array/tensor, which is a new object rather than a view of the input
If the input is shorter than length (zero-padding)
- Computes the pad width as length - array.shape[axis]
- For torch Tensors: uses torch.nn.functional.pad() with zero-padding on the appropriate dimension
- For numpy arrays: uses np.pad() with zero-padding on the appropriate axis
- Returns the padded array/tensor
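For the torch branch, "the appropriate dimension" needs some care: torch.nn.functional.pad() takes a flat list of (left, right) pairs that starts from the last dimension, whereas np.pad() takes one pair per axis in order. A minimal pure-Python sketch of that conversion follows; the helper name flat_pad_for_axis is hypothetical and only illustrates the ordering:

```python
def flat_pad_for_axis(ndim: int, axis: int, amount: int) -> list:
    """Build the flat pad list in the order torch.nn.functional.pad expects.

    Hypothetical helper for illustration only. F.pad reads (left, right)
    pairs starting from the LAST dimension, so the per-axis widths are
    reversed before being flattened.
    """
    pad_widths = [(0, 0)] * ndim
    pad_widths[axis] = (0, amount)  # pad only at the end of `axis`
    return [p for pair in pad_widths[::-1] for p in pair]


# 2-D mel spectrogram (n_mels, n_frames): pad 500 frames on the last axis
print(flat_pad_for_axis(ndim=2, axis=-1, amount=500))  # [0, 500, 0, 0]
# Same shape, padding 3 rows along axis 0 instead
print(flat_pad_for_axis(ndim=2, axis=0, amount=3))     # [0, 0, 0, 3]
```

The un-reversed list pad_widths is already in the form np.pad() accepts, which is why the numpy branch needs no reordering.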
If the input is exactly the target length
- Returns the input unchanged: neither the trimming nor the padding branch runs, so no copy is made
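The three cases above can be sketched with NumPy alone. This is an illustrative reimplementation under the name pad_or_trim_np (hypothetical), not the library source, and it omits the torch branch:

```python
import numpy as np


def pad_or_trim_np(array: np.ndarray, length: int, *, axis: int = -1) -> np.ndarray:
    """NumPy-only sketch of the trim / pad / pass-through cases."""
    size = array.shape[axis]
    if size > length:
        # Trim: keep the first `length` elements along `axis`.
        return array.take(indices=range(length), axis=axis)
    if size < length:
        # Pad: append zeros at the end of `axis` only.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - size)
        return np.pad(array, pad_widths)
    # Exact length: hand back the same object, no copy.
    return array


short = np.ones(3, dtype=np.float32)
long_ = np.arange(8, dtype=np.float32)
exact = np.zeros(5, dtype=np.float32)
grid = np.ones((2, 3), dtype=np.float32)

print(pad_or_trim_np(short, 5))               # [1. 1. 1. 0. 0.]
print(pad_or_trim_np(long_, 5))               # [0. 1. 2. 3. 4.]
print(pad_or_trim_np(exact, 5) is exact)      # True
print(pad_or_trim_np(grid, 4, axis=0).shape)  # (4, 3)
```

The last call shows the keyword-only axis argument: the 2x3 array gains two rows of zeros along axis 0 while its last axis is left alone.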
Example
import whisper
from whisper.audio import N_FRAMES
# Pad or trim a raw audio waveform to 30 seconds
audio = whisper.load_audio("speech.mp3")
print(audio.shape) # e.g., (352000,) for ~22 seconds of audio
audio = whisper.pad_or_trim(audio)
print(audio.shape) # (480000,) — padded with zeros to 30 seconds
# Trim a long audio file
long_audio = whisper.load_audio("lecture.wav")
print(long_audio.shape) # e.g., (9600000,) for 10 minutes
trimmed = whisper.pad_or_trim(long_audio)
print(trimmed.shape) # (480000,) — trimmed to first 30 seconds
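# Works on mel spectrograms too (audio is already 30 seconds here, so the mel already has N_FRAMES frames and this call is a no-op)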
mel = whisper.log_mel_spectrogram(audio)
mel = whisper.pad_or_trim(mel, N_FRAMES)
print(mel.shape) # torch.Size([80, 3000])
# Custom length for non-standard use
short_clip = whisper.pad_or_trim(audio, length=160000) # 10 seconds at 16kHz
print(short_clip.shape) # (160000,)
Notes
- The default length=480000 corresponds to exactly 30 seconds at 16kHz and should be used for raw waveforms
- For mel spectrograms, use N_FRAMES=3000 as the length parameter
- The function preserves the input type: numpy input produces numpy output, torch input produces torch output
- Zero-padding appends silence at the end of the audio, preserving temporal alignment of the content at the start
- pad_or_trim() keeps only the first length elements and discards the rest, so a caller that uses it directly on long audio is responsible for implementing a sliding window strategy over 30-second chunks
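A minimal sketch of such a windowing loop, using NumPy only. The helper iter_windows is hypothetical, and the fixed, non-overlapping stride is an assumption made for brevity; the high-level whisper.transcribe() performs its own windowing and may not advance through the audio this way:

```python
import numpy as np

N_SAMPLES = 480000  # 30 seconds at 16 kHz


def iter_windows(audio: np.ndarray, window: int = N_SAMPLES):
    """Yield (start, chunk) for consecutive, non-overlapping windows.

    Hypothetical helper: every chunk has exactly `window` samples, with
    the final one zero-padded at the end if the audio runs out.
    """
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        if len(chunk) < window:
            chunk = np.pad(chunk, (0, window - len(chunk)))
        yield start, chunk


# 70 seconds of synthetic audio: two full windows plus 10 seconds padded to 30
audio = np.ones(70 * 16000, dtype=np.float32)
windows = list(iter_windows(audio))

print(len(windows))                      # 3
print([start for start, _ in windows])   # [0, 480000, 960000]
print({len(chunk) for _, chunk in windows})  # {480000}
```

Each yielded chunk is already the length the encoder expects, and the start offset lets the caller map any per-window timestamps back to the original recording.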
Metadata
Principle:Openai_Whisper_Audio_Padding_And_Trimming 2025-06-25 00:00 GMT