Implementation:Openai Whisper Pad Or Trim

Overview

whisper.pad_or_trim() standardizes the length of an audio waveform or mel spectrogram to a fixed size by either zero-padding or trimming. This ensures all inputs to the Whisper encoder have the expected temporal dimension.

Source

File: whisper/audio.py:L65-88
Repository: https://github.com/openai/whisper

Signature

def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):

Where N_SAMPLES = 480000 (30 seconds at 16kHz).
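
The two target lengths used on this page follow from a few audio constants. The sketch below re-derives them in plain Python rather than importing them. The hop length of 160 samples is an assumption not stated on this page (it is meant to match the HOP_LENGTH constant in whisper/audio.py), so check it against the installed version.

```python
# Re-derivation of the target lengths; not imported from whisper.
SAMPLE_RATE = 16000   # samples per second
CHUNK_LENGTH = 30     # seconds per encoder input
HOP_LENGTH = 160      # assumed samples between successive mel frames

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # waveform length: 480000 samples
N_FRAMES = N_SAMPLES // HOP_LENGTH      # mel spectrogram length: 3000 frames

print(N_SAMPLES, N_FRAMES)
```

Under these assumptions, a waveform padded to N_SAMPLES yields a spectrogram with N_FRAMES columns, which is why the two defaults are used together.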

Import

from whisper.audio import pad_or_trim
# or
import whisper  # re-exported as whisper.pad_or_trim

Parameters

Parameter	Type	Default	Description
array	Union[np.ndarray, torch.Tensor]	(required)	Audio waveform or mel spectrogram to pad or trim
length	int	480000	Target length along the specified axis. Use N_SAMPLES (480000) for waveforms or N_FRAMES (3000) for mel spectrograms
axis	int	-1	Dimension along which to pad or trim (keyword-only argument)

Inputs and Outputs

Inputs

An audio numpy array or torch Tensor of variable length (waveform or spectrogram)

Outputs

An array or tensor of exactly the specified length along the given axis, with the same type as the input

Behavior

The function handles both numpy arrays and torch Tensors with type-appropriate operations:

If the input is longer than length (trimming)

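For torch Tensors: uses array.index_select() with indices 0 to length - 1 along the specified axis
For numpy arrays: uses array.take() with the same indices along the specified axis
Returns a new array/tensor containing only the first length elements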

If the input is shorter than length (zero-padding)

Computes the pad width as length - array.shape[axis]
For torch Tensors: uses torch.nn.functional.pad() with zero-padding on the appropriate dimension
For numpy arrays: uses np.pad() with zero-padding on the appropriate axis
Returns the padded array/tensor

If the input is exactly the target length

Returns the input object unchanged: neither the trimming nor the padding branch is entered, so no copy is made
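
The three cases above can be sketched for NumPy input only. This is a minimal illustration, not the library's code: pad_or_trim_sketch is a hypothetical name, the torch branch is omitted, and ndarray.take is used here as one way to select the first length elements.

```python
import numpy as np


def pad_or_trim_sketch(array: np.ndarray, length: int, *, axis: int = -1) -> np.ndarray:
    """Hypothetical NumPy-only illustration of the trim / pad / pass-through cases."""
    if array.shape[axis] > length:
        # Trim: keep the first `length` entries along `axis`.
        array = array.take(indices=range(length), axis=axis)

    if array.shape[axis] < length:
        # Pad: append zeros at the end of `axis`; other axes are left alone.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)

    return array


short = np.ones(3, dtype=np.float32)
long = np.arange(6, dtype=np.float32)
exact = np.zeros(4, dtype=np.float32)

padded = pad_or_trim_sketch(short, 4)    # zeros appended at the end
trimmed = pad_or_trim_sketch(long, 4)    # first four elements kept
same = pad_or_trim_sketch(exact, 4)      # neither branch runs
```

Because axis is keyword-only, a spectrogram-shaped input would be handled as pad_or_trim_sketch(mel, 3000, axis=-1), leaving the mel-bin dimension untouched.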

Example

import whisper
from whisper.audio import N_FRAMES

# Pad or trim a raw audio waveform to 30 seconds
audio = whisper.load_audio("speech.mp3")
print(audio.shape)      # e.g., (352000,) for ~22 seconds of audio
audio = whisper.pad_or_trim(audio)
print(audio.shape)      # (480000,) — padded with zeros to 30 seconds

# Trim a long audio file
long_audio = whisper.load_audio("lecture.wav")
print(long_audio.shape) # e.g., (9600000,) for 10 minutes
trimmed = whisper.pad_or_trim(long_audio)
print(trimmed.shape)    # (480000,) — trimmed to first 30 seconds

# Works on mel spectrograms too
mel = whisper.log_mel_spectrogram(audio)
mel = whisper.pad_or_trim(mel, N_FRAMES)
print(mel.shape)        # torch.Size([80, 3000])

# Custom length for non-standard use
short_clip = whisper.pad_or_trim(audio, length=160000)  # exactly 10 seconds at 16 kHz
print(short_clip.shape) # (160000,)

Notes

The default length=480000 corresponds to exactly 30 seconds at 16kHz and should be used for raw waveforms
For mel spectrograms, use N_FRAMES=3000 as the length parameter
The function preserves the input type: numpy input produces numpy output, torch input produces torch output
Zero-padding appends silence at the end of the audio, preserving temporal alignment of the content at the start
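pad_or_trim keeps only the first length samples and discards the rest. To process longer audio with the low-level API, the caller is responsible for implementing a sliding window over 30-second chunks; the high-level transcribe() method handles this internally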
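
For long recordings, one simple way to apply the last note is to cut the waveform into fixed, non-overlapping 30-second windows and zero-pad the final one. The sketch below assumes a 1-D NumPy waveform at 16 kHz; iter_chunks is a hypothetical helper, and the fixed stride is a simplification rather than the strategy any particular Whisper entry point uses.

```python
import numpy as np

N_SAMPLES = 480000  # 30 seconds at 16 kHz, as stated above


def iter_chunks(audio: np.ndarray, chunk_len: int = N_SAMPLES):
    """Yield (start_sample, chunk) pairs; every chunk has exactly chunk_len samples."""
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Final window: append zeros (silence) so the length is fixed.
            chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
        yield start, chunk


# Synthetic 70-second waveform standing in for a loaded audio file.
audio = np.ones(70 * 16000, dtype=np.float32)
chunks = list(iter_chunks(audio))
starts = [start for start, _ in chunks]
print(starts, [chunk.shape for _, chunk in chunks])
```

In real use, the padding step could be delegated to whisper.pad_or_trim(chunk), and overlapping windows may be preferable when words straddle a chunk boundary.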

Metadata

Principle:Openai_Whisper_Audio_Padding_And_Trimming 2025-06-25 00:00 GMT
