Implementation:Openai Whisper Pad Or Trim

From Leeroopedia

Overview

whisper.pad_or_trim() standardizes the length of an audio waveform or mel spectrogram to a fixed size by either zero-padding or trimming. This ensures all inputs to the Whisper encoder have the expected temporal dimension.

Source

Signature

def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):

Where N_SAMPLES = 480000 (30 seconds at 16kHz).
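As a quick check of the arithmetic, the sketch below restates the relevant constants and derives both window sizes. The hop length of 160 samples is an assumption taken from the Whisper source rather than from this page.

```python
# Constants restated for illustration; they mirror the names used in whisper.audio.
SAMPLE_RATE = 16000   # samples per second
CHUNK_LENGTH = 30     # seconds per encoder window
HOP_LENGTH = 160      # samples between successive mel frames (assumed, see lead-in)

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # samples in one 30-second window
N_FRAMES = N_SAMPLES // HOP_LENGTH      # mel frames in one 30-second window

print(N_SAMPLES)  # 480000
print(N_FRAMES)   # 3000
```

This is why the same 30-second window is 480000 entries long for a waveform but 3000 entries long for a mel spectrogram.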

Import

from whisper.audio import pad_or_trim
# or
import whisper  # re-exported as whisper.pad_or_trim

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| array | Union[np.ndarray, torch.Tensor] | (required) | Audio waveform or mel spectrogram to pad or trim |
| length | int | 480000 | Target length along the specified axis. Use N_SAMPLES (480000) for waveforms or N_FRAMES (3000) for mel spectrograms |
| axis | int | -1 | Dimension along which to pad or trim (keyword-only argument) |
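The bare * in the signature is what makes axis keyword-only. The following sketch uses a hypothetical stand-in function (pad_or_trim_signature_demo, not part of Whisper) with the same parameter list, so the calling convention can be tried without installing the library.

```python
def pad_or_trim_signature_demo(array, length: int = 480000, *, axis: int = -1):
    """Hypothetical stand-in: same parameter list, only reports the resolved arguments."""
    return length, axis

batch = [[0.0] * 4, [0.0] * 4]  # placeholder data; the stand-in never inspects it

# axis has to be passed by name
print(pad_or_trim_signature_demo(batch, 3000, axis=0))  # (3000, 0)

# Passing it positionally is rejected by Python before the function body runs
positional_error = None
try:
    pad_or_trim_signature_demo(batch, 3000, 0)
except TypeError as exc:
    positional_error = exc
    print("TypeError:", exc)
```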

Inputs and Outputs

Inputs

  • An audio numpy array or torch Tensor of variable length (waveform or spectrogram)

Outputs

  • An array or tensor of exactly the specified length along the given axis, with the same type as the input

Behavior

The function handles both numpy arrays and torch Tensors with type-appropriate operations:

If the input is longer than length (trimming)

  1. Selects the first length elements along the specified axis (index_select() for torch Tensors, take() for numpy arrays)
  2. Returns the sliced array/tensor
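The trimming branch can be sketched for the numpy case as follows. trim_sketch is a hypothetical helper written for this page, not the library function, and it covers only the "longer than length" path.

```python
import numpy as np

def trim_sketch(array: np.ndarray, length: int, axis: int = -1) -> np.ndarray:
    """Hypothetical helper: keep the first `length` entries along `axis`."""
    if array.shape[axis] > length:
        array = array.take(indices=range(length), axis=axis)
    return array

# A toy "spectrogram": 2 mel bins by 5 frames
mel_like = np.arange(2 * 5, dtype=np.float32).reshape(2, 5)

trimmed = trim_sketch(mel_like, 3)
print(trimmed.shape)  # (2, 3)
print(trimmed[0])     # [0. 1. 2.]
```

Only the chosen axis changes size; every other dimension is passed through untouched.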

If the input is shorter than length (zero-padding)

  1. Computes the pad width as length - array.shape[axis]
  2. For torch Tensors: uses torch.nn.functional.pad() with zero-padding on the appropriate dimension
  3. For numpy arrays: uses np.pad() with zero-padding on the appropriate axis
  4. Returns the padded array/tensor
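A minimal sketch of the padding branch for numpy input is shown below. pad_sketch and to_torch_pad_arg are hypothetical helpers, not library functions. The second one illustrates a detail that is easy to get wrong in the torch case: torch.nn.functional.pad() expects a flat list of (before, after) amounts ordered from the last dimension backwards, whereas np.pad() takes one pair per axis in natural order.

```python
import numpy as np

def pad_sketch(array: np.ndarray, length: int, axis: int = -1) -> np.ndarray:
    """Hypothetical helper: append zeros along `axis` until it has `length` entries."""
    if array.shape[axis] < length:
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])  # pad only at the end
        array = np.pad(array, pad_widths)  # default mode is constant zeros
    return array

def to_torch_pad_arg(pad_widths):
    """Flatten per-axis (before, after) pairs into F.pad's last-dimension-first order."""
    return [pad for sizes in pad_widths[::-1] for pad in sizes]

wave = np.ones(4, dtype=np.float32)
print(pad_sketch(wave, 6))                 # [1. 1. 1. 1. 0. 0.]
print(to_torch_pad_arg([(0, 0), (0, 2)]))  # [0, 2, 0, 0]
```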

If the input is exactly the target length

  1. Neither the trimming nor the padding branch is entered, so the input is returned unchanged and no copy is made

Example

import whisper
from whisper.audio import N_FRAMES

# Pad or trim a raw audio waveform to 30 seconds
audio = whisper.load_audio("speech.mp3")
print(audio.shape)      # e.g., (352000,) for ~22 seconds of audio
audio = whisper.pad_or_trim(audio)
print(audio.shape)      # (480000,) — padded with zeros to 30 seconds

# Trim a long audio file
long_audio = whisper.load_audio("lecture.wav")
print(long_audio.shape) # e.g., (9600000,) for 10 minutes
trimmed = whisper.pad_or_trim(long_audio)
print(trimmed.shape)    # (480000,) — trimmed to first 30 seconds

# Works on mel spectrograms too
mel = whisper.log_mel_spectrogram(audio)
mel = whisper.pad_or_trim(mel, N_FRAMES)
print(mel.shape)        # torch.Size([80, 3000])

# Custom length for non-standard use
short_clip = whisper.pad_or_trim(audio, length=160000)  # 10 seconds at 16 kHz
print(short_clip.shape) # (160000,)

Notes

  • The default length=480000 corresponds to exactly 30 seconds at 16kHz and should be used for raw waveforms
  • For mel spectrograms, use N_FRAMES=3000 as the length parameter
  • The function preserves the input type: numpy input produces numpy output, torch input produces torch output
  • Zero-padding appends silence at the end of the audio, preserving temporal alignment of the content at the start
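  • pad_or_trim keeps only the first length samples and discards the rest, so code that calls it directly on long audio must implement its own sliding-window strategy over 30-second chunks (the higher-level whisper.transcribe() does this internally)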
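One simple way to window long audio is sketched below, assuming fixed, non-overlapping 30-second chunks with the last one zero-padded. iter_chunks is a hypothetical helper, and this is a simplification for illustration, not the strategy Whisper's own transcription loop uses.

```python
import numpy as np

N_SAMPLES = 480000  # 30 seconds at 16 kHz

def iter_chunks(audio: np.ndarray, chunk: int = N_SAMPLES):
    """Hypothetical helper: yield consecutive windows of `chunk` samples, zero-padding the last."""
    for start in range(0, len(audio), chunk):
        window = audio[start:start + chunk]
        if len(window) < chunk:
            window = np.pad(window, (0, chunk - len(window)))
        yield window

# A little over 60 seconds of placeholder audio
audio = np.ones(N_SAMPLES * 2 + 1000, dtype=np.float32)
chunks = list(iter_chunks(audio))
print(len(chunks))                # 3
print({c.shape for c in chunks})  # {(480000,)}
```

Fixed boundaries can cut a word in half, so a production pipeline would likely want overlapping windows or boundaries chosen from timestamps or silence.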

Metadata

Principle:Openai_Whisper_Audio_Padding_And_Trimming 2025-06-25 00:00 GMT
