Implementation:Openai Whisper Load Audio

From Leeroopedia

Overview

whisper.load_audio() decodes an audio file in any format that ffmpeg can read into a normalized mono float32 numpy array at a specified sample rate. It runs ffmpeg as a subprocess, which handles format decoding, resampling, and channel downmixing.

Source

Signature

def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

Where SAMPLE_RATE = 16000.

Import

from whisper.audio import load_audio
# or
import whisper  # re-exported as whisper.load_audio

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| file | str | (required) | Path to the audio file. Accepts any format that ffmpeg supports (MP3, WAV, FLAC, OGG, M4A, MP4, WebM, etc.) |
| sr | int | 16000 | Target sample rate in Hz. The audio is resampled to this rate. |

Inputs and Outputs

Inputs

  • An audio file path string pointing to a file in any format supported by ffmpeg

Outputs

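  • A numpy.ndarray of dtype float32 with shape (num_samples,), containing the mono waveform normalized to the range [-1.0, 1.0). Because 16-bit samples are divided by 32768, the largest possible value is 32767/32768, slightly below 1.0.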
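
To make the output contract concrete, the following minimal sketch inspects an array shaped like the one described above. The helper name describe_waveform is hypothetical and not part of Whisper, and a synthetic tone stands in for a decoded file so the snippet runs without ffmpeg.

```python
import numpy as np

SAMPLE_RATE = 16000  # default rate stated on this page


def describe_waveform(audio: np.ndarray, sr: int = SAMPLE_RATE) -> dict:
    """Hypothetical helper: summarize an array shaped like load_audio's output."""
    if audio.ndim != 1 or audio.dtype != np.float32:
        raise ValueError("expected a 1-D float32 array")
    return {
        "num_samples": int(audio.shape[0]),
        # Duration follows directly from the sample count and the sample rate.
        "duration_s": audio.shape[0] / sr,
        "peak": float(np.abs(audio).max()) if audio.size else 0.0,
    }


# Synthetic stand-in for a decoded file: 2 seconds of a 440 Hz tone at half scale.
t = np.arange(2 * SAMPLE_RATE, dtype=np.float32) / SAMPLE_RATE
tone = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
info = describe_waveform(tone)
```

The same helper could be applied to the result of whisper.load_audio(), provided the sr argument matches the rate used for loading.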

Behavior

  1. Spawns an ffmpeg subprocess with the following arguments:
    • -i {file} — reads the input file
    • -f s16le — outputs raw 16-bit signed little-endian PCM
    • -acodec pcm_s16le — uses PCM codec for output
    • -ac 1 — downmixes to mono (single channel)
    • -ar {sr} — resamples to the target sample rate
    • pipe:1 — streams output to stdout
  2. Reads the raw PCM bytes from the subprocess stdout
  3. Converts to numpy array using np.frombuffer(out, np.int16)
  4. Normalizes to float32 by converting type and dividing by 32768.0
  5. Raises RuntimeError if ffmpeg fails (non-zero return code), including stderr output in the error message
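
The steps above can be sketched roughly as follows. This is an illustrative reconstruction from the description on this page, not the library's source: the function names build_ffmpeg_cmd, pcm16_to_float32, and load_audio_sketch are hypothetical, and the exact wording of the error message is an assumption.

```python
import subprocess

import numpy as np


def build_ffmpeg_cmd(file: str, sr: int = 16000) -> list:
    """Assemble the ffmpeg arguments listed in step 1."""
    return [
        "ffmpeg",
        "-i", file,              # input file
        "-f", "s16le",           # raw 16-bit signed little-endian output
        "-acodec", "pcm_s16le",  # PCM codec
        "-ac", "1",              # downmix to mono
        "-ar", str(sr),          # resample to the target rate
        "pipe:1",                # write to stdout
    ]


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Steps 3 and 4: interpret bytes as int16 samples, then scale by 1/32768."""
    # "<i2" is explicit little-endian int16; it equals np.int16 on little-endian hosts.
    samples = np.frombuffer(raw, dtype=np.dtype("<i2"))
    return samples.astype(np.float32) / 32768.0


def load_audio_sketch(file: str, sr: int = 16000) -> np.ndarray:
    """Steps 1 to 5 combined; requires ffmpeg on the PATH."""
    try:
        out = subprocess.run(
            build_ffmpeg_cmd(file, sr), capture_output=True, check=True
        ).stdout
    except subprocess.CalledProcessError as e:
        # Assumed message format; the page only says stderr is included.
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    return pcm16_to_float32(out)
```

Splitting the byte conversion into its own function keeps the numeric part testable without ffmpeg installed.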

Example

import whisper

# Load audio from an MP3 file at default 16kHz
audio = whisper.load_audio("speech.mp3")
print(audio.shape)   # (num_samples,) e.g., (480000,) for 30s at 16kHz
print(audio.dtype)   # float32
print(audio.min(), audio.max())  # Values in [-1.0, 1.0]

# Load at a different sample rate
audio_8k = whisper.load_audio("speech.wav", sr=8000)
print(audio_8k.shape)  # Half as many samples as the same file loaded at 16kHz

# Works with any ffmpeg-supported format
audio = whisper.load_audio("recording.m4a")
audio = whisper.load_audio("video.mp4")      # Extracts audio track
audio = whisper.load_audio("podcast.ogg")

Notes

  • ffmpeg must be installed and accessible on the system PATH for this function to work
  • The function streams audio through a pipe, so it does not create temporary files
  • The entire decoded waveform is held in memory at once, which can be significant for very long recordings
  • If the input file does not exist or is not a valid audio file, ffmpeg will return a non-zero exit code, causing a RuntimeError
  • The default sample rate of 16000 Hz matches Whisper's training configuration and should not be changed when preparing audio for Whisper inference
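
Two of the notes above lend themselves to a quick pre-flight check before calling load_audio: whether ffmpeg is reachable on the PATH, and roughly how much memory the decoded waveform will occupy. The sketch below uses only the standard library; the helper names ffmpeg_available and estimate_waveform_bytes are hypothetical, and the estimate covers only the final float32 array, not any intermediate buffers.

```python
import shutil

BYTES_PER_FLOAT32 = 4


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


def estimate_waveform_bytes(duration_s: float, sr: int = 16000) -> int:
    """Approximate size of the mono float32 array for a given duration."""
    num_samples = int(duration_s * sr)
    return num_samples * BYTES_PER_FLOAT32


# One hour at 16 kHz: 57,600,000 samples, about 230.4 MB as float32.
one_hour_bytes = estimate_waveform_bytes(3600)
```

For recordings that are too long to hold comfortably in memory, one option is to split the file into shorter segments before loading; that step is outside the scope of load_audio itself.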

Metadata

Principle:Openai_Whisper_Audio_Loading
Environment:Openai_Whisper_FFmpeg
2025-06-25 00:00 GMT
