Implementation:Openai Whisper Load Audio

Overview

whisper.load_audio() decodes an audio file in any ffmpeg-supported format into a normalized mono float32 NumPy array at the requested sample rate. It runs ffmpeg as a subprocess, which handles format decoding, resampling, and channel downmixing.

Source

File: whisper/audio.py:L25-62
Repository: https://github.com/openai/whisper

Signature

def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:

Where SAMPLE_RATE = 16000.

Import

from whisper.audio import load_audio
# or
import whisper  # re-exported as whisper.load_audio

Parameters

Parameter	Type	Default	Description
file	str	(required)	Path to the audio file. Accepts any format that ffmpeg supports (MP3, WAV, FLAC, OGG, M4A, MP4, WebM, etc.)
sr	int	16000	Target sample rate in Hz. The audio is resampled to this rate.

Inputs and Outputs

Inputs

An audio file path string pointing to a file in any format supported by ffmpeg

Outputs

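A one-dimensional numpy.ndarray of dtype float32 with shape (num_samples,), containing the mono waveform. Samples are scaled by 1/32768, so values lie within [-1.0, 1.0]; the largest possible value is 32767/32768, just under 1.0.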
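
The output contract can be illustrated without ffmpeg. The following is a minimal sketch, assuming a handful of hand-written 16-bit samples (the `samples` and `raw` values are made up for illustration), that applies the int16 to float32 scaling described on this page.

```python
import numpy as np

# Hypothetical raw PCM payload: four little-endian signed 16-bit samples.
samples = np.array([-32768, -16384, 0, 32767], dtype="<i2")
raw = samples.tobytes()

# Reinterpret the bytes as int16, then scale into floating point.
waveform = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0

print(waveform.dtype)   # float32
print(waveform.shape)   # (4,)
print(waveform.min())   # -1.0
```

The most negative int16 value maps exactly to -1.0, while the most positive one maps to slightly less than 1.0.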

Behavior

Spawns an ffmpeg subprocess with the following arguments:
- -i {file}: reads the input file
- -f s16le: sets the output container to raw 16-bit signed little-endian PCM
- -acodec pcm_s16le: encodes the audio as 16-bit signed little-endian PCM
- -ac 1: downmixes to mono (single channel)
- -ar {sr}: resamples to the target sample rate
- pipe:1: streams the output to stdout
Reads the raw PCM bytes from the subprocess stdout
Converts to numpy array using np.frombuffer(out, np.int16)
Normalizes to float32 by converting type and dividing by 32768.0
Raises RuntimeError if ffmpeg fails (non-zero return code), including stderr output in the error message
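
The steps above can be sketched as follows. This is an illustrative re-creation built only from the behavior described on this page, not the library source; `build_ffmpeg_cmd` and `load_audio_sketch` are hypothetical names, and the exact argument order in the real implementation may differ.

```python
import subprocess
from typing import List

import numpy as np

SAMPLE_RATE = 16000  # default target rate named on this page


def build_ffmpeg_cmd(file: str, sr: int = SAMPLE_RATE) -> List[str]:
    """Hypothetical helper: assemble the ffmpeg arguments listed above."""
    return [
        "ffmpeg",
        "-i", file,              # input file
        "-f", "s16le",           # raw 16-bit little-endian output
        "-acodec", "pcm_s16le",  # PCM codec
        "-ac", "1",              # mono
        "-ar", str(sr),          # target sample rate
        "pipe:1",                # write to stdout
    ]


def load_audio_sketch(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Decode, downmix, resample, and normalize in one pass (sketch only)."""
    try:
        out = subprocess.run(
            build_ffmpeg_cmd(file, sr), capture_output=True, check=True
        ).stdout
    except subprocess.CalledProcessError as e:
        # Surface ffmpeg's stderr in the error message, as described above.
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    # Raw bytes -> int16 samples -> float32 scaled by 1/32768.
    return np.frombuffer(out, np.int16).astype(np.float32) / 32768.0
```

Running `load_audio_sketch` requires ffmpeg on the PATH; `build_ffmpeg_cmd` can be inspected on its own to see the command that would be executed.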

Example

import whisper

# Load audio from an MP3 file at default 16kHz
audio = whisper.load_audio("speech.mp3")
print(audio.shape)   # (num_samples,) e.g., (480000,) for 30s at 16kHz
print(audio.dtype)   # float32
print(audio.min(), audio.max())  # Values in [-1.0, 1.0]

# Load at a different sample rate
audio_8k = whisper.load_audio("speech.wav", sr=8000)
print(audio_8k.shape)  # Half as many samples as the same audio at 16 kHz

# Works with any ffmpeg-supported format
audio = whisper.load_audio("recording.m4a")
audio = whisper.load_audio("video.mp4")      # Extracts audio track
audio = whisper.load_audio("podcast.ogg")

Notes

ffmpeg must be installed and accessible on the system PATH for this function to work
The function streams audio through a pipe, so it does not create temporary files
For very large files, the entire decoded waveform is loaded into memory at once
If the input file does not exist or is not a valid audio file, ffmpeg will return a non-zero exit code, causing a RuntimeError
The default sample rate of 16000 Hz matches Whisper's training configuration and should not be changed when preparing audio for Whisper inference
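
To make the memory note concrete, the size of the returned array can be estimated from the duration and sample rate. This is a rough sketch, assuming one float32 value (4 bytes) per mono sample; `decoded_size_bytes` is a hypothetical helper, not part of the library.

```python
SAMPLE_RATE = 16000    # Hz, default target rate
BYTES_PER_SAMPLE = 4   # one float32 value per mono sample


def decoded_size_bytes(duration_s: float, sr: int = SAMPLE_RATE) -> int:
    """Hypothetical helper: approximate size of the returned array in bytes."""
    return int(duration_s * sr) * BYTES_PER_SAMPLE


print(decoded_size_bytes(30))     # 1920000 (about 1.9 MB for 30 s)
print(decoded_size_bytes(3600))   # 230400000 (about 230 MB for one hour)
```

Peak usage during decoding may be higher than this estimate, since the raw 16-bit bytes are also held in memory before conversion.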

Metadata

Principle:Openai_Whisper_Audio_Loading Environment:Openai_Whisper_FFmpeg 2025-06-25 00:00 GMT
