Implementation:Openai Whisper Load Audio
Overview
whisper.load_audio() decodes an audio file in any ffmpeg-supported format into a mono float32 NumPy array at a specified sample rate, with samples scaled to lie between -1.0 and 1.0. It runs ffmpeg as a subprocess, which handles format decoding, resampling, and channel downmixing.
Source
- File: whisper/audio.py:L25-62
- Repository: https://github.com/openai/whisper
Signature
def load_audio(file: str, sr: int = SAMPLE_RATE) -> np.ndarray:
Where SAMPLE_RATE = 16000.
Import
from whisper.audio import load_audio
# or
import whisper # re-exported as whisper.load_audio
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | str | (required) | Path to the audio file. Accepts any format that ffmpeg supports (MP3, WAV, FLAC, OGG, M4A, MP4, WebM, etc.) |
| sr | int | 16000 | Target sample rate in Hz. The audio is resampled to this rate. |
Inputs and Outputs
Inputs
- An audio file path string pointing to a file in any format supported by ffmpeg
Outputs
- A numpy.ndarray of dtype float32 with shape (num_samples,), containing the mono waveform normalized to the range [-1.0, 1.0). The upper bound is exclusive because the largest int16 value, 32767, maps to 32767/32768.
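The relationship between duration, sample rate, and the length of the returned array can be sketched as follows. The helper names `expected_num_samples` and `duration_seconds` are hypothetical and not part of Whisper; the counts are approximate, since ffmpeg's resampler may produce a slightly different number of samples.

```python
def expected_num_samples(duration_s: float, sr: int = 16000) -> int:
    """Approximate length of the mono waveform for a clip of the given duration."""
    return round(duration_s * sr)


def duration_seconds(num_samples: int, sr: int = 16000) -> float:
    """Duration in seconds of a mono waveform with num_samples samples."""
    return num_samples / sr


print(expected_num_samples(30))           # 480000
print(expected_num_samples(30, sr=8000))  # 240000
print(duration_seconds(480000))           # 30.0
```

With a real array, `duration_seconds(audio.shape[0], sr)` gives the clip length in seconds.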
Behavior
- Spawns an ffmpeg subprocess with the following arguments:
  - -i {file}: reads the input file
  - -f s16le: writes raw signed 16-bit little-endian samples
  - -acodec pcm_s16le: encodes the output as 16-bit PCM
  - -ac 1: downmixes to mono (single channel)
  - -ar {sr}: resamples to the target sample rate
  - pipe:1: streams the output to stdout
- Reads the raw PCM bytes from the subprocess stdout
- Converts to numpy array using np.frombuffer(out, np.int16)
- Normalizes to float32 by converting type and dividing by 32768.0
- Raises RuntimeError if ffmpeg fails (non-zero return code), including stderr output in the error message
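The steps above can be sketched as below. This is a minimal illustration built only from the flags and conversions listed in this section, not a copy of the Whisper source; `build_ffmpeg_cmd`, `pcm16_to_float32`, and `load_audio_sketch` are hypothetical names, and the error message text is an assumption. The demo at the end exercises only the normalization step, so it runs without ffmpeg.

```python
import struct
import subprocess

import numpy as np


def build_ffmpeg_cmd(file: str, sr: int = 16000) -> list:
    """Assemble an ffmpeg command that writes mono 16-bit PCM to stdout."""
    return [
        "ffmpeg",
        "-i", file,              # input file
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",  # 16-bit PCM codec
        "-ac", "1",              # downmix to mono
        "-ar", str(sr),          # resample to the target rate
        "pipe:1",                # write to stdout
    ]


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert raw little-endian int16 PCM bytes to float32 in [-1.0, 1.0)."""
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


def load_audio_sketch(file: str, sr: int = 16000) -> np.ndarray:
    """Hypothetical stand-in for load_audio: decode with ffmpeg, then normalize."""
    try:
        out = subprocess.run(
            build_ffmpeg_cmd(file, sr), capture_output=True, check=True
        ).stdout
    except subprocess.CalledProcessError as e:
        # A non-zero exit code is surfaced as RuntimeError with ffmpeg's stderr.
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    return pcm16_to_float32(out)


# Normalization demo on three hand-made samples (no ffmpeg required).
raw = struct.pack("<3h", -32768, 0, 16384)
samples = pcm16_to_float32(raw)
print(samples.tolist())  # [-1.0, 0.0, 0.5]
```

Dividing by 32768.0 rather than 32767 keeps the most negative int16 value at exactly -1.0, at the cost of the maximum positive value falling just short of 1.0.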
Example
import whisper
# Load audio from an MP3 file at default 16kHz
audio = whisper.load_audio("speech.mp3")
print(audio.shape) # (num_samples,) e.g., (480000,) for 30s at 16kHz
print(audio.dtype) # float32
print(audio.min(), audio.max()) # Values in [-1.0, 1.0]
# Load at a different sample rate
audio_8k = whisper.load_audio("speech.wav", sr=8000)
print(audio_8k.shape) # About half as many samples as the same file loaded at 16kHz
# Works with any ffmpeg-supported format
audio = whisper.load_audio("recording.m4a")
audio = whisper.load_audio("video.mp4") # Extracts audio track
audio = whisper.load_audio("podcast.ogg")
Notes
- ffmpeg must be installed and accessible on the system PATH for this function to work
- The function streams audio through a pipe, so it does not create temporary files
- For very large files, the entire decoded waveform is loaded into memory at once
- If the input file does not exist or is not a valid audio file, ffmpeg will return a non-zero exit code, causing a RuntimeError
- The default sample rate of 16000 Hz matches Whisper's training configuration and should not be changed when preparing audio for Whisper inference
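Two of these notes lend themselves to a quick check before loading: whether ffmpeg is reachable, and roughly how much memory the decoded waveform will need. The sketch below uses only the standard library; `ffmpeg_on_path` and `estimated_waveform_bytes` are hypothetical helpers, and the estimate covers only the final float32 array (4 bytes per sample), not any intermediate buffers. Note that when a process is started with `subprocess.run`, a missing executable raises FileNotFoundError instead of producing a non-zero exit code, so checking PATH up front can give a clearer message.

```python
import shutil


def ffmpeg_on_path() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def estimated_waveform_bytes(duration_s: float, sr: int = 16000) -> int:
    """Approximate size of the decoded mono float32 waveform in bytes."""
    return int(duration_s * sr) * 4  # 4 bytes per float32 sample


if not ffmpeg_on_path():
    print("ffmpeg not found on PATH; audio loading would fail")

# One hour of audio at the default 16 kHz rate.
one_hour = estimated_waveform_bytes(3600)
print(f"{one_hour / 1e6:.1f} MB")  # 230.4 MB
```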
Metadata
Principle:Openai_Whisper_Audio_Loading Environment:Openai_Whisper_FFmpeg 2025-06-25 00:00 GMT