Workflow:Openai Whisper Audio Transcription

Knowledge Sources	OpenAI Whisper; Robust Speech Recognition via Large-Scale Weak Supervision; Introducing Whisper
Domains	Speech_Recognition, Audio_Processing, NLP
Last Updated	2025-06-25 00:00 GMT

Overview

End-to-end process for transcribing audio files into text using OpenAI Whisper, supporting multilingual speech recognition, English translation, and multiple output formats.

Description

This workflow covers the standard procedure for converting speech audio into text using the Whisper automatic speech recognition (ASR) model. It handles the full pipeline from model selection and loading through audio ingestion, language detection, sliding-window decoding with temperature fallback, segment boundary detection, and final output formatting. The workflow supports both the high-level Python API (model.transcribe()) and the command-line interface (whisper CLI), which share the same underlying transcription engine.

Key capabilities:

Multilingual speech recognition across 99+ languages
Speech-to-English translation (X→English)
Automatic language detection
Multiple output formats: TXT, VTT, SRT, TSV, JSON
Temperature fallback for robust decoding
Configurable beam search or greedy sampling

Usage

Execute this workflow when you have one or more audio files (in any format supported by ffmpeg) and need to produce a text transcription. This is the primary use case for Whisper and applies to both single-file Python API usage and batch CLI processing. Choose this workflow for standard transcription needs; use the Word-Level Timestamps workflow if you need per-word timing, or the Language Detection and Decoding workflow if you need low-level control over individual audio segments.
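
As a rough sketch of the two entry points mentioned above, the helpers below wrap model.transcribe() for a single file and build an argv list for a batch run of the whisper CLI. The helper names (transcribe_file, build_cli_command), the default model name, and the "out" directory are illustrative assumptions, not part of Whisper. The whisper import is deferred so the command builder works without the package installed.

```python
from typing import List, Optional


def transcribe_file(path: str, model_name: str = "turbo",
                    language: Optional[str] = None,
                    task: str = "transcribe") -> dict:
    """Transcribe one file with the Python API (needs openai-whisper and ffmpeg)."""
    import whisper  # deferred import: only required when actually transcribing

    model = whisper.load_model(model_name)
    # language=None lets a multilingual model detect the language itself.
    return model.transcribe(path, language=language, task=task)


def build_cli_command(paths: List[str], model_name: str = "turbo",
                      output_format: str = "srt", output_dir: str = "out",
                      task: str = "transcribe") -> List[str]:
    """Build an argv list for processing several files with the whisper CLI."""
    return ["whisper", *paths,
            "--model", model_name,
            "--task", task,
            "--output_format", output_format,
            "--output_dir", output_dir]


if __name__ == "__main__":
    # Print the command instead of running it, so no model download is triggered.
    print(" ".join(build_cli_command(["a.mp3", "b.wav"])))
```

The dictionary returned by model.transcribe() carries the full text, the segment list, and the detected language under the keys "text", "segments", and "language".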

Execution Steps

Step 1: Model Selection and Loading

Select an appropriate Whisper model variant based on accuracy requirements and available hardware. The loader downloads the model checkpoint (if not cached), verifies its SHA256 integrity, deserializes the weights, instantiates the encoder-decoder transformer architecture with the correct dimensions, and moves the model to the target device (CPU or CUDA).

Key considerations:

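Six model sizes available (tiny, base, small, medium, large, turbo), with large released as large-v1, large-v2, and large-v3
English-only variants (tiny.en, base.en, small.en, medium.en) tend to perform better on English audio, most noticeably at the tiny and base sizes
The turbo model is an optimized large-v3 that runs roughly 8x faster than large with a small loss in accuracy; it is not trained for translation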
Models are cached in ~/.cache/whisper by default
Alignment heads metadata is loaded for models that support word timestamps
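
The cache lookup and integrity check described above can be sketched with the standard library alone. This is a minimal illustration, not the loader itself: the function names are hypothetical, and the expected digest is assumed to be supplied by the caller.

```python
import hashlib
import os
from pathlib import Path


def default_cache_dir() -> Path:
    """Return ~/.cache/whisper, honoring XDG_CACHE_HOME when it is set."""
    base = os.getenv("XDG_CACHE_HOME",
                     os.path.join(os.path.expanduser("~"), ".cache"))
    return Path(base) / "whisper"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_checkpoint(path: Path, expected_sha256: str) -> bool:
    """True when the file exists and its SHA256 matches the expected digest."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

A loader built on this would re-download the checkpoint whenever is_valid_checkpoint returns False, then deserialize the weights only after the digest matches.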

Step 2: Audio Ingestion and Preprocessing

Load the audio file using ffmpeg, which decodes and resamples it to mono 16 kHz PCM; the samples are then scaled to a float32 waveform in the range [-1, 1]. Thirty seconds of silence padding is appended to the waveform so that the final chunk fills a complete window. Then compute the log-Mel spectrogram, which transforms the raw audio signal into the time-frequency representation expected by the encoder.

What happens:

ffmpeg decodes any audio format it supports into raw PCM (16-bit signed, mono, 16 kHz)
Short-Time Fourier Transform (STFT) with a 400-sample window and 160-sample hop (25 ms and 10 ms at 16 kHz)
Mel filterbank projection (80 or 128 bands depending on model)
Log scaling (log10), clamping to within 8 of the maximum, and normalization to produce the final spectrogram
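
The framing arithmetic and the final scaling step can be checked with a small standard-library sketch. It omits the STFT and Mel projection, and the helper names are illustrative; the ffmpeg arguments shown are one way to request 16-bit mono PCM at 16 kHz on stdout, assumed here rather than taken from this page.

```python
import math
import struct
from typing import List

SAMPLE_RATE = 16_000                      # samples per second
N_FFT = 400                               # STFT window: 25 ms
HOP_LENGTH = 160                          # STFT hop: 10 ms
CHUNK_LENGTH = 30                         # seconds per window
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE    # 480000 samples per window
N_FRAMES = N_SAMPLES // HOP_LENGTH        # 3000 spectrogram frames per window


def ffmpeg_command(path: str, sample_rate: int = SAMPLE_RATE) -> List[str]:
    """Arguments asking ffmpeg for raw 16-bit signed mono PCM on stdout."""
    return ["ffmpeg", "-nostdin", "-i", path, "-f", "s16le", "-ac", "1",
            "-acodec", "pcm_s16le", "-ar", str(sample_rate), "-"]


def pcm16_to_float(raw: bytes) -> List[float]:
    """Convert little-endian int16 PCM bytes to floats in [-1, 1)."""
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    return [s / 32768.0 for s in samples]


def log_normalize(power: List[float]) -> List[float]:
    """log10, clamp to within 8 of the maximum, then shift and scale."""
    log_spec = [math.log10(max(p, 1e-10)) for p in power]
    floor = max(log_spec) - 8.0
    return [(max(v, floor) + 4.0) / 4.0 for v in log_spec]
```

With these constants one 30-second window always yields 3000 frames, which is why shorter final chunks need the silence padding.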

Step 3: Language Detection

If no language is specified, detect the spoken language by encoding the first 30-second mel segment through the audio encoder and examining the language token probabilities from the decoder. For English-only models, this step is skipped and English is assumed.

Key considerations:

Language detection uses only the first 30 seconds of audio
The detected language sets the tokenizer configuration for the entire file
Multilingual models support 99+ languages
The task parameter determines whether to transcribe (X→X) or translate (X→English)
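
The decision logic of this step can be sketched as follows. The logits dictionary stands in for the decoder's scores over language tokens after one decoding step; its values, and the function names, are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def detect_language(language_logits: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Softmax over language-token logits; return the top language and all probabilities."""
    peak = max(language_logits.values())
    # Subtract the peak before exponentiating for numerical stability.
    exps = {lang: math.exp(v - peak) for lang, v in language_logits.items()}
    total = sum(exps.values())
    probs = {lang: e / total for lang, e in exps.items()}
    return max(probs, key=probs.get), probs


def resolve_language(requested: Optional[str], is_multilingual: bool,
                     language_logits: Dict[str, float]) -> str:
    """Use an explicit language if given, assume English for English-only models, else detect."""
    if requested is not None:
        return requested
    if not is_multilingual:
        return "en"
    return detect_language(language_logits)[0]
```

Because only the first 30 seconds feed this decision, a file that opens in one language and continues in another is better handled by passing the language explicitly.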

Step 4: Sliding Window Decoding

Process the audio through a sliding 30-second window. For each window, extract the mel segment, pad or trim to the expected size, and run autoregressive sequence-to-sequence decoding. Use temperature fallback: attempt decoding at temperature 0 first, and if quality thresholds are not met (compression ratio above 2.4 or average log probability below -1.0 by default), retry at progressively higher temperatures (by default 0.2 through 1.0 in steps of 0.2).

What happens:

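Each 30-second window is decoded in its own pass, conditioned on text from previous windows
Timestamp tokens in the output determine segment boundaries within each window
No-speech detection skips a window when its no_speech probability exceeds the threshold and its average log probability is also below the log probability threshold
Previous output serves as prompt for the next window (configurable via condition_on_previous_text)
At T=0 decoding is greedy, or beam search when beam_size is set; at T>0 tokens are sampled, optionally keeping the best of best_of candidates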
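
A minimal sketch of the temperature fallback loop, assuming a caller-supplied decode_fn that decodes one window at a given temperature. DecodeResult and decode_with_fallback are illustrative names for this sketch; the thresholds are passed in as parameters, and the compression ratio is computed here with zlib as a simple repetition measure.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class DecodeResult:
    text: str
    avg_logprob: float
    no_speech_prob: float
    temperature: float


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, giving a large ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], DecodeResult],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
) -> Optional[DecodeResult]:
    """Try each temperature in order and stop at the first acceptable result."""
    result = None
    for t in temperatures:
        result = decode_fn(t)
        too_repetitive = compression_ratio(result.text) > compression_ratio_threshold
        too_unlikely = result.avg_logprob < logprob_threshold
        # A window that looks silent is accepted as-is; retrying will not help.
        looks_silent = result.no_speech_prob > no_speech_threshold and too_unlikely
        if looks_silent or not (too_repetitive or too_unlikely):
            break
    return result
```

If every temperature fails the checks, the sketch returns the last attempt rather than raising, so the caller still gets a result for the window.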

Step 5: Segment Assembly

Assemble decoded segments from all windows into the final transcript. Parse timestamp tokens to determine precise start/end times for each segment. Filter out empty or instantaneous segments. Concatenate all segment texts and tokens into the complete transcription result.

Key considerations:

Consecutive timestamp tokens indicate segment boundaries within a window
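The seek pointer advances to the last decoded timestamp, or by a full window when the output ends on a single timestamp or contains no consecutive timestamp pair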
Segments include metadata: text, tokens, temperature, log probability, compression ratio, no-speech probability
The final result contains the full text, segment list, and detected language
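
The boundary rule above can be illustrated with a simplified sketch that splits one window's tokens at consecutive timestamp pairs. TIMESTAMP_BEGIN is an illustrative token id (the real value depends on the tokenizer), each timestamp step is assumed to be 0.02 s, and the case of a window with no consecutive timestamps is left out.

```python
from typing import Dict, List

TIMESTAMP_BEGIN = 50364   # illustrative id of the 0.00 s timestamp token
TIME_PRECISION = 0.02     # seconds represented by one timestamp step


def split_segments(tokens: List[int], time_offset: float = 0.0) -> List[Dict]:
    """Split a window's tokens into segments at consecutive timestamp tokens."""
    is_ts = [t >= TIMESTAMP_BEGIN for t in tokens]
    # A cut falls between two adjacent timestamp tokens (end of one segment, start of the next).
    cuts = [i + 1 for i in range(len(tokens) - 1) if is_ts[i] and is_ts[i + 1]]
    if is_ts[-2:] == [False, True]:
        cuts.append(len(tokens))   # window ends on a single closing timestamp
    segments = []
    last = 0
    for cut in cuts:
        chunk = tokens[last:cut]
        start = time_offset + (chunk[0] - TIMESTAMP_BEGIN) * TIME_PRECISION
        end = time_offset + (chunk[-1] - TIMESTAMP_BEGIN) * TIME_PRECISION
        text_tokens = [t for t in chunk if t < TIMESTAMP_BEGIN]
        # Drop empty or instantaneous segments.
        if text_tokens and end > start:
            segments.append({"start": round(start, 2), "end": round(end, 2),
                             "tokens": text_tokens})
        last = cut
    return segments
```

For example, a window at offset 30.0 whose tokens encode the timestamps 0.00, 2.00, 2.00, and 5.00 around two runs of text yields segments spanning 30.0 to 32.0 and 32.0 to 35.0.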

Step 6: Output Formatting

Format the transcription result into the desired output format(s). The CLI writes files to disk while the Python API returns a dictionary. Five output formats are supported, each with its own writer class.

Available formats:

TXT: Plain text, one segment per line
VTT: WebVTT subtitles with timestamps
SRT: SubRip subtitles with sequence numbers
TSV: Tab-separated values (start/end in milliseconds)
JSON: Full result including all metadata
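
To make the format differences concrete, here is a small sketch that renders a segment list as SRT and TSV. These are stand-alone illustrative functions, not Whisper's writer classes; segments are assumed to be dictionaries with "start", "end" (seconds), and "text" keys.

```python
from typing import Dict, List


def fmt_ts(seconds: float, decimal_marker: str = ",") -> str:
    """Format seconds as HH:MM:SS<marker>mmm (SRT uses a comma, VTT a period)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Numbered cues, each followed by a blank line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(f"{index}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n"
                    f"{seg['text'].strip()}\n")
    return "\n".join(cues)


def to_tsv(segments: List[Dict]) -> str:
    """Header row, then start and end as integer milliseconds."""
    rows = ["start\tend\ttext"]
    for seg in segments:
        text = seg["text"].strip().replace("\t", " ")
        rows.append(f"{round(1000 * seg['start'])}\t{round(1000 * seg['end'])}\t{text}")
    return "\n".join(rows) + "\n"
```

Tabs inside segment text are replaced with spaces in the TSV sketch so that each row keeps exactly three columns.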
