Principle:Openai Whisper Full Transcription
Overview
Full Transcription is the end-to-end orchestration principle that coordinates all stages of the Whisper speech recognition pipeline. It combines audio preprocessing, language detection, sliding window decoding with temperature fallback, segment assembly, and optional word-level timestamp extraction into a single high-level operation that takes an audio file and produces a complete, time-aligned transcript.
This is the top-level principle in the Whisper pipeline hierarchy, composing several lower-level principles into a unified workflow.
Pipeline Stages
The full transcription pipeline proceeds through the following stages:
Stage 1: Audio Preprocessing
Raw audio (from a file path, NumPy array, or PyTorch tensor) is converted into a mel spectrogram:
- Load audio: Decode the audio file to a 16 kHz mono waveform using FFmpeg.
- Compute mel spectrogram: Apply a Short-Time Fourier Transform (STFT) with a Hann window, then map to 80 mel frequency bands (128 for the large-v3 model) and apply log scaling.
The resulting mel spectrogram has shape (80, N), where N grows with the audio duration at 100 frames per second (one frame per 10 ms hop).
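The frame arithmetic above can be checked with a small sketch. The constant names are chosen here for illustration; the 160-sample hop is an assumption that follows from 16 kHz audio producing 100 frames per second.

```python
# Constants consistent with the text: 16 kHz audio, 100 frames per second.
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples between successive STFT frames (10 ms)
CHUNK_SECONDS = 30     # length of one decoding window in seconds

FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH          # 100
FRAMES_PER_CHUNK = CHUNK_SECONDS * FRAMES_PER_SECOND   # 3000


def mel_frame_count(duration_seconds: float) -> int:
    """Approximate number of mel frames N for a clip of the given duration."""
    return int(duration_seconds * SAMPLE_RATE) // HOP_LENGTH
```

Under these assumptions a 30-second window corresponds to 3000 frames, so a mel segment for one window has shape (80, 3000).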
Stage 2: Language Detection
If no language is specified, the first 30-second segment is used to detect the spoken language. The encoder processes the mel spectrogram, and the decoder examines the probability distribution over language tokens to identify the most likely language. This detection occurs once and applies to the entire audio file.
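The final step of detection, choosing the most likely language from the distribution over language tokens, can be sketched as follows. The helper name `pick_language` and the probabilities are illustrative assumptions, not model output.

```python
def pick_language(language_probs: dict[str, float]) -> tuple[str, float]:
    """Return the most probable language code and its probability."""
    if not language_probs:
        raise ValueError("language_probs must not be empty")
    code = max(language_probs, key=language_probs.get)
    return code, language_probs[code]


# Illustrative numbers only; a real distribution comes from the decoder.
example_probs = {"en": 0.91, "de": 0.05, "nl": 0.04}
detected, confidence = pick_language(example_probs)
```

Because detection runs once on the first window, a recording that opens in one language and continues in another would be decoded with the first language throughout.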
Stage 3: Sliding Window Decoding
The mel spectrogram is processed in sequential 30-second windows:
- Extract a 30-second mel segment starting at the current seek position.
- Decode the segment using the configured strategy (greedy, sampling, or beam search).
- Apply temperature fallback: if the result's compression ratio or average log probability fails its threshold, retry at progressively higher temperatures.
- Parse timestamp tokens to determine segment boundaries.
- Advance the seek position based on the decoded timestamps.
- Repeat until the entire audio has been processed.
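The loop above can be sketched with a stand-in decoder. Everything here is a simplified assumption: `decode_window` is a hypothetical callable that returns the decoded text and the last timestamp (in seconds, relative to the window start), and temperature fallback is left out.

```python
from typing import Callable, List, Optional, Tuple

FRAMES_PER_SECOND = 100                      # mel frames per second of audio
FRAMES_PER_WINDOW = 30 * FRAMES_PER_SECOND   # one 30-second window

# Hypothetical decoder signature: (seek, window_frames) -> (text, last_timestamp).
# A last_timestamp of None means no usable timestamp was decoded.
DecodeFn = Callable[[int, int], Tuple[str, Optional[float]]]


def sliding_window_transcribe(
    total_frames: int, decode_window: DecodeFn
) -> List[Tuple[float, float, str]]:
    """Walk the spectrogram window by window, advancing seek by decoded time."""
    seek = 0
    segments = []
    while seek < total_frames:
        window = min(FRAMES_PER_WINDOW, total_frames - seek)
        text, last_timestamp = decode_window(seek, window)
        if last_timestamp is None:
            advance = window  # no timestamp: skip the whole window
        else:
            advance = round(last_timestamp * FRAMES_PER_SECOND)
        # Always make progress, and never step past the current window.
        advance = max(1, min(advance, window))
        start = seek / FRAMES_PER_SECOND
        end = (seek + advance) / FRAMES_PER_SECOND
        segments.append((start, end, text))
        seek += advance
    return segments
```

Advancing by the decoded timestamp rather than a fixed 30 seconds lets the next window start where the last complete segment ended, so speech that straddles a window boundary is not cut in half.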
Stage 4: Cross-Segment Conditioning
When condition_on_previous_text is enabled (the default), the decoded text from the previous segment is provided as a prompt to the current segment's decoder. This maintains consistency across segment boundaries by:
- Preserving spelling of proper nouns and technical terms
- Maintaining stylistic consistency
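- Keeping wording coherent across segment transitions
The trade-off is that an error in one segment can carry into the next: disabling this option may make the text less consistent across windows, but it makes the model less prone to failure loops such as repeated text.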
An initial_prompt can also be provided to prime the first segment with domain-specific vocabulary or formatting preferences. When carry_initial_prompt is enabled, this initial prompt is prepended to the context for every segment, not just the first.
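The prompt selection described above can be sketched at the text level. This is a deliberately simplified assumption: the hypothetical `build_prompt` helper works on strings rather than tokens and ignores the decoder's context-length limit.

```python
from typing import List, Optional


def build_prompt(
    initial_prompt: Optional[str],
    previous_texts: List[str],
    condition_on_previous_text: bool = True,
    carry_initial_prompt: bool = False,
) -> str:
    """Assemble the text context handed to the decoder for the next segment."""
    is_first_segment = not previous_texts
    parts = []
    # The initial prompt primes the first segment, or every segment if carried.
    if initial_prompt and (is_first_segment or carry_initial_prompt):
        parts.append(initial_prompt)
    # The previous segment's text follows when conditioning is enabled.
    if condition_on_previous_text and previous_texts:
        parts.append(previous_texts[-1])
    return " ".join(parts)
```

For example, `build_prompt("Glossary: DTW, STFT.", ["First segment."], carry_initial_prompt=True)` yields the glossary followed by the previous text, while the same call without `carry_initial_prompt` yields only the previous text.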
Stage 5: Quality Filtering
Each decoded segment is evaluated against quality thresholds:
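- Compression ratio threshold (default 2.4): The ratio of the decoded text's byte length to its compressed length. A ratio above the threshold means the text compresses unusually well, which is typical of repetitive or looping output.
- Log probability threshold (default -1.0): The average token log probability. A value below the threshold means the model was very uncertain, so the text is likely incorrect.
- No-speech threshold (default 0.6): If the no-speech probability exceeds this value and the average log probability is also below the log probability threshold, the segment is treated as silence.
Failing the compression ratio or log probability check triggers temperature fallback; a segment that meets the no-speech condition is skipped as silence.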
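The three checks can be sketched with the standard library. The compression ratio here uses zlib on the UTF-8 bytes of the text; `classify_segment` and its return labels are illustrative names, not part of any library API.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw byte length to zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def classify_segment(
    text: str,
    avg_logprob: float,
    no_speech_prob: float,
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
) -> str:
    """Return 'no_speech', 'retry', or 'accept' for one decoded segment."""
    low_confidence = avg_logprob < logprob_threshold
    # Silence: the model signals no speech AND is unsure about its own text.
    if no_speech_prob > no_speech_threshold and low_confidence:
        return "no_speech"
    # Repetitive or uncertain output is a candidate for temperature fallback.
    if compression_ratio(text) > compression_ratio_threshold or low_confidence:
        return "retry"
    return "accept"
```

A looping output such as `"la " * 200` compresses far better than ordinary prose, so its ratio lands well above 2.4 and the segment is flagged for a retry.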
Stage 6: Word-Level Timestamps (Optional)
When word_timestamps is enabled, Dynamic Time Warping (DTW) is applied to the cross-attention weights to align individual words with specific time positions within each segment. This provides finer-grained timing than the segment-level timestamps from the decoder's timestamp tokens.
Word-level timestamps include punctuation handling:
- Prepended punctuation (e.g., opening quotes, brackets) is merged into the following word and takes that word's timing.
- Appended punctuation (e.g., periods, commas, closing quotes) is merged into the preceding word and takes that word's timing.
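The punctuation rules can be sketched on a list of (text, start, end) tuples. This is a simplified illustration: the two character sets are kept disjoint here (straight quotes, which can open or close, are left out), and the DTW alignment that produces the input timings is not shown.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)

# Illustrative, disjoint punctuation sets (an assumption of this sketch).
PREPENDED = "“¿([{"
APPENDED = ".,!?:”)]}"


def merge_punctuations(words: List[Word]) -> List[Word]:
    """Fold standalone punctuation into neighbouring words, keeping word timing."""
    merged: List[Word] = []
    prefix = ""
    for text, start, end in words:
        if text and all(ch in PREPENDED for ch in text):
            prefix += text  # hold until the next real word arrives
        elif text and all(ch in APPENDED for ch in text) and merged:
            prev_text, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_text + text, prev_start, prev_end)
        else:
            merged.append((prefix + text, start, end))
            prefix = ""
    return merged
```

In both directions the punctuation mark gives up its own timing, so a word's start and end reflect only the spoken word itself.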
Stage 7: Result Assembly
All segments from all windows are assembled into a final result dictionary containing:
- text — The complete transcript as a single string
- segments — A list of segment dictionaries with timing, text, tokens, and metadata
- language — The detected or specified language code
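The shape of the result can be sketched as below. The `assemble_result` helper is hypothetical, the segment dictionaries show only a few of their fields, and joining the segment texts is a simplification of how the final string is produced.

```python
from typing import Any, Dict, List


def assemble_result(segments: List[Dict[str, Any]], language: str) -> Dict[str, Any]:
    """Combine per-window segments into one result dictionary."""
    return {
        "text": "".join(seg["text"] for seg in segments).strip(),
        "segments": segments,
        "language": language,
    }


# Illustrative segments; real ones also carry tokens and quality metadata.
example_segments = [
    {"id": 0, "start": 0.0, "end": 2.4, "text": " Hello there."},
    {"id": 1, "start": 2.4, "end": 4.1, "text": " How are you?"},
]
result = assemble_result(example_segments, "en")
```

Keeping the per-segment list alongside the flat text lets downstream consumers build subtitles from the timings without re-running the model.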
Temperature Fallback Strategy
The temperature fallback is a key robustness mechanism. The temperature parameter accepts a tuple of temperatures to try in order (default: (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)). For each 30-second window:
- Decode at the first (lowest) temperature.
- If the compression ratio exceeds the threshold or the average log probability is below the threshold, discard the result and try the next temperature.
- Accept the first result that passes both quality checks.
- If all temperatures fail, use the result from the last temperature.
This approach starts with deterministic, high-confidence decoding and falls back to increasingly random sampling only when needed.
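The retry logic can be sketched with a stand-in decoder. `DecodeResult` and the `decode` callable are hypothetical names for this illustration; the no-speech exception to fallback is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DecodeResult:
    text: str
    avg_logprob: float
    compression_ratio: float
    temperature: float


def decode_with_fallback(
    decode: Callable[[float], DecodeResult],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> DecodeResult:
    """Try each temperature in order; keep the first result that passes."""
    if not temperatures:
        raise ValueError("temperatures must not be empty")
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_uncertain = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # first passing result wins
    # If every temperature failed, this is the last attempt's result.
    return result
```

Passing a single-element sequence such as `(0.0,)` disables fallback in this sketch, since there is no higher temperature to retry with.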
Clip Timestamps
The clip_timestamps parameter restricts processing to specific time ranges within the audio, given as start,end pairs in seconds, rather than the entire file. This is useful for:
- Re-transcribing specific portions of a long recording
- Processing pre-segmented audio with known time boundaries
- Skipping known non-speech regions
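Turning clip timestamps into frame ranges can be sketched as follows. This assumes the value is a comma-separated string or a list of start,end,start,end,... times in seconds, and that a missing final end means "until the end of the audio"; the helper name `parse_clip_timestamps` is illustrative.

```python
from typing import List, Sequence, Tuple, Union

FRAMES_PER_SECOND = 100  # mel frames per second of audio


def parse_clip_timestamps(
    clip_timestamps: Union[str, Sequence[float]], total_frames: int
) -> List[Tuple[int, int]]:
    """Convert start,end,... times in seconds into (start, end) frame ranges."""
    if isinstance(clip_timestamps, str):
        values = [float(ts) for ts in clip_timestamps.split(",") if ts.strip()]
    else:
        values = [float(ts) for ts in clip_timestamps]
    seek_points = [round(ts * FRAMES_PER_SECOND) for ts in values]
    if not seek_points:
        seek_points.append(0)             # no clips given: start at the beginning
    if len(seek_points) % 2 == 1:
        seek_points.append(total_frames)  # open-ended last clip runs to the end
    return list(zip(seek_points[::2], seek_points[1::2]))
```

With a 60-second file (6000 frames), the string `"10,20,45"` describes one clip from 10 s to 20 s and a second clip from 45 s to the end of the audio.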
Hallucination Silence Threshold
The hallucination_silence_threshold parameter (in seconds) helps suppress text hallucinated over silent audio. It takes effect only when word_timestamps is enabled: when a possible hallucination is detected, silent periods longer than the threshold are skipped instead of being transcribed.
References
- Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356