Principle:Openai Whisper Sliding Window Decoding
Overview
Sliding Window Decoding is the strategy Whisper uses to transcribe audio longer than 30 seconds. The Whisper encoder accepts a fixed-length input of exactly 30 seconds of audio (3000 mel spectrogram frames), a limit set by the size of its positional embeddings. Longer recordings are therefore processed as a sequence of 30-second windows, each decoded in a separate pass that can optionally be conditioned on text from earlier windows.
This principle describes how the sliding window advances through the audio, how context is maintained across segments, and how the temperature fallback strategy handles decoding failures.
The 30-Second Constraint
The Whisper encoder adds positional embeddings of a fixed size to its input. In the reference implementation these are sinusoidal and cover 1500 positions, which, after the encoder's stride-2 convolution, corresponds to 3000 mel frames (30 seconds at 100 frames per second). Audio shorter than 30 seconds is zero-padded to fill the window; audio longer than 30 seconds cannot be processed in a single pass.
This architectural constraint makes sliding window processing a necessity, not an optimization choice.
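The fixed window length can be illustrated with a minimal plain-Python sketch. The helper name pad_or_trim_frames and the list-of-frames representation are assumptions made for illustration; they are not the library's own API, which operates on arrays.

```python
N_FRAMES = 3000  # 30 seconds at 100 mel frames per second


def pad_or_trim_frames(frames, n_frames=N_FRAMES, pad_value=0.0):
    """Return exactly n_frames frames (hypothetical helper).

    `frames` is a list of mel frames, each a list of floats.
    Extra frames are dropped; missing frames are filled with pad_value.
    """
    if len(frames) >= n_frames:
        return frames[:n_frames]
    n_mels = len(frames[0]) if frames else 0
    missing = n_frames - len(frames)
    padding = [[pad_value] * n_mels for _ in range(missing)]
    return frames + padding


# A 12-second clip (1200 frames) is padded up to the full window.
short_clip = [[1.0, 2.0] for _ in range(1200)]
window = pad_or_trim_frames(short_clip)
```

Whatever the input length, the encoder always sees a window of the same size, which is why longer audio has to be fed in window by window.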
Window Advancement
The sliding window uses a seek pointer that tracks the current position in the audio (measured in mel frames). At each iteration:
- A 30-second (3000-frame) mel spectrogram is extracted starting at the current seek position.
- The segment is decoded to produce text tokens with embedded timestamps.
- The seek pointer advances based on the timestamp information in the decoded output:
  - If timestamp tokens are present, the seek advances to the position of the last decoded timestamp.
  - If no timestamps are produced, the seek advances by the full 30-second window.
Because advancement follows the decoded timestamps, the window does not always move by exactly 30 seconds. If the decoder produces timestamps covering only the first 15 seconds of a window, the seek advances by 15 seconds and the next window starts from that point. Audio after the last timestamp is decoded again in the next window, which avoids cutting words at window boundaries.
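The advancement rule can be sketched as a small loop. This is an illustrative sketch rather than the library's code: decode_window is a hypothetical callback that returns the position of the last timestamp in a window, counted in 0.02-second timestamp steps, or None when no timestamps were produced. It assumes one timestamp step spans two mel frames.

```python
N_FRAMES = 3000           # mel frames in one 30-second window
FRAMES_PER_TIMESTAMP = 2  # assumption: a 0.02 s timestamp step = two 0.01 s mel frames


def advance_seek(seek, last_timestamp_pos):
    """Return the next seek position in mel frames."""
    if last_timestamp_pos is None or last_timestamp_pos <= 0:
        # No usable timestamp: move by the full window so the loop always progresses.
        return seek + N_FRAMES
    return seek + last_timestamp_pos * FRAMES_PER_TIMESTAMP


def window_starts(total_frames, decode_window):
    """Collect the seek position at which each window begins."""
    seek = 0
    starts = []
    while seek < total_frames:
        starts.append(seek)
        seek = advance_seek(seek, decode_window(seek))
    return starts


# Hypothetical decoder whose last timestamp always falls at 15 s (750 steps).
starts_with_timestamps = window_starts(6000, lambda seek: 750)
# Hypothetical decoder that never emits timestamps.
starts_without_timestamps = window_starts(6000, lambda seek: None)
```

With timestamps at 15 seconds, a 60-second recording is covered by four windows starting 15 seconds apart; without timestamps it is covered by two full windows.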
Cross-Segment Context
To keep the transcript consistent across windows, the decoder can be conditioned on previous text. When condition_on_previous_text is enabled (the default), text decoded from earlier windows is passed as a prompt to the decoder for the current window, truncated to fit the decoder's context. This helps the model:
- Maintain consistent spelling of names and terminology
- Continue partial sentences across segment boundaries
- Adapt to the speaker's style and vocabulary
An initial prompt can also be provided by the user to prime the first segment with domain-specific vocabulary or formatting preferences.
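A rough sketch of how the prompt might be carried between windows follows. The function names and the token limit used in the example are illustrative assumptions, not values taken from the library.

```python
def build_prompt(history_tokens, max_tokens):
    """Keep only the most recent tokens so the prompt fits the decoder context."""
    if max_tokens <= 0:
        return []
    return history_tokens[-max_tokens:]


def next_history(history_tokens, new_tokens, condition_on_previous_text=True):
    """Return the token history to carry into the next window."""
    if not condition_on_previous_text:
        # Later windows start with an empty prompt.
        return []
    return history_tokens + new_tokens


# The user's initial prompt seeds the history for the first window only.
initial_prompt_tokens = [11, 12, 13]
history = list(initial_prompt_tokens)

first_prompt = build_prompt(history, max_tokens=4)
history = next_history(history, new_tokens=[21, 22, 23])
second_prompt = build_prompt(history, max_tokens=4)
```

Truncating from the left keeps the text closest to the current window, which is the part most likely to help with spelling and sentence continuation.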
Temperature Fallback Strategy
Decoding a segment can fail, producing garbled or repetitive text. The temperature fallback strategy detects failures and retries with progressively higher temperatures:
- Start decoding at the lowest temperature (typically 0.0, greedy decoding).
- Check the output for signs of failure:
  - Compression ratio above a threshold (default 2.4): indicates repetitive text, since highly repetitive strings compress well.
  - Average log probability below a threshold (default -1.0): indicates the model was uncertain about its output.
- If either check fails, retry at the next temperature in the sequence (e.g., 0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Higher temperatures introduce randomness, which can break out of repetitive decoding loops.
- Accept the result from the first temperature that passes both checks. If every temperature fails, the result of the last attempt is used.
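The fallback loop can be sketched as follows. This is a minimal sketch under stated assumptions: decode is a hypothetical callback that takes a temperature and returns a (text, average log probability) pair, and the compression ratio is computed with zlib as the raw byte length divided by the compressed length.

```python
import zlib


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length; high for repetitive text."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
):
    """Try each temperature in order; return (temperature, text, avg_logprob)."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # first attempt that passes both checks
    # If no attempt passed, the values from the last temperature are returned.
    return temperature, text, avg_logprob


def fake_decode(temperature):
    """Hypothetical decoder: loops at low temperature, recovers at 0.4 and above."""
    if temperature < 0.4:
        return "la " * 200, -0.2
    return "hello there, general", -0.3


accepted = decode_with_fallback(fake_decode)
```

In this example the two lowest temperatures yield a looping string whose compression ratio is far above 2.4, so the attempt at temperature 0.4 is the first to be accepted.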
No-Speech Detection
If no_speech_prob exceeds a threshold (default 0.6) and the average log probability is also below its threshold (default -1.0), the window is treated as silence. In the reference implementation such a window is skipped: no text is emitted for it, and the seek pointer advances by the full window.
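The two-part silence test can be written as a single predicate. The function name is_silence is an assumption made for illustration; the default thresholds are the ones quoted above.

```python
def is_silence(no_speech_prob, avg_logprob,
               no_speech_threshold=0.6, logprob_threshold=-1.0):
    """True only when the model reports no speech AND is unsure of its text."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold


silent = is_silence(no_speech_prob=0.9, avg_logprob=-1.5)
confident_speech = is_silence(no_speech_prob=0.9, avg_logprob=-0.3)
```

Requiring both conditions means a window with a high no-speech probability is still kept when the decoder is confident about the text it produced.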
Seek Position and Timing
All timestamps within a decoded segment are relative to the start of the 30-second window. To compute absolute timestamps, the seek position (in mel frames) is converted to seconds and added to the relative timestamps:
- absolute_start = seek_time + relative_start
- absolute_end = seek_time + relative_end
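Where seek_time = seek_position * 0.01, because the seek position counts mel frames and each frame spans 0.01 seconds (100 frames per second). The relative timestamps themselves come from timestamp tokens with a resolution of time_precision = 0.02 seconds, so one timestamp step covers two mel frames.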
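As a worked example of the conversion, the sketch below assumes the seek pointer counts mel frames at 100 frames per second (3000 frames per 30-second window). The function name to_absolute is hypothetical.

```python
FRAMES_PER_SECOND = 100  # 3000 mel frames per 30-second window


def to_absolute(seek_frames, relative_start, relative_end):
    """Shift window-relative times (seconds) by the window's start time."""
    seek_time = seek_frames / FRAMES_PER_SECOND
    return seek_time + relative_start, seek_time + relative_end


# A window starting at frame 4500 begins 45 s into the recording.
absolute_start, absolute_end = to_absolute(4500, 2.5, 7.0)
```

Here a segment decoded at 2.5 to 7.0 seconds within the window maps to 47.5 to 52.0 seconds in the full recording.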
Edge Cases
- Audio shorter than 30 seconds: Zero-padded to fill the window. Only one decoding pass is needed.
- Final segment shorter than 30 seconds: Also zero-padded. The decoder's timestamp tokens will naturally end before the 30-second mark.
- Completely silent segments: Detected by the no-speech check described above. They contribute no text to the context prompt.
References
- Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356