Heuristic:Openai Whisper Temperature Fallback Strategy
| Knowledge Sources | |
|---|---|
| Domains | Decoding, Robustness |
| Last Updated | 2025-06-25 00:00 GMT |
Overview
Progressive temperature fallback strategy that retries decoding at increasing temperatures (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) when quality thresholds are not met, improving transcription robustness.
Description
Whisper's transcription pipeline uses a multi-temperature fallback loop to handle difficult audio segments. It first decodes deterministically at temperature 0.0 (greedy, or beam search when `beam_size` is set). If the result fails a quality check (compression ratio too high or average log probability too low), it retries at progressively higher temperatures, each of which samples with more randomness. A segment that looks like silence (high no-speech probability together with a low average log probability) is accepted without further retries. The strategy balances deterministic accuracy on easy segments against stochastic recovery on difficult ones.
Usage
Use this heuristic when transcription quality is inconsistent across segments. The default temperature tuple `(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)` works well for general use. For audio with frequent challenging segments (accented speech, background noise), the fallback mechanism activates automatically. Disable it by passing a single temperature value (e.g., `temperature=0.0`) for deterministic-only decoding.
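As a rough illustration of the two modes described above, the sketch below builds the keyword arguments for `transcribe()`. The helper names `build_transcribe_kwargs` and `transcribe_file` are hypothetical, and the three threshold values are what I understand the upstream defaults to be; they are not stated on this page, so treat them as assumptions.

```python
from typing import Any, Dict, Tuple, Union

# The default fallback schedule quoted on this page.
DEFAULT_TEMPERATURES: Tuple[float, ...] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def build_transcribe_kwargs(use_fallback: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments to pass to transcribe()."""
    # A tuple enables the fallback loop; a single float disables it.
    temperature: Union[float, Tuple[float, ...]] = (
        DEFAULT_TEMPERATURES if use_fallback else 0.0
    )
    return {
        "temperature": temperature,
        # Assumed to match the upstream defaults.
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
    }


def transcribe_file(path: str, model_name: str = "base", use_fallback: bool = True) -> str:
    """Hypothetical wrapper; needs the openai-whisper package and a model download."""
    import whisper  # imported lazily so the helper above works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **build_transcribe_kwargs(use_fallback))
    return result["text"]
```

Calling `transcribe_file("audio.wav", use_fallback=False)` would decode every segment once at temperature 0.0, trading robustness for predictable latency.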
The Insight (Rule of Thumb)
- Action: Pass a tuple of temperatures to `transcribe()` via the `temperature` parameter.
- Value: Default `(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)`: up to six attempts, from deterministic decoding at 0.0 to sampling from the model's unscaled distribution at 1.0.
- Trade-off: Higher temperatures increase diversity but may reduce accuracy. The fallback loop adds latency for difficult segments (up to 6x decode time per segment in worst case).
- Key detail: At temperature 0 (deterministic decoding), `beam_size` and `patience` are active but `best_of` is disabled. At temperature > 0 (sampling), `beam_size` and `patience` are disabled but `best_of` is active.
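The diversity side of the trade-off can be seen in a small numeric sketch. It assumes the standard form of temperature sampling, in which logits are divided by the temperature before the softmax; the function name and the toy logits are illustrative only and are not taken from Whisper's code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits scaled by 1/temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; t == 0 is handled as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for three candidate tokens.
logits = [4.0, 2.0, 1.0]
for t in (0.2, 0.6, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"t={t}: top-token probability {probs[0]:.3f}")
```

At low temperatures nearly all probability mass sits on the top token, so sampling behaves almost like greedy decoding; as the temperature rises, lower-ranked tokens are drawn more often.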
Reasoning
Autoregressive sequence-to-sequence models can get stuck in repetitive loops (high compression ratio) or produce low-confidence gibberish (low average log probability) on difficult audio. Raising the temperature flattens the token distribution that sampling draws from, so lower-ranked tokens are chosen more often and decoding is more likely to break out of a degenerate mode. The progressive schedule ensures easy segments are decoded once at temperature 0, while difficult segments get several further attempts with increasing randomness.
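To make the "repetitive loop" signal concrete, here is a minimal sketch of a compression-ratio check using `zlib`. It follows what I understand Whisper's own helper to do (raw UTF-8 length divided by compressed length), but treat the exact formula and the 2.4 threshold as assumptions; the two sample strings are invented.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


THRESHOLD = 2.4  # assumed default for compression_ratio_threshold

looped = "and then we went to the store " * 20  # a stuck, repeating hypothesis
normal = "The quarterly report shows revenue grew while costs stayed roughly flat."

for label, text in (("looped", looped), ("normal", normal)):
    ratio = compression_ratio(text)
    print(f"{label}: ratio={ratio:.2f} needs_fallback={ratio > THRESHOLD}")
```

Repeated text compresses very well, so its ratio is far above the threshold, while an ordinary sentence stays close to 1.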
Code evidence from `whisper/transcribe.py:184-223`:
    def decode_with_fallback(segment: torch.Tensor) -> DecodingResult:
        temperatures = (
            [temperature] if isinstance(temperature, (int, float)) else temperature
        )
        decode_result = None

        for t in temperatures:
            kwargs = {**decode_options}
            if t > 0:
                # disable beam_size and patience when t > 0
                kwargs.pop("beam_size", None)
                kwargs.pop("patience", None)
            else:
                # disable best_of when t == 0
                kwargs.pop("best_of", None)

            options = DecodingOptions(**kwargs, temperature=t)
            decode_result = model.decode(segment, options)

            needs_fallback = False
            if (
                compression_ratio_threshold is not None
                and decode_result.compression_ratio > compression_ratio_threshold
            ):
                needs_fallback = True  # too repetitive
            if (
                logprob_threshold is not None
                and decode_result.avg_logprob < logprob_threshold
            ):
                needs_fallback = True  # average log probability is too low
            if (
                no_speech_threshold is not None
                and decode_result.no_speech_prob > no_speech_threshold
                and logprob_threshold is not None
                and decode_result.avg_logprob < logprob_threshold
            ):
                needs_fallback = False  # silence
            if not needs_fallback:
                break

        return decode_result
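Additionally, when a segment was decoded at a temperature above 0.5, the tokens produced so far are no longer fed forward as the prompt for later segments, so a likely unreliable transcript does not condition what follows. From `whisper/transcribe.py:503-505`: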
    if not condition_on_previous_text or result.temperature > 0.5:
        # do not feed the prompt tokens if a high temperature was used
        prompt_reset_since = len(all_tokens)
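The excerpts above are closures that depend on variables from the enclosing `transcribe()` call, so they cannot be run on their own. The following toy version is a sketch under stated assumptions: `ToyResult` and `stub_decode` are hypothetical stand-ins for `DecodingResult` and `model.decode`, and the threshold defaults are assumed. It reproduces the same decision logic and counts how many decode attempts a segment costs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class ToyResult:
    """Hypothetical stand-in for the fields the fallback loop inspects."""
    temperature: float
    compression_ratio: float
    avg_logprob: float
    no_speech_prob: float = 0.0


def decode_with_fallback(
    decode: Callable[[float], ToyResult],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
) -> Tuple[Optional[ToyResult], int]:
    """Return the accepted (or last) result and the number of attempts made."""
    result: Optional[ToyResult] = None
    attempts = 0
    for t in temperatures:
        result = decode(t)
        attempts += 1
        too_repetitive = (
            compression_ratio_threshold is not None
            and result.compression_ratio > compression_ratio_threshold
        )
        low_confidence = (
            logprob_threshold is not None
            and result.avg_logprob < logprob_threshold
        )
        # Low confidence plus high no-speech probability is treated as silence.
        silence = (
            no_speech_threshold is not None
            and result.no_speech_prob > no_speech_threshold
            and low_confidence
        )
        if not ((too_repetitive or low_confidence) and not silence):
            break
    return result, attempts


def stub_decode(t: float) -> ToyResult:
    """Pretend the segment loops until the temperature reaches 0.4."""
    if t < 0.4:
        return ToyResult(t, compression_ratio=5.0, avg_logprob=-0.3)
    return ToyResult(t, compression_ratio=1.5, avg_logprob=-0.5)


result, attempts = decode_with_fallback(stub_decode)
reset_prompt = result.temperature > 0.5  # mirrors the prompt-reset rule
print(attempts, result.temperature, reset_prompt)
```

A decoder that never passes the checks would run all six temperatures, which is the worst-case latency noted in the trade-off, and its final temperature of 1.0 would also trigger the prompt reset.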