Heuristic:Openai Whisper No Speech Detection
| Knowledge Sources | |
|---|---|
| Domains | Decoding, Voice_Activity_Detection |
| Last Updated | 2025-06-25 00:00 GMT |
Overview
Whisper treats a segment as silence and skips it when its no-speech probability exceeds 0.6 (the default `no_speech_threshold`) and its average log probability is low, which prevents hallucinated text on silent audio.
Description
Whisper's decoder produces a `no_speech_prob` value at the start-of-transcript (SOT) token position: the model's estimate that the audio window contains no speech. When this probability exceeds `no_speech_threshold` (0.6 by default) and the segment's average log probability does not exceed `logprob_threshold`, the segment is classified as silence and skipped entirely. This prevents the model from hallucinating text for silent or noise-only segments.
Usage
Use this heuristic to skip silent segments during transcription. The threshold is configurable via the `no_speech_threshold` parameter in `transcribe()`. Set to `None` to disable silence detection. Lower values are more aggressive at skipping (may skip quiet speech); higher values only skip very confident silence.
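A minimal usage sketch, assuming the `openai-whisper` package is installed. The helper names `build_transcribe_options` and `transcribe_skipping_silence`, the `"base"` model choice, and the audio path passed by the caller are illustrative assumptions, not part of the original page. The `-1.0` default for `logprob_threshold` mirrors the default in `transcribe()`.

```python
from typing import Any, Dict, Optional


def build_transcribe_options(
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the two silence-detection thresholds as kwargs."""
    if no_speech_threshold is not None and not 0.0 <= no_speech_threshold <= 1.0:
        raise ValueError("no_speech_threshold is a probability and must lie in [0, 1]")
    return {
        "no_speech_threshold": no_speech_threshold,  # None disables silence detection
        "logprob_threshold": logprob_threshold,
    }


def transcribe_skipping_silence(
    audio_path: str, no_speech_threshold: Optional[float] = 0.6
) -> Dict[str, Any]:
    """Hypothetical wrapper around Whisper's transcribe() with an explicit threshold."""
    import whisper  # imported lazily; assumes openai-whisper is installed

    model = whisper.load_model("base")  # model size chosen only for illustration
    return model.transcribe(audio_path, **build_transcribe_options(no_speech_threshold))
```

Each entry in the returned `result["segments"]` carries its own `no_speech_prob` and `avg_logprob`, which can help when choosing a threshold for a particular recording.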
The Insight (Rule of Thumb)
- Action: Set `no_speech_threshold=0.6` (default) in `transcribe()`.
- Value: 0.6. A no-speech probability above this value, combined with a low average log probability, triggers segment skipping.
- Trade-off: Too low may skip quiet speech; too high may allow hallucinations during silent pauses.
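- Key detail: The no-speech check requires both conditions: high no-speech probability AND low average log probability. If the average log probability is above `logprob_threshold` (the model is confident in its output), the segment is not skipped even with a high no-speech probability. If `logprob_threshold` is `None`, the no-speech probability alone decides.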
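The rule of thumb above can be sketched as a standalone predicate. This is an illustrative restatement, not Whisper's own function: the name `should_skip_segment` is hypothetical, and the probability and log-probability values in the examples are made up to cover each case.

```python
from typing import Optional


def should_skip_segment(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """Return True when a segment would be treated as silence and skipped."""
    if no_speech_threshold is None:
        return False  # silence detection disabled
    if no_speech_prob <= no_speech_threshold:
        return False  # the model does not consider the window silent enough
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        return False  # confident decoding overrides the no-speech signal
    return True


# Illustrative values only.
print(should_skip_segment(0.9, -1.5))  # silent and unconfident  -> True
print(should_skip_segment(0.9, -0.3))  # silent but confident    -> False
print(should_skip_segment(0.4, -1.5))  # below the 0.6 threshold -> False
print(should_skip_segment(0.9, -0.3, logprob_threshold=None))  # -> True
```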
Reasoning
Whisper is trained with a special `<|nospeech|>` token. The model learns to assign high probability to this token when the audio segment contains no speech. However, the no-speech signal alone is not sufficient: the model sometimes assigns a moderate no-speech probability to quiet speech. Requiring both a high no-speech probability and a low average log probability lets a confident decoding override the no-speech signal, so quiet speech is less likely to be discarded as silence.
Code evidence from `whisper/transcribe.py:46,298-310`:
    # parameter default in the transcribe() signature
    no_speech_threshold: Optional[float] = 0.6,

    # inside the decoding loop
    if no_speech_threshold is not None:
        # no voice activity check
        should_skip = result.no_speech_prob > no_speech_threshold
        if (
            logprob_threshold is not None
            and result.avg_logprob > logprob_threshold
        ):
            # don't skip if the logprob is high enough, despite the no_speech_prob
            should_skip = False

        if should_skip:
            seek += segment_size  # fast-forward to the next segment boundary
            continue
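To show the effect of the `seek += segment_size` / `continue` branch, here is a simplified, self-contained simulation. `WindowResult`, `keep_spoken_windows`, and the sample values are hypothetical. The fixed `segment_size` of 3000 frames per window is an assumption for illustration, and the sketch advances `seek` by a full window for kept segments too, which is simpler than the real loop.

```python
from typing import List, NamedTuple, Optional, Tuple


class WindowResult(NamedTuple):
    """Hypothetical stand-in for the decoding result of one audio window."""
    text: str
    no_speech_prob: float
    avg_logprob: float


def keep_spoken_windows(
    results: List[WindowResult],
    segment_size: int = 3000,  # assumed frames per window, for illustration
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> List[Tuple[int, str]]:
    """Return (seek, text) pairs for windows that survive the no-speech check."""
    kept: List[Tuple[int, str]] = []
    seek = 0
    for result in results:
        if no_speech_threshold is not None:
            should_skip = result.no_speech_prob > no_speech_threshold
            if logprob_threshold is not None and result.avg_logprob > logprob_threshold:
                should_skip = False  # confident output is kept
            if should_skip:
                seek += segment_size  # jump to the next window, emit nothing
                continue
        kept.append((seek, result.text))
        seek += segment_size  # simplification: always advance by a full window
    return kept


# Made-up decoding results for three consecutive windows.
windows = [
    WindowResult("hello there", 0.05, -0.2),           # clear speech
    WindowResult("Thanks for watching!", 0.92, -1.4),  # likely hallucination on silence
    WindowResult("okay", 0.70, -0.4),                  # quiet speech, confident decoding
]
print(keep_spoken_windows(windows))  # [(0, 'hello there'), (6000, 'okay')]
```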
No-speech probability collection from `whisper/decoding.py:689-693`:
    if (
        i == 0 and self.tokenizer.no_speech is not None
    ):  # save no_speech_probs
        probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
        no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
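The excerpt above takes a softmax over the vocabulary logits at the SOT position and reads off the entry for the no-speech token. A dependency-free sketch of that computation for a single sequence follows. The four-token vocabulary, the logit values, and the function names are assumptions for illustration, and plain Python lists stand in for tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def no_speech_probability(logits_at_sot: List[float], no_speech_token_id: int) -> float:
    """Probability mass assigned to the no-speech token at the SOT position."""
    return softmax(logits_at_sot)[no_speech_token_id]


# Toy vocabulary of four tokens; index 3 plays the role of <|nospeech|>.
toy_logits = [0.0, 0.0, 0.0, math.log(3.0)]
prob = no_speech_probability(toy_logits, no_speech_token_id=3)
print(round(prob, 3))  # 0.5, which is below the 0.6 threshold
```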