Heuristic:Openai Whisper No Speech Detection

Knowledge Sources	OpenAI Whisper
Domains	Decoding, Voice_Activity_Detection
Last Updated	2025-06-25 00:00 GMT

Overview

A no-speech probability threshold of 0.6 is used to detect silent audio segments and skip them during transcription, which reduces hallucinated text output.

Description

Whisper's decoder produces a `no_speech_prob` value at the start-of-transcript (SOT) token position, representing the model's estimate that the audio segment contains no speech. When this probability exceeds `no_speech_threshold` (0.6 by default) and the segment's average log probability is not above `logprob_threshold` (-1.0 by default in `transcribe()`), the segment is classified as silence and skipped entirely. This keeps the model from hallucinating text for silent or noise-only segments.

Usage

Use this heuristic to skip silent segments during transcription. The threshold is configurable via the `no_speech_threshold` parameter in `transcribe()`. Set to `None` to disable silence detection. Lower values are more aggressive at skipping (may skip quiet speech); higher values only skip very confident silence.
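A minimal sketch of passing the threshold through to `transcribe()`. The wrapper name `transcribe_skipping_silence` and the file name `meeting.wav` are hypothetical and used only for illustration. `no_speech_threshold` and `logprob_threshold` are the keyword arguments that appear in the code evidence further down.

```python
from typing import Any, Optional


def transcribe_skipping_silence(
    model: Any,
    audio_path: str,
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> dict:
    """Hypothetical wrapper: forward the two silence-related thresholds.

    `model` is assumed to expose a Whisper-style `transcribe()` method.
    Passing no_speech_threshold=None disables the silence check.
    """
    return model.transcribe(
        audio_path,
        no_speech_threshold=no_speech_threshold,
        logprob_threshold=logprob_threshold,
    )


# Typical use with the openai-whisper package (requires the package
# and a model download, so it is left as a comment here):
#   import whisper
#   model = whisper.load_model("base")
#   result = transcribe_skipping_silence(model, "meeting.wav", no_speech_threshold=0.5)
#   print(result["text"])
```

Keeping both thresholds in one place makes it easier to tune them together, since the skip decision depends on both.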

The Insight (Rule of Thumb)

Action: Set `no_speech_threshold=0.6` (default) in `transcribe()`.
Value: 0.6 — no-speech probability above this, combined with low log probability, triggers segment skipping.
Trade-off: Too low may skip quiet speech; too high may allow hallucinations during silent pauses.
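Key detail: With `logprob_threshold` set, the no-speech check requires both conditions: high no-speech probability AND low average log probability. If the log probability is high (the model is confident in its decoded text), the segment is not skipped even with high no-speech probability. If `logprob_threshold` is `None`, the no-speech probability alone decides.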
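The rule above can be sketched as a standalone function. The name `should_skip_segment` is hypothetical. The logic mirrors the check quoted from `whisper/transcribe.py` in the Reasoning section, and the default values are assumed to match those of `transcribe()`.

```python
from typing import Optional


def should_skip_segment(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """Return True when a segment should be treated as silence."""
    if no_speech_threshold is None:
        # Silence detection disabled: never skip.
        return False
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # The decoded text looks confident, so keep the segment.
        skip = False
    return skip


print(should_skip_segment(0.9, -1.5))  # True: likely silence, low confidence
print(should_skip_segment(0.9, -0.3))  # False: confident text overrides
print(should_skip_segment(0.4, -1.5))  # False: no-speech probability too low
print(should_skip_segment(0.9, -0.3, logprob_threshold=None))  # True
```

The second call is the case the key detail describes: a high no-speech probability does not trigger a skip when the average log probability is above the threshold.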

Reasoning

Whisper is trained with a special `<|nospeech|>` token. The model learns to assign high probability to this token when the audio segment contains no speech. However, the no-speech signal alone is not sufficient — the model sometimes assigns moderate no-speech probability to quiet speech. By requiring both high no-speech probability and low log probability, the heuristic achieves better precision in silence detection.

Code evidence from `whisper/transcribe.py:46,298-310`:

```python
no_speech_threshold: Optional[float] = 0.6,
```

```python
if no_speech_threshold is not None:
    # no voice activity check
    should_skip = result.no_speech_prob > no_speech_threshold
    if (
        logprob_threshold is not None
        and result.avg_logprob > logprob_threshold
    ):
        # don't skip if the logprob is high enough, despite the no_speech_prob
        should_skip = False

    if should_skip:
        seek += segment_size  # fast-forward to the next segment boundary
        continue
```

No-speech probability collection from `whisper/decoding.py:689-693`:

```python
if (
    i == 0 and self.tokenizer.no_speech is not None
):  # save no_speech_probs
    probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
    no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
```
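To make the collection step concrete, here is a dependency-free sketch of the same computation for a single sequence: a softmax over the vocabulary logits at the SOT position, followed by a lookup of the no-speech token's entry. The function name `no_speech_probability`, the four-entry toy vocabulary, and the token index are illustrative assumptions, not values from Whisper.

```python
import math
from typing import Sequence


def no_speech_probability(sot_logits: Sequence[float], no_speech_id: int) -> float:
    """Softmax over the logits at the SOT position, read at the no-speech token."""
    peak = max(sot_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in sot_logits]
    return exps[no_speech_id] / sum(exps)


# Toy example: index 1 plays the role of the no-speech token (assumption).
toy_logits = [1.0, 3.0, 0.5, 0.2]
prob = no_speech_probability(toy_logits, no_speech_id=1)
print(f"no_speech_prob = {prob:.3f}")
print("exceeds 0.6 threshold:", prob > 0.6)
```

In the real decoder this is done once per batch at the first decoding step (`i == 0`), which is why a single value per segment is available to the skip check.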
