Heuristic:Openai Whisper Median Word Duration Clamping

From Leeroopedia
Knowledge Sources
Domains Timestamps, Quality_Control
Last Updated 2025-06-25 00:00 GMT

Overview

Word-duration anomaly heuristic that limits word duration to 2x the median word duration (with the median itself capped at 0.7s) and truncates anomalously long words at sentence boundaries and after long pauses.

Description

After DTW-based word-level timestamp alignment, some words may be assigned implausibly long durations because of alignment errors or silence gaps. This heuristic computes the median duration of all words with non-zero duration, caps that median at 0.7 seconds, and uses 2x the capped median as the maximum allowed word duration. Words that exceed this limit at sentence boundaries are truncated. Additional heuristics handle the first words after a long pause and keep segment-level timestamps consistent with the adjusted word timestamps.

Usage

This heuristic applies whenever word-level timestamps are enabled (`word_timestamps=True`). It is applied automatically by `add_word_timestamps()` in `whisper/timing.py`. It matters most for alignment artifacts at sentence boundaries and after pauses.
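To see what the heuristic leaves behind, it can help to scan a transcription result for words that are still long. The sketch below is not part of Whisper: `find_long_words` is a hypothetical helper, and the example dictionary is hand-written. It assumes only the result shape produced when `word_timestamps=True`, that is, a `"segments"` list whose items carry a `"words"` list of dicts with `"word"`, `"start"`, and `"end"` keys.

```python
def find_long_words(result, max_duration):
    """Return (word, duration) pairs whose duration exceeds max_duration.

    `result` is assumed to be shaped like a transcription result produced
    with word_timestamps=True. This helper is illustrative, not Whisper API.
    """
    long_words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            duration = word["end"] - word["start"]
            if duration > max_duration:
                long_words.append((word["word"], round(duration, 3)))
    return long_words


# Hand-written stand-in for a transcription result (not real model output).
example_result = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4},
                {"word": " world.", "start": 0.4, "end": 2.9},
            ]
        }
    ]
}

print(find_long_words(example_result, max_duration=1.4))
```

With a real model, `result` would come from `whisper.load_model(...)` followed by `model.transcribe(path, word_timestamps=True)`.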

The Insight (Rule of Thumb)

  • Action: Compute median word duration, cap at 0.7s, and truncate words exceeding 2x median at sentence boundaries.
  • Value: `median_duration = min(0.7, median(word_durations))` and `max_duration = median_duration * 2`.
  • Trade-off: Aggressive clamping may clip legitimately long words (compound words, slow speech); an overly permissive limit lets alignment artifacts propagate.
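  • Sentence boundaries: Sentence-ending punctuation marks `.。!!??` get special treatment. If the punctuation token itself exceeds max_duration, its end is clamped to start + max_duration; if the word immediately after a sentence-ending mark exceeds it, that word's start is moved forward to end - max_duration.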
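The threshold rule above can be sketched in a few lines. This is a minimal illustration rather than the library code: `max_word_duration` is a hypothetical name, the duration lists are made up, and the standard-library `statistics.median` stands in for `np.median` so the snippet has no dependencies.

```python
from statistics import median

MEDIAN_CAP_SECONDS = 0.7  # cap on the median, taken from the rule above


def max_word_duration(durations):
    """Return (median_duration, max_duration) for a list of word durations."""
    # Zero-length words are ignored so they do not drag the median down.
    nonzero = [d for d in durations if d != 0]
    median_duration = median(nonzero) if nonzero else 0.0
    median_duration = min(MEDIAN_CAP_SECONDS, float(median_duration))
    return median_duration, median_duration * 2


# Made-up durations in seconds: a fast speaker with one outlier, then a slow one.
fast = [0.15, 0.2, 0.0, 0.25, 0.2, 1.8]
slow = [0.8, 0.9, 1.0, 0.9]

print(max_word_duration(fast))  # (0.2, 0.4)
print(max_word_duration(slow))  # (0.7, 1.4)
```

The fast speaker's limit follows the observed median, while the slow speaker's median of 0.9s is capped at 0.7s, so the limit never exceeds 1.4s.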

Reasoning

DTW alignment derives word boundaries from attention weight patterns, which can be noisy. In practice, most words in natural speech last under 0.7 seconds. Words that appear much longer than the median are almost always alignment artifacts, especially at sentence boundaries, where the attention pattern may diffuse over silence. Because the limit is relative (2x the median rather than a fixed threshold), it adapts to different speaking speeds, up to the 0.7s cap on the median.

Code evidence from `whisper/timing.py:301-317`:

word_durations = np.array([t.end - t.start for t in alignment])
word_durations = word_durations[word_durations.nonzero()]
median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
median_duration = min(0.7, float(median_duration))
max_duration = median_duration * 2

# hack: truncate long words at sentence boundaries.
# a better segmentation algorithm based on VAD should be able to replace this.
if len(word_durations) > 0:
    sentence_end_marks = ".。!!??"
    # ensure words at sentence boundaries are not longer than twice the median word duration.
    for i in range(1, len(alignment)):
        if alignment[i].end - alignment[i].start > max_duration:
            if alignment[i].word in sentence_end_marks:
                alignment[i].end = alignment[i].start + max_duration
            elif alignment[i - 1].word in sentence_end_marks:
                alignment[i].start = alignment[i].end - max_duration
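The excerpt above depends on state from the surrounding function, so here is a self-contained sketch of the same loop that can be run in isolation. The `Word` dataclass, the function name, and the example timings are assumptions made for illustration; the mark set is assumed to include the fullwidth forms of `!` and `?`.

```python
from dataclasses import dataclass


@dataclass
class Word:  # hypothetical stand-in for the alignment entries
    word: str
    start: float
    end: float


# Assumed mark set, including fullwidth exclamation and question marks.
SENTENCE_END_MARKS = ".。!!??"


def clamp_at_sentence_boundaries(alignment, max_duration):
    """Shorten over-long words that are, or directly follow, a sentence end mark."""
    for i in range(1, len(alignment)):
        current, previous = alignment[i], alignment[i - 1]
        if current.end - current.start > max_duration:
            # `in` is a substring test, so `word` is expected to be the bare
            # punctuation token, as in the excerpt above.
            if current.word in SENTENCE_END_MARKS:
                current.end = current.start + max_duration
            elif previous.word in SENTENCE_END_MARKS:
                current.start = current.end - max_duration
    return alignment


# Made-up alignment: the "." and the following word both absorb 2s of silence.
words = [
    Word("Hi", 0.0, 0.3),
    Word(".", 0.3, 2.3),
    Word("Then", 2.3, 4.3),
    Word("go", 4.3, 4.6),
]
for w in clamp_at_sentence_boundaries(words, max_duration=0.6):
    print(w.word, round(w.start, 2), round(w.end, 2))
```

In this example the period keeps its start and loses its tail, while "Then" keeps its end and loses its head, so the silence between the two sentences is no longer attributed to either word.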

Post-pause word clamping from `whisper/timing.py:344-362`:

# hack: truncate long words at segment boundaries.
# a better segmentation algorithm based on VAD should be able to replace this.
if len(words) > 0:
    # ensure the first and second word after a pause is not longer than
    # twice the median word duration.
    if words[0]["end"] - last_speech_timestamp > median_duration * 4 and (
        words[0]["end"] - words[0]["start"] > max_duration
        or (
            len(words) > 1
            and words[1]["end"] - words[0]["start"] > max_duration * 2
        )
    ):
        ...
        words[0]["start"] = max(0, words[0]["end"] - max_duration)
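The post-pause condition is easier to read when its three parts are named. The sketch below covers only the lines shown in the excerpt: the step elided there is deliberately not reproduced, and the function name, variable names, and example values are hypothetical.

```python
def clamp_first_word_after_pause(
    words, last_speech_timestamp, median_duration, max_duration
):
    """Pull the first word's start forward when it follows a long pause."""
    if not words:
        return words

    # The first word ends long after the last known speech.
    long_pause = words[0]["end"] - last_speech_timestamp > median_duration * 4
    # The first word alone is longer than the limit.
    first_too_long = words[0]["end"] - words[0]["start"] > max_duration
    # The first two words together are longer than twice the limit.
    pair_too_long = (
        len(words) > 1
        and words[1]["end"] - words[0]["start"] > max_duration * 2
    )

    if long_pause and (first_too_long or pair_too_long):
        # The step elided in the excerpt is intentionally left out here.
        words[0]["start"] = max(0, words[0]["end"] - max_duration)
    return words


# Made-up segment: speech ended at 1.0s, but the first word spans 1.0s to 4.0s.
segment_words = [
    {"word": " So", "start": 1.0, "end": 4.0},
    {"word": " yes", "start": 4.0, "end": 4.3},
]
clamp_first_word_after_pause(
    segment_words, last_speech_timestamp=1.0, median_duration=0.3, max_duration=0.6
)
print(round(segment_words[0]["start"], 2))  # 3.4
```

Without a long pause, for example with `last_speech_timestamp=3.9`, the condition is false and the word is left untouched even if it is long.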
