Implementation:Openai Whisper Merge Punctuations

From Leeroopedia

Overview

merge_punctuations() and add_word_timestamps() are functions in Whisper's timing module (whisper/timing.py). merge_punctuations() performs a two-pass, in-place merge of standalone punctuation entries into adjacent words. add_word_timestamps() is the high-level function that runs the alignment pipeline and distributes the resulting word timestamps across segments.

Source

Signatures

merge_punctuations

def merge_punctuations(alignment: List[WordTiming], prepended: str, appended: str) -> None:

add_word_timestamps

def add_word_timestamps(
    *,
    segments: List[dict],
    model: "Whisper",
    tokenizer: Tokenizer,
    mel: torch.Tensor,
    num_frames: int,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,,!!??::”)]}",
    last_speech_timestamp: float,
    **kwargs,
) -> None:

Parameters

merge_punctuations

Parameter Type Description
alignment List[WordTiming] Word timing entries to modify in-place
prepended str Punctuation characters to merge with the next word (e.g., "'“¿([{-)
appended str Punctuation characters to merge with the previous word (e.g., "'.。,!?:)]})

add_word_timestamps

Parameter Type Description
segments List[dict] Segment dictionaries to add word timestamps to (modified in-place)
model Whisper Whisper model instance for running alignment
tokenizer Tokenizer Tokenizer for token processing
mel torch.Tensor Mel spectrogram of the full audio
num_frames int Total number of audio frames
prepend_punctuations str Leading punctuation to merge forward (default: "'“¿([{-)
append_punctuations str Trailing punctuation to merge backward (default: "'.。,!?:)]})
last_speech_timestamp float Last detected speech timestamp for duration heuristics

Return Value

Both functions return None. They modify their inputs in-place.

Behavior

merge_punctuations

Source: whisper/timing.py:L245-277

Performs a two-pass merge:

Pass 1 (reverse iteration -- leading punctuation):

  1. Iterates through the alignment list in reverse order.
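  2. If an entry's word starts with a space and its stripped text is found in prepended, merges it into the following word.
  3. The merge concatenates the punctuation entry's text onto the front of the following word and prepends its tokens to that word's token list.
  4. The punctuation entry is then emptied (word set to an empty string, tokens to an empty list). It stays in the list, so the list length does not change.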

Pass 2 (forward iteration -- trailing punctuation):

  1. Iterates through the alignment list forward.
  2. If an entry's word is found in appended and the preceding word does not end with a space, merges it into the preceding word.
  3. The merge appends the punctuation text to the preceding word and extends that word's token list.
  4. The punctuation entry is emptied in the same way and left in place.
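
The two passes above can be sketched in a few lines of standalone Python. This is a minimal illustration rather than Whisper's actual code: the Word dataclass is a hypothetical stand-in for WordTiming (timing fields omitted), merge_punctuations_sketch is an illustrative name, and the leading-space check in pass 1 is an assumption about how word-initial punctuation is tokenized.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for WordTiming; timing fields are omitted."""
    word: str
    tokens: List[int] = field(default_factory=list)


def merge_punctuations_sketch(alignment: List[Word], prepended: str, appended: str) -> None:
    # Pass 1: walk backwards, folding leading punctuation into the next word.
    i, j = len(alignment) - 2, len(alignment) - 1
    while i >= 0:
        previous, following = alignment[i], alignment[j]
        # Assumption: a leading mark opens a new word, so it carries a leading space.
        if previous.word.startswith(" ") and previous.word.strip() in prepended:
            following.word = previous.word + following.word
            following.tokens = previous.tokens + following.tokens
            previous.word, previous.tokens = "", []
        else:
            j = i
        i -= 1

    # Pass 2: walk forwards, folding trailing punctuation into the previous word.
    i, j = 0, 1
    while j < len(alignment):
        previous, following = alignment[i], alignment[j]
        if not previous.word.endswith(" ") and following.word in appended:
            previous.word = previous.word + following.word
            previous.tokens = previous.tokens + following.tokens
            following.word, following.tokens = "", []
        else:
            i = j
        j += 1


words = [Word(' "', [1]), Word("Hello", [2]), Word(",", [3]), Word(" world", [4]), Word(".", [5])]
merge_punctuations_sketch(words, prepended="\"'([{-", appended="\"'.,!?:)]}")
merged = [w for w in words if w.word]  # drop the emptied entries
print([w.word for w in merged])  # [' "Hello,', ' world.']
```

The list keeps its original length after the merge; the emptied entries are only dropped by the separate filtering step shown on the second-to-last line.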

add_word_timestamps

Source: whisper/timing.py:L279-388

Orchestrates the full word-level timestamp pipeline:

  1. Alignment: Collects the text tokens of all segments (excluding special tokens) and calls find_alignment() on them together with the mel spectrogram.
  2. Duration heuristics: Computes the median of the nonzero word durations and treats twice that value as the maximum plausible word duration. Words at sentence boundaries that exceed this limit have their start or end time pulled in.
  3. Punctuation merging: Calls merge_punctuations() with the configured punctuation sets.
  4. Filtering: Removes empty entries created during punctuation merging.
  5. Segment distribution: Distributes the aligned words across segments, adjusting word boundaries at segment transitions.
  6. In-place update: Sets the "words" key on each segment dictionary with the word timing data.
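
The duration heuristic in step 2 can be illustrated with a small standalone sketch. It assumes only what is described above (a cap of twice the median word duration) and simplifies the rest: clamp_long_words is a hypothetical helper, it works on plain (start, end) tuples rather than WordTiming objects, it returns a new list instead of updating in place, and it clamps every over-long word. Whisper's own code may apply additional conditions that are not modeled here.

```python
from statistics import median
from typing import List, Tuple


def clamp_long_words(
    spans: List[Tuple[float, float]], factor: float = 2.0
) -> List[Tuple[float, float]]:
    """Cap each (start, end) span at `factor` times the median nonzero duration."""
    durations = [end - start for start, end in spans if end > start]
    if not durations:
        return list(spans)
    max_duration = factor * median(durations)
    clamped = []
    for start, end in spans:
        if end - start > max_duration:
            end = start + max_duration  # pull the end time back toward the start
        clamped.append((start, end))
    return clamped


spans = [(0.0, 0.5), (0.5, 1.0), (1.0, 4.0)]
print(clamp_long_words(spans))  # [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]
```

Here the median duration is 0.5 s, so the limit is 1.0 s and only the third word, which spans 3.0 s, is shortened.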

Example Usage

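from whisper.timing import WordTiming, merge_punctuations

# Build WordTiming entries; token ids and timings are placeholders for illustration.
words = [' "', "Hello", ",", " world", "."]
alignment = [WordTiming(w, [i], 0.0, 0.0, 0.0) for i, w in enumerate(words)]
merge_punctuations(alignment, prepended='"¿([{-', appended="'.。,!?:")
print([t.word for t in alignment])
# Expected: ['', ' "Hello,', '', ' world.', '']  (empty entries are filtered later)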

Full Pipeline Example

from whisper.timing import add_word_timestamps

# Assumes result, model, tokenizer, mel, num_frames, and last_speech_ts
# were produced by an earlier transcription step.
add_word_timestamps(
    segments=result["segments"],
    model=model,
    tokenizer=tokenizer,
    mel=mel,
    num_frames=num_frames,
    last_speech_timestamp=last_speech_ts,
)

# Each segment now has a "words" key
for segment in result["segments"]:
    for word_info in segment["words"]:
        print(f"[{word_info['start']:.2f}-{word_info['end']:.2f}] {word_info['word']}")

Links

Principle:Openai_Whisper_Punctuation_Merging
Heuristic:Openai_Whisper_Median_Word_Duration_Clamping

Metadata

2025-06-25 00:00 GMT
