Implementation:Openai Whisper Merge Punctuations

Overview

merge_punctuations() and add_word_timestamps() are functions in Whisper's timing module. merge_punctuations() performs a two-pass, in-place merge of punctuation into adjacent words. add_word_timestamps() is the high-level entry point for word-level timestamps: it runs the alignment, applies duration heuristics and punctuation merging, and distributes the resulting word timestamps across segments.

Source

File: whisper/timing.py, lines 245-388
Repository: https://github.com/openai/whisper
Import: from whisper.timing import merge_punctuations, add_word_timestamps

Signatures

merge_punctuations

def merge_punctuations(alignment: List[WordTiming], prepended: str, appended: str) -> None:

add_word_timestamps

def add_word_timestamps(
    *,
    segments: List[dict],
    model: "Whisper",
    tokenizer: Tokenizer,
    mel: torch.Tensor,
    num_frames: int,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    last_speech_timestamp: float,
    **kwargs,
) -> None:

Parameters

merge_punctuations

Parameter	Type	Description
alignment	List[WordTiming]	Word timing entries to modify in-place
prepended	str	Punctuation characters to merge with the next word (e.g., `"'“¿([{-`)
appended	str	Punctuation characters to merge with the previous word (e.g., `"'.。,!?:)]}`)

add_word_timestamps

Parameter	Type	Description
segments	List[dict]	Segment dictionaries to add word timestamps to (modified in-place)
model	Whisper	Whisper model instance for running alignment
tokenizer	Tokenizer	Tokenizer for token processing
mel	torch.Tensor	Mel spectrogram of the full audio
num_frames	int	Total number of audio frames
prepend_punctuations	str	Leading punctuation to merge forward (default: `"'“¿([{-`)
append_punctuations	str	Trailing punctuation to merge backward (default: `"'.。,，!！?？:：”)]}、`)
last_speech_timestamp	float	Last detected speech timestamp for duration heuristics
**kwargs		Extra keyword arguments forwarded to find_alignment()

Return Value

Both functions return None. They modify their inputs in-place.

Behavior

merge_punctuations

Source: whisper/timing.py:L245-277

Performs a two-pass merge:

Pass 1 (reverse iteration -- leading punctuation):

Iterates through the alignment list in reverse order.
If an entry begins with a space and its stripped text is found in the prepended string, it is merged into the following entry.
The merge prepends the punctuation entry's text and tokens to those of the following entry.
The punctuation entry is then emptied (word set to an empty string, tokens to an empty list).

Pass 2 (forward iteration -- trailing punctuation):

Iterates through the alignment list forward.
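If an entry's text is found in the appended string and the preceding entry does not end with a space, it is merged into the preceding entry.
The merge appends the punctuation entry's text and tokens to those of the preceding entry.
The punctuation entry is then emptied (word set to an empty string, tokens to an empty list).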
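
The two passes can be sketched in a self-contained form as follows. This is a minimal illustration that follows the description above, not a copy of the library code: the WordTiming class here is a simplified stand-in (the real one carries more fields), and merge_punctuations_sketch is a hypothetical name chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordTiming:
    # Simplified stand-in for whisper.timing.WordTiming (assumption for this sketch).
    word: str
    tokens: List[int] = field(default_factory=list)


def merge_punctuations_sketch(alignment: List[WordTiming], prepended: str, appended: str) -> None:
    # Pass 1 (reverse): fold leading punctuation into the entry that follows it.
    i, j = len(alignment) - 2, len(alignment) - 1
    while i >= 0:
        previous, following = alignment[i], alignment[j]
        if previous.word.startswith(" ") and previous.word.strip() in prepended:
            following.word = previous.word + following.word
            following.tokens = previous.tokens + following.tokens
            previous.word, previous.tokens = "", []
        else:
            j = i  # this entry becomes the merge target for earlier punctuation
        i -= 1

    # Pass 2 (forward): fold trailing punctuation into the entry that precedes it.
    i, j = 0, 1
    while j < len(alignment):
        previous, following = alignment[i], alignment[j]
        if not previous.word.endswith(" ") and following.word in appended:
            previous.word = previous.word + following.word
            previous.tokens = previous.tokens + following.tokens
            following.word, following.tokens = "", []
        else:
            i = j  # this entry becomes the merge target for later punctuation
        j += 1


words = [
    WordTiming(' "', [1]),
    WordTiming("Hello", [2]),
    WordTiming(",", [3]),
    WordTiming(" world", [4]),
    WordTiming(".", [5]),
]
merge_punctuations_sketch(words, prepended="\"'“¿([{-", appended="\"'.。,，!！?？:：”)]}、")
merged = [w.word for w in words]
print(merged)
```

The emptied entries stay in the list so that indices remain stable during the merge; they are meant to be skipped by whatever consumes the list afterwards.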

add_word_timestamps

Source: whisper/timing.py:L279-388

Orchestrates the full word-level timestamp pipeline:

Alignment: Collects the text tokens of all segments and calls find_alignment() once with those tokens and the mel spectrogram.
Duration heuristics: Computes the median word duration, takes twice that value as the maximum plausible duration, and truncates over-long words at sentence boundaries to that maximum.
Punctuation merging: Calls merge_punctuations() with the configured punctuation sets.
Filtering: Skips the empty entries left behind by punctuation merging.
Segment distribution: Distributes the aligned words across segments by token count, adjusting word boundaries at segment transitions.
In-place update: Sets the "words" key on each segment dictionary to that segment's list of word timing entries.
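
The duration heuristic and the segment distribution step can be sketched on plain dictionaries as below. This is a rough illustration under stated assumptions, not the library's implementation: clamp_long_words and distribute_words are hypothetical names, the 0.7 second cap on the median and the set of sentence-ending marks reflect one reading of the upstream source and should be checked against whisper/timing.py, and the time offset, rounding, probabilities, and segment-boundary adjustments are left out.

```python
from statistics import median
from typing import Dict, List


def clamp_long_words(alignment: List[Dict], cap: float = 0.7) -> float:
    """Truncate over-long words next to sentence-ending marks; return the limit used."""
    durations = [w["end"] - w["start"] for w in alignment if w["end"] > w["start"]]
    if not durations:
        return 0.0
    # Twice the (capped) median is treated as the longest plausible word.
    max_duration = 2 * min(cap, median(durations))
    sentence_end_marks = ".。!！?？"  # assumed set of marks
    for i in range(1, len(alignment)):
        current, before = alignment[i], alignment[i - 1]
        if current["end"] - current["start"] > max_duration:
            if current["word"] in sentence_end_marks:
                current["end"] = current["start"] + max_duration
            elif before["word"] in sentence_end_marks:
                current["start"] = current["end"] - max_duration
    return max_duration


def distribute_words(segments: List[Dict], alignment: List[Dict]) -> None:
    """Assign aligned words to segments by counting tokens; skip emptied entries."""
    word_index = 0
    for segment in segments:
        saved_tokens = 0
        words = []
        while word_index < len(alignment) and saved_tokens < len(segment["tokens"]):
            timing = alignment[word_index]
            if timing["word"]:  # entries emptied by punctuation merging are dropped
                words.append({k: timing[k] for k in ("word", "start", "end")})
            saved_tokens += len(timing["tokens"])
            word_index += 1
        segment["words"] = words


# Illustrative data only: "." is followed by an implausibly long word.
alignment = [
    {"word": " Hi", "tokens": [1], "start": 0.0, "end": 0.25},
    {"word": ".", "tokens": [2], "start": 0.25, "end": 0.5},
    {"word": " there", "tokens": [3], "start": 0.5, "end": 3.5},
    {"word": "", "tokens": [], "start": 3.5, "end": 3.5},
    {"word": " friend", "tokens": [4], "start": 3.5, "end": 3.75},
]
segments = [{"tokens": [1, 2]}, {"tokens": [3, 4]}]
limit = clamp_long_words(alignment)
distribute_words(segments, alignment)
print(limit, [[w["word"] for w in s["words"]] for s in segments])
```

Counting tokens rather than words is what lets a single alignment over all segments be split back into per-segment lists, since each segment already knows how many text tokens it contributed.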

Example Usage

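from whisper.timing import WordTiming, merge_punctuations

# Word texts as produced by alignment; a leading space marks the start of a new word.
texts = [' "', "Hello", ",", " world", "."]
alignment = [WordTiming(word=t, tokens=[i], start=0.0, end=0.0, probability=1.0) for i, t in enumerate(texts)]
merge_punctuations(alignment, prepended='"¿([{-', appended="'.。,!?:")
print([w.word for w in alignment])
# ['', ' "Hello,', '', ' world.', '']  (empty entries are skipped later)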

Full Pipeline Example

from whisper.timing import add_word_timestamps

# segments from transcription result
add_word_timestamps(
    segments=result["segments"],
    model=model,
    tokenizer=tokenizer,
    mel=mel,
    num_frames=num_frames,
    last_speech_timestamp=last_speech_ts,
)

# Each segment now has a "words" key
for segment in result["segments"]:
    for word_info in segment["words"]:
        print(f"[{word_info['start']:.2f}-{word_info['end']:.2f}] {word_info['word']}")
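
For downstream code it can be convenient to flatten the per-segment word lists into a single stream. The helper below is a small sketch of that idea; iter_words is a hypothetical name, and the segment data is hand-written for illustration rather than real transcription output. It also tolerates segments that have no "words" key.

```python
from typing import Dict, Iterator, List


def iter_words(segments: List[Dict]) -> Iterator[Dict]:
    """Yield word dicts from every segment, skipping segments without a "words" key."""
    for segment in segments:
        yield from segment.get("words", [])


# Hand-written stand-in for result["segments"]; the values are illustrative only.
fake_segments = [
    {
        "text": " Hello, world.",
        "words": [
            {"word": " Hello,", "start": 0.0, "end": 0.42},
            {"word": " world.", "start": 0.42, "end": 0.9},
        ],
    },
    {"text": " (segment without word timestamps)"},
]
lines = [f"[{w['start']:.2f}-{w['end']:.2f}]{w['word']}" for w in iter_words(fake_segments)]
print("\n".join(lines))
```

Each word's text keeps its leading space, so no extra separator is added between the time range and the word.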

Links

Principle:Openai_Whisper_Punctuation_Merging
Heuristic:Openai_Whisper_Median_Word_Duration_Clamping

Metadata

2025-06-25 00:00 GMT
