Implementation:Openai Whisper Merge Punctuations
Overview
merge_punctuations() and add_word_timestamps() are functions in Whisper's timing module. merge_punctuations() performs a two-pass, in-place merge of standalone punctuation into adjacent words. add_word_timestamps() is the high-level function that runs the alignment pipeline and distributes the resulting word timestamps across segments.
Source
- File: whisper/timing.py, lines 245-388
- Repository: https://github.com/openai/whisper
- Import: from whisper.timing import merge_punctuations, add_word_timestamps
Signatures
merge_punctuations
def merge_punctuations(alignment: List[WordTiming], prepended: str, appended: str) -> None:
add_word_timestamps
def add_word_timestamps(
    *,
    segments: List[dict],
    model: "Whisper",
    tokenizer: Tokenizer,
    mel: torch.Tensor,
    num_frames: int,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,,!!??::”)]}、",
    last_speech_timestamp: float,
    **kwargs,
) -> None:
Parameters
merge_punctuations
| Parameter | Type | Description |
|---|---|---|
| alignment | List[WordTiming] | Word timing entries to modify in-place |
| prepended | str | Punctuation characters to merge with the next word (e.g., "'“¿([{-) |
| appended | str | Punctuation characters to merge with the previous word (e.g., "'.。,,!!??::”)]}、) |
add_word_timestamps
| Parameter | Type | Description |
|---|---|---|
| segments | List[dict] | Segment dictionaries to add word timestamps to (modified in-place) |
| model | Whisper | Whisper model instance for running alignment |
| tokenizer | Tokenizer | Tokenizer for token processing |
| mel | torch.Tensor | Mel spectrogram of the full audio |
| num_frames | int | Total number of audio frames |
| prepend_punctuations | str | Leading punctuation to merge forward (default: "'“¿([{-) |
| append_punctuations | str | Trailing punctuation to merge backward (default: "'.。,,!!??::”)]}、) |
| last_speech_timestamp | float | Last detected speech timestamp for duration heuristics |
Return Value
Both functions return None. They modify their inputs in-place.
Behavior
merge_punctuations
Source: whisper/timing.py:L245-277
Performs a two-pass merge:
Pass 1 (reverse iteration -- leading punctuation):
- Iterates through the alignment list in reverse order.
- If a word consists entirely of prepended punctuation characters, merges it into the following word.
- The merge concatenates the punctuation word's text onto the front of the next word and prepends its tokens.
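- The punctuation entry is then emptied (its word is set to an empty string and its token list is cleared); it stays in the list, so the list length does not change.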
Pass 2 (forward iteration -- trailing punctuation):
- Iterates through the alignment list forward.
- If a word consists entirely of appended punctuation characters, merges it into the preceding word.
- The merge appends the punctuation text to the preceding word and extends its token list.
- The punctuation entry is then emptied in the same way (empty word, empty token list).
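The two passes described above can be sketched without importing Whisper. This is a minimal illustration, not the library's code: WordTiming here is a local stand-in with only word and tokens fields, merge_punctuations_sketch is a hypothetical name, and the leading-space check assumes that word strings starting a new word carry a leading space (as in the sample input).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordTiming:
    # Local stand-in for illustration only; not imported from whisper.
    word: str
    tokens: List[int] = field(default_factory=list)


def merge_punctuations_sketch(
    alignment: List[WordTiming], prepended: str, appended: str
) -> None:
    # Pass 1: walk backwards, folding leading punctuation into the next word.
    i, j = len(alignment) - 2, len(alignment) - 1
    while i >= 0:
        previous, following = alignment[i], alignment[j]
        # Assumption: a standalone leading mark looks like ' "' (space + mark).
        if previous.word.startswith(" ") and previous.word.strip() in prepended:
            following.word = previous.word + following.word
            following.tokens = previous.tokens + following.tokens
            previous.word, previous.tokens = "", []
        else:
            j = i
        i -= 1

    # Pass 2: walk forwards, folding trailing punctuation into the previous word.
    i, j = 0, 1
    while j < len(alignment):
        previous, following = alignment[i], alignment[j]
        if not previous.word.endswith(" ") and following.word in appended:
            previous.word = previous.word + following.word
            previous.tokens = previous.tokens + following.tokens
            following.word, following.tokens = "", []
        else:
            i = j
        j += 1


# Sample input: one token id per word, purely illustrative.
words = [' "', "Hello", ",", " world", "."]
alignment = [WordTiming(w, [n]) for n, w in enumerate(words)]
merge_punctuations_sketch(alignment, prepended='"¿([{-', appended=".,!?:")

merged_words = [t.word for t in alignment]
merged_tokens = [t.tokens for t in alignment]
# merged_words == ['', ' "Hello,', '', ' world.', '']
```

The list keeps its five entries: the punctuation entries are emptied rather than removed, which is why a later filtering step is needed.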
add_word_timestamps
Source: whisper/timing.py:L279-388
Orchestrates the full word-level timestamp pipeline:
- Alignment: Collects the text tokens of the segments and calls find_alignment() with those tokens and the mel spectrogram.
- Duration heuristics: Computes the median word duration and treats words longer than twice the median as anomalous, adjusting their end times.
- Punctuation merging: Calls merge_punctuations() with the configured punctuation sets.
- Filtering: Skips the empty entries left behind by punctuation merging.
- Segment distribution: Distributes the aligned words across segments, adjusting word boundaries at segment transitions.
- In-place update: Sets the "words" key on each segment dictionary with the word timing data.
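The duration heuristic and the filtering step above can be illustrated with a simplified sketch. It implements only what the list states (cap a word at twice the median duration, then drop emptied entries); the real function may apply further conditions. The helper names clamp_long_words and drop_empty and the plain-dict word layout are assumptions made for this example.

```python
from statistics import median
from typing import Dict, List


def clamp_long_words(words: List[Dict], factor: float = 2.0) -> None:
    """Shorten, in place, any word lasting more than factor x the median duration."""
    # Zero-length entries are ignored when computing the median.
    durations = [w["end"] - w["start"] for w in words if w["end"] > w["start"]]
    if not durations:
        return
    max_duration = factor * median(durations)
    for w in words:
        if w["end"] - w["start"] > max_duration:
            w["end"] = w["start"] + max_duration


def drop_empty(words: List[Dict]) -> List[Dict]:
    """Remove entries whose text was emptied by punctuation merging."""
    return [w for w in words if w["word"]]


# Illustrative data: the last word is implausibly long (3 s vs. a 0.5 s median).
words = [
    {"word": " Hello", "start": 0.0, "end": 0.5},
    {"word": "", "start": 0.5, "end": 0.5},
    {"word": " world.", "start": 0.5, "end": 1.0},
    {"word": " Next", "start": 1.0, "end": 4.0},
]
clamp_long_words(words)
kept = drop_empty(words)
```

Here the median of the non-zero durations is 0.5 s, so the cap is 1.0 s and only the last word is shortened; the emptied entry is then dropped.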
Example Usage
from whisper.timing import merge_punctuations
# alignment is a list of WordTiming objects whose words are, for example:
# [' "', 'Hello', ',', ' world', '.']
merge_punctuations(alignment, prepended='"¿([{-', appended="'.。,,!!??::")
# Words after the in-place merge (the list length is unchanged):
# ['', ' "Hello,', '', ' world.', '']  (empty entries are filtered later)
Full Pipeline Example
from whisper.timing import add_word_timestamps
# segments from transcription result
add_word_timestamps(
    segments=result["segments"],
    model=model,
    tokenizer=tokenizer,
    mel=mel,
    num_frames=num_frames,
    last_speech_timestamp=last_speech_ts,
)
# Each segment now has a "words" key
for segment in result["segments"]:
    for word_info in segment["words"]:
        print(f"[{word_info['start']:.2f}-{word_info['end']:.2f}] {word_info['word']}")
Links
- Principle:Openai_Whisper_Punctuation_Merging
- Heuristic:Openai_Whisper_Median_Word_Duration_Clamping