Principle:Openai Whisper Punctuation Merging
Overview
Punctuation Merging is a post-processing technique that merges standalone punctuation tokens with their adjacent words to produce cleaner word-level timestamps. After word boundary detection, punctuation marks may appear as separate "words" with their own timestamps. For a better user experience, leading punctuation should be merged with the following word, and trailing punctuation should be merged with the preceding word.
Domain
- Natural Language Processing
- Text Processing
The Problem
After subword-to-word grouping, punctuation marks often end up as isolated single-character "words":
Before merging:
[0.00-0.50] " (opening quote)
[0.50-1.20] Hello
[1.20-1.30] , (comma)
[1.30-1.80] world
[1.80-1.90] . (period)
[1.90-2.00] " (closing quote)
These standalone punctuation entries are undesirable because:
- They clutter word-level output with entries that carry no spoken content.
- Their timestamps are often unreliable since punctuation has no acoustic realization.
- Subtitle and display systems expect punctuation to be attached to words.
Merging Strategy
The merging follows two rules based on punctuation type:
Prepended (Leading) Punctuation
Punctuation that logically precedes a word should be merged forward with the following word. Common examples:
| Character | Name | Example |
|---|---|---|
| " | Double quote (opening) | "Hello |
| ' | Single quote (opening) | 'world |
| « | Left guillemet | «bonjour |
| ¿ | Inverted question mark | ¿Cómo |
| ( | Left parenthesis | (note) |
| [ | Left bracket | [ref] |
| { | Left brace | {text} |
| - | Hyphen/dash | -interrupted |
Appended (Trailing) Punctuation
Punctuation that logically follows a word should be merged backward with the preceding word. Common examples:
| Character | Name | Example |
|---|---|---|
| " | Double quote (closing) | world" |
| ' | Single quote (closing) | world' |
| . | Period | world. |
| , | Comma | world, |
| ! | Exclamation mark | world! |
| ? | Question mark | world? |
| : | Colon | world: |
| ) | Right parenthesis | (note) |
| ] | Right bracket | [ref] |
| } | Right brace | {text} |
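The two tables can be expressed as character sets plus a small classifier. The following is a minimal sketch, not a reference implementation: the names `PREPENDED`, `APPENDED`, and `classify` are hypothetical, and the sets contain only the characters listed in the tables above, which may differ from the defaults of any particular library.

```python
# Illustrative character sets built from the two tables above (hypothetical names).
PREPENDED = "\"'«¿([{-"
APPENDED = "\"'.,!?:)]}"


def classify(word: str) -> str:
    """Classify a word entry as 'leading', 'trailing', 'ambiguous', or 'word'."""
    text = word.strip()
    if not text:
        return "word"  # empty entries are not punctuation
    is_leading = all(ch in PREPENDED for ch in text)
    is_trailing = all(ch in APPENDED for ch in text)
    if is_leading and is_trailing:
        return "ambiguous"  # straight quotes appear in both tables
    if is_leading:
        return "leading"
    if is_trailing:
        return "trailing"
    return "word"
```

Straight quotes classify as ambiguous because they appear in both sets, so the character alone cannot decide the merge direction; some positional cue, such as whether the entry begins a new whitespace-separated unit, is needed to resolve it.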
Two-Pass Algorithm
The merging is performed in two passes:
- Reverse pass (leading punctuation): Iterate through the word list in reverse. If a word consists entirely of prepended punctuation characters, merge it into the following word by concatenating the text and token lists, and mark the punctuation entry as empty.
- Forward pass (trailing punctuation): Iterate through the word list forward. If a word consists entirely of appended punctuation characters, merge it into the preceding word by concatenating the text and token lists, and mark the punctuation entry as empty.
After both passes, empty entries are filtered out.
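The two passes can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the library's code: `Word`, `merge_punctuation`, `PREPENDED`, and `APPENDED` are hypothetical names, the token IDs are placeholders, and the character sets contain only the characters from the tables above. It also assumes that an entry starting a new whitespace-separated unit carries a leading space, which is how the sketch tells an opening straight quote from a closing one. Widening the time span of the merged word is one reading of how merged timing is handled; an actual implementation may treat timing differently.

```python
from dataclasses import dataclass, field
from typing import List

PREPENDED = "\"'«¿([{-"   # assumed set, taken from the leading table
APPENDED = "\"'.,!?:)]}"  # assumed set, taken from the trailing table


@dataclass
class Word:
    text: str
    start: float
    end: float
    tokens: List[int] = field(default_factory=list)


def _all_in(text: str, charset: str) -> bool:
    """True if text is non-empty and every character is in charset."""
    return bool(text) and all(ch in charset for ch in text)


def merge_punctuation(words: List[Word]) -> List[Word]:
    # Reverse pass: fold leading punctuation into the following word.
    j = len(words) - 1
    for i in range(len(words) - 2, -1, -1):
        prev, nxt = words[i], words[j]
        # Assumption: a leading space marks the start of a new unit.
        if prev.text.startswith(" ") and _all_in(prev.text.strip(), PREPENDED):
            nxt.text = prev.text + nxt.text
            nxt.tokens = prev.tokens + nxt.tokens
            nxt.start = min(prev.start, nxt.start)
            prev.text, prev.tokens = "", []  # mark as empty
        else:
            j = i

    # Forward pass: fold trailing punctuation into the preceding word.
    i = 0
    for j in range(1, len(words)):
        prev, nxt = words[i], words[j]
        if prev.text and not prev.text.endswith(" ") and _all_in(nxt.text, APPENDED):
            prev.text = prev.text + nxt.text
            prev.tokens = prev.tokens + nxt.tokens
            prev.end = max(prev.end, nxt.end)
            nxt.text, nxt.tokens = "", []  # mark as empty
        else:
            i = j

    # Drop the entries that were emptied by either pass.
    return [w for w in words if w.text]


# The six entries from "The Problem", with placeholder token IDs.
words = [
    Word(' "', 0.00, 0.50, [1]),
    Word("Hello", 0.50, 1.20, [2]),
    Word(",", 1.20, 1.30, [3]),
    Word(" world", 1.30, 1.80, [4]),
    Word(".", 1.80, 1.90, [5]),
    Word('"', 1.90, 2.00, [6]),
]
for w in merge_punctuation(words):
    print(f"[{w.start:.2f}-{w.end:.2f}] {w.text.strip()}")
```

Running the leading pass in reverse lets a run of consecutive opening marks collapse into the same following word, because the index of the following word only moves when a non-punctuation entry is seen. The forward pass handles runs of closing marks the same way.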
Result
After merging:
[0.00-1.30] "Hello,
[1.30-2.00] world."
The merged words inherit the combined token lists and the timing boundaries of their constituent parts.
Duration Anomaly Handling
In addition to punctuation merging, the word-level timestamp pipeline applies a duration heuristic to handle anomalies at sentence boundaries. A word whose duration exceeds a threshold (typically twice the median word duration) is treated as anomalous, and its end time is pulled in so that its duration stays within that threshold. This prevents the unreasonably long words that can otherwise appear at segment boundaries.
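A minimal sketch of such a clamp is shown below. The function name `clamp_long_words`, the `(text, start, end)` tuple layout, and the `factor` parameter are hypothetical. For simplicity the sketch clamps every over-long word rather than only those next to a sentence boundary, and it adjusts only the end time.

```python
from statistics import median
from typing import List, Tuple

Span = Tuple[str, float, float]  # (text, start, end); hypothetical layout


def clamp_long_words(spans: List[Span], factor: float = 2.0) -> List[Span]:
    """Shorten words whose duration exceeds factor * median word duration."""
    # Ignore zero-length entries when estimating the typical duration.
    durations = [end - start for _, start, end in spans if end > start]
    if not durations:
        return list(spans)
    max_duration = factor * median(durations)

    clamped = []
    for text, start, end in spans:
        if end - start > max_duration:
            end = start + max_duration  # pull the end time in
        clamped.append((text, start, end))
    return clamped
```

For example, with three words of 0.25 s each followed by one lasting 2.25 s, the median is 0.25 s, the threshold is 0.5 s, and only the last word is shortened.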
Implementation
- Implementation:Openai_Whisper_Merge_Punctuations
- Heuristic:Openai_Whisper_Median_Word_Duration_Clamping