Principle:Openai Whisper Word Level Subtitle Output
Overview
Word-Level Subtitle Output is the process of generating subtitle files with per-word timing cues and optional word-level highlighting. Standard subtitle formats such as WebVTT and SRT can display text with word-level timing, enabling karaoke-style highlighting where each word is visually emphasized (e.g., underlined) as it is spoken. This requires iterating over word-level timestamps, managing line breaks based on character width and line count constraints, and optionally inserting HTML markup for the currently-spoken word.
Domain
- Speech Recognition
- Subtitle Generation
Subtitle Formats
WebVTT (Web Video Text Tracks)
WebVTT is a W3C standard format for displaying timed text tracks in web browsers. It supports:
- Cue timing with millisecond precision
- HTML-like formatting tags, including <u> (underline), <b> (bold), and <i> (italic)
- Positioning and alignment directives
- Word-level highlighting via multiple cues with formatting tags
Example with word highlighting:
WEBVTT

00:00.000 --> 00:00.500
<u>Hello</u> world how are you

00:00.500 --> 00:00.800
Hello <u>world</u> how are you

00:00.800 --> 00:01.100
Hello world <u>how</u> are you
SRT (SubRip Text)
SRT is a widely supported subtitle format with simpler syntax:
1
00:00:00,000 --> 00:00:02,400
Hello world how are you
SRT has no formal specification, but many players also render basic HTML-style tags such as <b>, <i>, and <u>.
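The two formats differ mainly in their timestamp syntax: WebVTT uses a period before the milliseconds and may omit the hours field, while SRT uses a comma and always writes hours. The following is a minimal sketch of a formatter that covers both cases. The function names and parameters (`format_timestamp`, `vtt_cue`, `srt_cue`) are illustrative assumptions chosen for this example, not a definitive implementation.

```python
def format_timestamp(seconds: float,
                     always_include_hours: bool = False,
                     decimal_marker: str = ".") -> str:
    """Render a time in seconds as [HH:]MM:SS<marker>mmm."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    # Work in integer milliseconds to avoid floating-point drift.
    milliseconds = round(seconds * 1000.0)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def vtt_cue(start: float, end: float, text: str) -> str:
    """One WebVTT cue: timing line, then the cue text."""
    return f"{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One SRT entry: 1-based index, timing line with commas, then the text."""
    begin = format_timestamp(start, always_include_hours=True, decimal_marker=",")
    finish = format_timestamp(end, always_include_hours=True, decimal_marker=",")
    return f"{index}\n{begin} --> {finish}\n{text}\n"


if __name__ == "__main__":
    print("WEBVTT\n")
    print(vtt_cue(0.0, 2.4, "Hello world how are you"))
    print(srt_cue(1, 0.0, 2.4, "Hello world how are you"))
```

Keeping the formatter separate from the cue writers means the same timing data can be emitted in either format by changing only two arguments.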
Line Breaking Strategy
When generating subtitles from word-level timestamps, several constraints govern how text is split across subtitle blocks:
Max Line Width
Each subtitle line should not exceed a maximum character width (e.g., 42 characters) to ensure readability. When adding a word would exceed this width, a new line is started.
Max Line Count
Each subtitle block should not contain more than a configurable number of lines (typically 2-3). When the line count would be exceeded, a new subtitle block is started.
Max Words Per Line
An optional constraint limiting the number of words per line, useful for ensuring consistent pacing.
Pause-Based Breaking
Long pauses between words (e.g., more than 3 seconds) trigger a new subtitle block, even if width and count limits have not been reached. This aligns subtitle display with natural speech pauses.
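The four constraints above can be combined in a single pass over the words. The sketch below is one possible way to do it, under stated assumptions: the `Word` dataclass, the `group_words` and `block_text` functions, and the default values are hypothetical names and settings chosen for illustration, and words are joined with single spaces.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def group_words(words: Iterable[Word],
                max_line_width: int = 42,
                max_line_count: int = 2,
                max_words_per_line: Optional[int] = None,
                pause_threshold: float = 3.0) -> List[List[List[Word]]]:
    """Group words into blocks; each block is a list of lines of words."""
    blocks: List[List[List[Word]]] = []
    block: List[List[Word]] = [[]]
    prev_end: Optional[float] = None

    for word in words:
        line = block[-1]
        # Width of the current line if this word were appended to it.
        width_if_added = len(" ".join(w.text for w in line + [word]))
        long_pause = prev_end is not None and word.start - prev_end > pause_threshold
        line_full = bool(line) and (
            width_if_added > max_line_width
            or (max_words_per_line is not None and len(line) >= max_words_per_line)
        )
        if line and long_pause:
            # A long silence always starts a new block.
            blocks.append(block)
            block = [[]]
        elif line_full:
            if len(block) >= max_line_count:
                blocks.append(block)   # block is full: start a new one
                block = [[]]
            else:
                block.append([])       # room left: start a new line
        block[-1].append(word)
        prev_end = word.end

    if block[-1]:
        blocks.append(block)
    return blocks


def block_text(block: List[List[Word]]) -> str:
    """Join a block's lines into displayable subtitle text."""
    return "\n".join(" ".join(w.text for w in line) for line in block)
```

A word that is wider than `max_line_width` on its own is still placed on a line by itself rather than dropped, which is a design choice of this sketch.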
Word-Level Highlighting
When word-level highlighting is enabled:
- For each word in a subtitle block, a separate cue is generated.
- In each cue, the currently-spoken word is wrapped in <u> (underline) tags.
- All other words in the same block appear without formatting.
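- The cue timing spans from the current word's start time to that word's end time, so only one cue is active at a time; a gap between two words is covered by a cue that shows the block without highlighting.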
This creates a karaoke effect where the underline advances from word to word as the audio plays.
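The highlighting pass can be sketched as a generator that walks the words of one block. This example assumes each highlighted cue ends when its word ends, so that no two cues are active at once, and that silent gaps are filled with an unhighlighted cue. The `WordTiming` tuple layout and the `highlight_cues` name are hypothetical, and the block is treated as a single line for brevity.

```python
from typing import Iterator, List, Tuple

WordTiming = Tuple[str, float, float]   # (text, start, end), illustrative layout
Cue = Tuple[float, float, str]          # (start, end, cue text)


def highlight_cues(block: List[WordTiming]) -> Iterator[Cue]:
    """Yield one underlined cue per word, plus plain cues for silent gaps."""
    if not block:
        return
    plain = " ".join(text for text, _, _ in block)
    last_end = block[0][1]
    for i, (_, start, end) in enumerate(block):
        if start > last_end:
            # Nothing is being spoken: show the block without highlighting.
            yield last_end, start, plain
        marked = " ".join(
            f"<u>{text}</u>" if j == i else text
            for j, (text, _, _) in enumerate(block)
        )
        yield start, end, marked
        last_end = end


if __name__ == "__main__":
    demo = [("Hello", 0.0, 0.4), ("world", 0.5, 0.8), ("how", 0.8, 1.1)]
    for start, end, text in highlight_cues(demo):
        print(f"{start:.3f} --> {end:.3f}  {text}")
```

Emitting the plain cue during gaps keeps the block visible on screen between words instead of letting it flicker off.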
Fallback Behavior
If word-level timestamp data is not available in the transcription result (e.g., word_timestamps was not enabled during transcription), the subtitle writer falls back to segment-level timing, where each segment becomes a single subtitle block with the segment's start and end times.
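The fallback decision can be illustrated with a small dispatcher. This sketch assumes a result dictionary with a "segments" list whose entries carry "start", "end", and "text", plus an optional "words" list of entries with "word", "start", and "end". The `iterate_cues` name is hypothetical, and the word-level branch simply emits one cue per word as a stand-in for the full line-breaking logic.

```python
from typing import Any, Dict, Iterator, Tuple

Cue = Tuple[float, float, str]  # (start, end, text)


def iterate_cues(result: Dict[str, Any]) -> Iterator[Cue]:
    """Use word timings when present; otherwise fall back to segments."""
    segments = result.get("segments", [])
    if segments and "words" in segments[0]:
        # Word-level path (simplified here to one cue per word).
        for segment in segments:
            for word in segment["words"]:
                yield word["start"], word["end"], word["word"].strip()
    else:
        # Fallback: each segment becomes a single subtitle block.
        for segment in segments:
            yield segment["start"], segment["end"], segment["text"].strip()


if __name__ == "__main__":
    no_words = {"segments": [{"start": 0.0, "end": 2.4,
                              "text": " Hello world how are you"}]}
    for cue in iterate_cues(no_words):
        print(cue)
```

Checking only the first segment for a "words" key is an assumption of this sketch; a stricter version could verify every segment.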
Implementation
Implementation:Openai_Whisper_SubtitlesWriter_Iterate_Result