Principle:Openai Whisper Word Level Subtitle Output

Overview

Word-Level Subtitle Output is the process of generating subtitle files whose cues follow per-word timestamps, optionally highlighting each word as it is spoken. In WebVTT and SRT this is achieved by emitting a sequence of short cues that repeat the same text, with the currently spoken word wrapped in formatting markup (e.g., underlined), which produces a karaoke-style effect. Doing so requires iterating over word-level timestamps, breaking lines according to character-width and line-count constraints, and optionally inserting HTML-style markup around the currently spoken word.

Domain

Speech Recognition
Subtitle Generation

Subtitle Formats

WebVTT (Web Video Text Tracks)

WebVTT is a W3C standard format for displaying timed text tracks in web browsers. It supports:

Cue timing with millisecond precision
HTML-like formatting tags including <u> (underline), <b> (bold), <i> (italic)
Positioning and alignment directives
Word-level highlighting via multiple cues with formatting tags

Example with word highlighting:

WEBVTT

00:00.000 --> 00:00.500
<u>Hello</u> world how are you

00:00.500 --> 00:00.800
Hello <u>world</u> how are you

00:00.800 --> 00:01.100
Hello world <u>how</u> are you

SRT (SubRip Text)

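SRT is a widely supported subtitle format with simpler syntax. Each cue consists of a sequential index, a timing line that always includes hours and uses a comma as the decimal separator, and the cue text: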

1
00:00:00,000 --> 00:00:02,400
Hello world how are you

SRT has no formal specification, but many players honor basic HTML-style tags such as <u>, <b>, and <i> for formatting.
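The difference between the two formats comes down mostly to the timestamp syntax and the cue header. The following is a minimal sketch of rendering the same cues in both formats. The names format_timestamp, render_vtt, and render_srt, and the (start, end, text) tuple layout, are assumptions made for illustration; this is not presented as the library's own code.

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text); hypothetical layout


def format_timestamp(seconds: float, always_include_hours: bool = False,
                     decimal_marker: str = ".") -> str:
    """Format seconds as [HH:]MM:SS<marker>mmm."""
    if seconds < 0:
        raise ValueError("timestamp must be non-negative")
    milliseconds = round(seconds * 1000.0)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    # Hours are optional in WebVTT but always written in SRT.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def render_vtt(cues: Iterable[Cue]) -> str:
    """Render cues as WebVTT: a header line, then timing and text per cue."""
    parts = ["WEBVTT", ""]
    for start, end, text in cues:
        parts += [f"{format_timestamp(start)} --> {format_timestamp(end)}", text, ""]
    return "\n".join(parts)


def render_srt(cues: Iterable[Cue]) -> str:
    """Render cues as SRT: a 1-based index, timing with comma decimals, text."""
    parts = []
    for index, (start, end, text) in enumerate(cues, start=1):
        s = format_timestamp(start, always_include_hours=True, decimal_marker=",")
        e = format_timestamp(end, always_include_hours=True, decimal_marker=",")
        parts += [str(index), f"{s} --> {e}", text, ""]
    return "\n".join(parts)
```

Keeping the cue data format-neutral and applying the format only at render time means the line-breaking and highlighting logic can be shared between both writers.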

Line Breaking Strategy

When generating subtitles from word-level timestamps, several constraints govern how text is split across subtitle blocks:

Max Line Width

Each subtitle line should not exceed a maximum character width (e.g., 42 characters) to ensure readability. When adding a word would exceed this width, a new line is started.

Max Line Count

Each subtitle block should contain no more than a configurable number of lines (typically 2 or 3). When adding another line would exceed this count, a new subtitle block is started.

Max Words Per Line

An optional constraint limits the number of words per line, which helps keep pacing consistent.

Pause-Based Breaking

Long pauses between words (e.g., more than 3 seconds) trigger a new subtitle block, even if width and count limits have not been reached. This aligns subtitle display with natural speech pauses.
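The four constraints above can be combined in a single pass over the words. The sketch below is one possible way to do it, not the library's implementation: the function name group_words, its parameter names and defaults, and the choice to strip each word and count one space between words are all assumptions. Each word is assumed to be a dict with "word", "start", and "end" keys.

```python
from typing import Dict, List, Optional

Word = Dict[str, object]   # assumed keys: "word" (str), "start" and "end" (float)
Block = List[List[Word]]   # a block is a list of lines; a line is a list of words


def group_words(words: List[Word], max_line_width: int = 42, max_line_count: int = 2,
                max_words_per_line: Optional[int] = None,
                pause_threshold: float = 3.0) -> List[Block]:
    """Split words into subtitle blocks using width, count, and pause rules."""
    blocks: List[Block] = []
    lines: Block = [[]]    # lines of the block currently being built
    line_len = 0           # characters on the current line, including spaces
    last_end = None        # end time of the previously placed word
    for w in words:
        text = str(w["word"]).strip()
        current = lines[-1]
        long_pause = last_end is not None and w["start"] - last_end > pause_threshold
        too_wide = bool(current) and line_len + 1 + len(text) > max_line_width
        too_many = max_words_per_line is not None and len(current) >= max_words_per_line
        if long_pause:
            # A long silence always closes the block, regardless of other limits.
            blocks.append(lines)
            lines, line_len = [[]], 0
        elif too_wide or too_many:
            if len(lines) >= max_line_count:
                blocks.append(lines)   # block is full: start a new block
                lines = [[]]
            else:
                lines.append([])       # room left: start a new line in this block
            line_len = 0
        line_len += len(text) + (1 if lines[-1] else 0)
        lines[-1].append(w)
        last_end = w["end"]
    if lines[-1]:
        blocks.append(lines)
    return blocks


def block_text(block: Block) -> str:
    """Join a block's lines with newlines and its words with single spaces."""
    return "\n".join(" ".join(str(w["word"]).strip() for w in line) for line in block)
```

A word that is wider than max_line_width on its own is still placed, alone on its line, rather than being dropped or split.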

Word-Level Highlighting

When word-level highlighting is enabled:

For each word in a subtitle block, a separate cue is generated.
In each cue, the currently-spoken word is wrapped in <u> (underline) tags.
All other words in the same block appear without formatting.
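The cue timing spans from the current word's start time to that word's end time, so consecutive cues do not overlap; any gap before the next word is covered by a cue that shows the block without highlighting.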

This creates a karaoke effect where the underline advances from word to word as the audio plays.
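As a rough illustration of the highlighting step, the sketch below turns one subtitle block into a sequence of cues. In this sketch each cue ends where the highlighted word ends, so that cues never overlap on screen, and silent gaps between words get an unhighlighted cue. That timing choice, the function name highlight_cues, and the word dict layout ("word", "start", "end") are assumptions of the example.

```python
from typing import Dict, Iterator, List, Tuple

Word = Dict[str, object]  # assumed keys: "word" (str), "start" and "end" (float)


def highlight_cues(block: List[Word]) -> Iterator[Tuple[float, float, str]]:
    """Yield (start, end, text) cues that underline one word at a time."""
    if not block:
        return
    texts = [str(w["word"]).strip() for w in block]
    plain = " ".join(texts)
    last = block[0]["start"]
    for i, w in enumerate(block):
        if w["start"] > last:
            # Gap since the previous word ended: show the text unhighlighted.
            yield last, w["start"], plain
        marked = " ".join(f"<u>{t}</u>" if j == i else t for j, t in enumerate(texts))
        yield w["start"], w["end"], marked
        last = w["end"]
```

Because every cue repeats the full block text, the subtitle stays visually stable while only the underline moves.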

Fallback Behavior

If word-level timestamp data is not available in the transcription result (e.g., word_timestamps was not enabled during transcription), the subtitle writer falls back to segment-level timing, where each segment becomes a single subtitle block with the segment's start and end times.
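The fallback decision can be sketched as a check on the first segment. The function name iterate_cues is hypothetical, and the result layout (a "segments" list whose entries carry "start", "end", "text", and optionally "words") is assumed here. The word-level branch is deliberately simplified to one cue per word, standing in for the grouping and highlighting logic.

```python
from typing import Dict, Iterator, Tuple


def iterate_cues(result: Dict[str, list]) -> Iterator[Tuple[float, float, str]]:
    """Yield word-level cues when available, else one cue per segment."""
    segments = result.get("segments", [])
    # Word data is considered available only if the first segment carries it.
    has_words = bool(segments) and bool(segments[0].get("words"))
    for segment in segments:
        if has_words:
            # Placeholder for the word-level path: one cue per word.
            for w in segment.get("words", []):
                yield w["start"], w["end"], w["word"].strip()
        else:
            # Fallback: the whole segment becomes a single subtitle block.
            # "-->" is rewritten so cue text cannot be mistaken for a timing line.
            text = segment["text"].strip().replace("-->", "->")
            yield segment["start"], segment["end"], text
```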

Implementation

Implementation:Openai_Whisper_SubtitlesWriter_Iterate_Result

Metadata

2025-06-25 00:00 GMT
