Workflow: OpenAI Whisper Word-Level Timestamps

Knowledge Sources	OpenAI Whisper; Robust Speech Recognition via Large-Scale Weak Supervision
Domains	Speech_Recognition, Audio_Alignment, Subtitle_Generation
Last Updated	2025-06-25 00:00 GMT

Overview

End-to-end process for transcribing audio with precise word-level timing by combining Whisper transcription with cross-attention-based Dynamic Time Warping alignment.

Description

This workflow extends the standard Whisper transcription pipeline to produce a start and end timestamp for every recognized word. After performing segment-level transcription, it extracts cross-attention weights from specific alignment heads in the decoder, normalizes and median-filters them to reduce noise, then uses Dynamic Time Warping (DTW) to align text tokens to audio frames. The result is a transcript in which every word carries a start time, an end time, and a probability score, enabling applications like karaoke-style subtitles, audio search, and content editing.

Key capabilities:

Per-word start and end timestamps with probability scores
Hallucination detection and silent period skipping
Punctuation merging for natural word groupings
Word-highlighted subtitle generation (VTT/SRT with underline tags)
GPU-accelerated DTW via Triton kernels (with CPU fallback)

Usage

Execute this workflow when you need more than segment-level timestamps, such as when building subtitle files with per-word highlighting, aligning transcripts to audio for editing, or performing audio search with word-level precision. It requires passing word_timestamps=True to the transcription call. Note that word-level timestamps on translations may not be reliable.
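A minimal sketch of the call described above, assuming the openai-whisper package is installed and that audio_path points to a readable audio file. The helper name transcribe_with_words and the default model name are illustrative choices, not part of the source.

```python
def transcribe_with_words(audio_path, model_name="base"):
    """Return a flat list of word dicts (word, start, end, probability)."""
    # Imported lazily so this helper can be defined without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    # Each segment carries a "words" list when word_timestamps=True.
    return [
        word
        for segment in result["segments"]
        for word in segment.get("words", [])
    ]
```

Flattening the per-segment lists is only a convenience for downstream search or editing tools; the segment structure remains available in the raw result.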

Execution Steps

Step 1: Model Loading with Alignment Heads

Load the Whisper model and ensure alignment heads metadata is available. Alignment heads are specific cross-attention heads (identified by layer and head index) that have been found to correlate highly with word-level audio-text timing. For each official model, this metadata is stored as a gzip-compressed, base85-encoded boolean array with one flag per (layer, head) pair.

Key considerations:

Alignment heads are pre-computed and bundled with the model
Only official model variants have alignment head metadata
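load_model() calls set_alignment_heads() automatically for official models; other checkpoints fall back to using every head in the last half of the decoder layers unless set_alignment_heads() is called explicitly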
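The encoding can be illustrated with a small round trip. This is a sketch under stated assumptions: the mask is stored one byte per flag and gzip-compressed before base85 encoding, and the function names encode_head_mask, decode_head_mask, and head_indices are hypothetical helpers, not Whisper APIs.

```python
import base64
import gzip


def encode_head_mask(mask):
    """Pack a layers x heads boolean mask: one byte per flag, gzip, base85."""
    raw = bytes(int(flag) for row in mask for flag in row)
    return base64.b85encode(gzip.compress(raw))


def decode_head_mask(dump, n_layers, n_heads):
    """Inverse of encode_head_mask; returns a list of rows of booleans."""
    raw = gzip.decompress(base64.b85decode(dump))
    flags = [byte != 0 for byte in raw]
    if len(flags) != n_layers * n_heads:
        raise ValueError("mask size does not match the model dimensions")
    return [flags[i * n_heads:(i + 1) * n_heads] for i in range(n_layers)]


def head_indices(mask):
    """List the (layer, head) pairs whose flag is set."""
    return [
        (layer, head)
        for layer, row in enumerate(mask)
        for head, flag in enumerate(row)
        if flag
    ]


# Toy mask for a hypothetical 2-layer, 3-head decoder.
toy_mask = [[False, False, True], [True, False, False]]
toy_dump = encode_head_mask(toy_mask)
toy_heads = head_indices(decode_head_mask(toy_dump, 2, 3))
```

Storing the mask as a compact text string keeps the metadata next to the model name in source code without shipping a separate binary file.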

Step 2: Transcription with Cross-Attention Capture

Run the standard sliding-window transcription pipeline to obtain segment-level results. When word_timestamps is enabled, the transcription loop runs word-level timing extraction on each 30-second window immediately after that window is decoded.

What happens:

Standard transcription executes (mel spectrogram, language detection, decoding)
After each window is decoded, the segments are passed to the word timestamp extraction step
The mel segment for the current window is preserved for alignment

Step 3: Cross-Attention Weight Extraction

For each decoded window, re-run a forward pass through the model over the concatenated text tokens of its segments, with hooks installed on the cross-attention layers. These hooks capture the query-key attention matrices that reveal which audio frames correspond to which text tokens. Scaled-Dot-Product Attention (SDPA) is temporarily disabled so that explicit attention weights are computed rather than fused away inside the attention kernel.

What happens:

Forward hooks are registered on all cross-attention layers in the decoder
A forward pass with the transcribed tokens (without timestamps) captures attention matrices
Token probabilities are computed from the logits for word confidence scores
Only the alignment heads (not all attention heads) are selected for further processing
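The hook mechanics above can be sketched as follows. The code is duck-typed so it needs no framework import: it assumes each decoder block exposes a cross_attn module that supports register_forward_hook (as PyTorch modules do) and whose forward output is a tuple ending with the query-key scores. The names capture_cross_attention and run_forward are hypothetical.

```python
def capture_cross_attention(decoder_blocks, run_forward):
    """Run run_forward() while recording the attention scores of every block.

    Assumes block.cross_attn returns a tuple whose last element holds the
    query-key scores (an assumption about the model, labeled as such).
    """
    captured = [None] * len(decoder_blocks)

    def make_hook(index):
        def hook(module, inputs, outputs):
            captured[index] = outputs[-1]  # last tuple element: QK scores
        return hook

    handles = [
        block.cross_attn.register_forward_hook(make_hook(i))
        for i, block in enumerate(decoder_blocks)
    ]
    try:
        run_forward()
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks, even if the pass fails
    return captured
```

Removing the hooks in a finally block keeps later decoding passes free of the capture overhead. Selecting the alignment heads is then a matter of indexing the captured list by (layer, head).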

Step 4: Dynamic Time Warping Alignment

Normalize the extracted attention weights, apply median filtering to reduce noise, then compute a DTW alignment between text tokens and audio frames. This produces a monotonic mapping from tokens to time positions.

What happens:

Attention weights from alignment heads are stacked and trimmed to the actual audio frames
Softmax normalization, z-score standardization, and median filtering are applied
Weights are averaged across alignment heads to produce a single token-frame matrix
DTW finds the optimal monotonic alignment path through the cost matrix
GPU acceleration via Triton kernels when available, with CPU/numba fallback
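The DTW step can be sketched in pure Python. This is an illustrative CPU version, not the Triton or numba kernel: it takes a token-by-frame cost matrix (for example, the negated averaged attention weights) and returns the monotonic path. The softmax, z-score, and median-filter stages are not shown, and the tie-breaking rule (diagonal first, then up, then left) is a choice made for this sketch.

```python
def dtw_path(cost):
    """Return the lowest-cost monotonic path as (token_index, frame_index) pairs."""
    n_tokens, n_frames = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: best accumulated cost ending at token i-1, frame j-1.
    acc = [[inf] * (n_frames + 1) for _ in range(n_tokens + 1)]
    prev = [[None] * (n_frames + 1) for _ in range(n_tokens + 1)]
    acc[0][0] = 0.0

    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            diag, up, left = acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            if diag <= up and diag <= left:
                best, step = diag, (i - 1, j - 1)  # advance token and frame
            elif up <= left:
                best, step = up, (i - 1, j)        # advance token only
            else:
                best, step = left, (i, j - 1)      # advance frame only
            acc[i][j] = cost[i - 1][j - 1] + best
            prev[i][j] = step

    # Walk back from the final cell to the origin.
    path = []
    i, j = n_tokens, n_frames
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = prev[i][j]
    path.reverse()
    return path


# Toy example: token 0 attends to frames 0-1, token 1 to frames 2-3.
similarity = [[1, 1, 0, 0], [0, 0, 1, 1]]
toy_cost = [[-value for value in row] for row in similarity]
toy_path = dtw_path(toy_cost)
```

Negating the attention matrix turns "high attention" into "low cost", so the cheapest path follows the ridge of strongest token-to-frame attention.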

Step 5: Word Boundary Detection

Map the token-level DTW alignment to word-level boundaries using the tokenizer's word splitting. Compute start and end times for each word, along with a probability score derived from the model's token-level confidence.

What happens:

Tokens are grouped into words using the tokenizer's split_to_word_tokens method
Jump points in the DTW alignment path mark token transitions
Word boundaries are derived from cumulative token counts
Start/end times are computed from the alignment path's time indices
Word probabilities are averaged from constituent token probabilities
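The boundary computation above can be sketched as follows, assuming the alignment path ends with one extra end-of-text token so that the last word has an end time, and assuming each attention frame spans 20 ms (50 frames per second). The function name words_from_path and its argument layout are hypothetical.

```python
FRAMES_PER_SECOND = 50  # assumption: one attention frame per 20 ms of audio


def words_from_path(path, word_lengths, token_probs):
    """Turn a DTW path into per-word start, end, and probability values.

    path: monotonic (token_index, frame_index) pairs, including a final
          end-of-text token after the last word.
    word_lengths: number of tokens in each word.
    token_probs: one probability per text token (end-of-text excluded).
    """
    # Jump points: the first frame at which each token appears in the path.
    jump_times = []
    previous_token = None
    for token, frame in path:
        if token != previous_token:
            jump_times.append(frame / FRAMES_PER_SECOND)
            previous_token = token

    # Cumulative token counts give the token index where each word starts.
    boundaries = [0]
    for length in word_lengths:
        boundaries.append(boundaries[-1] + length)

    words = []
    for first, after_last in zip(boundaries[:-1], boundaries[1:]):
        probs = token_probs[first:after_last]
        words.append({
            "start": jump_times[first],
            "end": jump_times[after_last],
            "probability": sum(probs) / len(probs),
        })
    return words


# Three text tokens (a two-token word and a one-token word) plus end-of-text.
toy_alignment = [(0, 0), (0, 1), (1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
toy_words = words_from_path(toy_alignment, [2, 1], [0.75, 0.25, 0.5])
```

A word ends where the next word's first token begins, which is why the trailing end-of-text token is needed to close the final word.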

Step 6: Punctuation Merging and Anomaly Handling

Merge punctuation marks with their adjacent words for natural grouping. Detect and handle hallucinated segments by scoring words for anomalies (too short, too long, or low probability). Skip silent periods before possible hallucinations to improve output quality.

Key considerations:

Prepended punctuations (quotes, brackets) merge with the following word
Appended punctuations (periods, commas) merge with the preceding word
Word anomaly scoring flags words with probability below 0.15 or with unusual duration (shorter than 0.133 seconds or longer than 2.0 seconds)
Segments with high anomaly scores may trigger seek adjustments to skip hallucinations
Long words at sentence boundaries are truncated to twice the median word duration
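Both ideas in this step can be sketched briefly. The scoring function uses the 0.15 probability threshold stated above; the duration thresholds (0.133 s and 2.0 s) and their weights are assumptions about the reference implementation. The merge function is a simplified illustration that works on plain word strings with a small, illustrative punctuation set; it is not the library's routine.

```python
PREPENDED = set("\"'([{-")      # illustrative opening marks
APPENDED = set("\"'.,!?:)]}")   # illustrative closing marks


def word_anomaly_score(word):
    """Score a word dict; higher means more likely to be hallucinated."""
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:
        score += 1.0                       # low-confidence word
    if duration < 0.133:
        score += (0.133 - duration) * 15   # implausibly short
    if duration > 2.0:
        score += duration - 2.0            # implausibly long
    return score


def merge_punctuation(words):
    """Attach opening marks to the next word and closing marks to the previous one."""
    merged = []
    prefix = ""
    for word in words:
        if word.startswith(" ") and word.strip() in PREPENDED:
            prefix += word                 # hold until the next real word
        elif merged and not prefix and word in APPENDED:
            merged[-1] += word             # glue onto the preceding word
        else:
            merged.append(prefix + word)
            prefix = ""
    if prefix:
        merged.append(prefix)              # dangling opening mark at the end
    return merged


toy_merged = merge_punctuation([" (", "hello", ",", " world", ")", "."])
```

Summing such scores over the first few words of a segment gives a segment-level signal that can be compared against a threshold before deciding to skip ahead.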

Step 7: Word-Level Output Generation

Produce the final output with word-level timing embedded in each segment. When generating subtitle formats (VTT/SRT), support word highlighting where each word is underlined as it is spoken, and configurable line width and line count for subtitle formatting.

Available options:

highlight_words: Underline each word at its spoken time in VTT/SRT
max_line_width: Maximum characters per subtitle line
max_line_count: Maximum lines per subtitle segment
max_words_per_line: Maximum words before line break
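The effect of highlight_words can be sketched as one subtitle cue per word, with the active word wrapped in underline tags. This is a simplified illustration under stated assumptions: words are given without leading spaces, gaps between words are not given their own un-highlighted cues, and line-width options are ignored. The names format_timestamp and highlighted_cues are hypothetical.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def highlighted_cues(words):
    """Return (start, end, text) cues, underlining one word per cue."""
    cues = []
    for current, timing in enumerate(words):
        text = " ".join(
            f"<u>{w['word']}</u>" if index == current else w["word"]
            for index, w in enumerate(words)
        )
        cues.append((
            format_timestamp(timing["start"]),
            format_timestamp(timing["end"]),
            text,
        ))
    return cues


toy_cues = highlighted_cues([
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.25},
])
```

For VTT output the same cues apply with a period instead of a comma as the millisecond separator.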
