Workflow:Openai Whisper Word Level Timestamps

From Leeroopedia
Knowledge Sources
Domains: Speech_Recognition, Audio_Alignment, Subtitle_Generation
Last Updated: 2025-06-25 00:00 GMT

Overview

End-to-end process for transcribing audio with precise word-level timing by combining Whisper transcription with cross-attention-based Dynamic Time Warping alignment.

Description

This workflow extends the standard Whisper transcription pipeline to produce a start and end timestamp for each recognized word. After performing segment-level transcription, it extracts cross-attention weights from specific alignment heads in the decoder, normalizes and median-filters them to reduce noise, then uses Dynamic Time Warping (DTW) to align text tokens to audio frames. The result is a transcript in which every word has a start time, an end time, and a probability score, enabling applications like karaoke-style subtitles, audio search, and content editing.

Key capabilities:

  • Per-word start and end timestamps with probability scores
  • Hallucination detection and silent period skipping
  • Punctuation merging for natural word groupings
  • Word-highlighted subtitle generation (VTT/SRT with <u> underline tags)
  • GPU-accelerated DTW via Triton kernels (with CPU fallback)

Usage

Use this workflow when segment-level timestamps are not enough, for example when building subtitle files with per-word highlighting, aligning transcripts to audio for editing, or searching audio with word-level precision. It requires passing word_timestamps=True to the transcription call. Word-level timestamps on translations may not be reliable.
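A minimal sketch of how the workflow is typically invoked from Python, assuming the openai-whisper package is installed. The helper names flatten_words and transcribe_with_words, the default model name, and the audio path are illustrative assumptions, not part of the library.

```python
from typing import Dict, List


def flatten_words(result: Dict) -> List[Dict]:
    """Collect the per-word entries of every segment into one flat list."""
    return [
        word
        for segment in result.get("segments", [])
        for word in segment.get("words", [])
    ]


def transcribe_with_words(audio_path: str, model_name: str = "base") -> List[Dict]:
    """Transcribe a file and return its words with start, end, and probability."""
    # Imported lazily so flatten_words stays usable without the package.
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)
```

Calling transcribe_with_words("speech.wav") would return a list of dictionaries, one per word, which the later steps of this workflow populate.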

Execution Steps

Step 1: Model Loading with Alignment Heads

Load the Whisper model and ensure alignment heads metadata is available. Alignment heads are specific cross-attention heads (identified by layer and head index) that have been found to correlate highly with word-level audio-text timing. For each official model, this metadata is stored as a compressed, base85-encoded boolean array with one flag per (layer, head) pair.

Key considerations:

  • Alignment heads are pre-computed and bundled with the model
  • Only official model variants have alignment head metadata
  • The heads are registered on the model through set_alignment_heads(); whisper.load_model() does this automatically for official model names
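The storage scheme can be illustrated with a small round trip using only the standard library. This is a sketch under assumptions: the function names are hypothetical, the gzip layer is assumed as the compression step, and the real library works with tensors rather than Python lists.

```python
import base64
import gzip
from typing import List, Sequence, Tuple


def encode_head_mask(mask: Sequence[Sequence[bool]]) -> bytes:
    """Pack an (n_layers x n_heads) boolean mask into a base85 string."""
    flat = bytes(int(bool(flag)) for row in mask for flag in row)
    return base64.b85encode(gzip.compress(flat))


def decode_head_mask(dump: bytes, n_layers: int, n_heads: int) -> List[Tuple[int, int]]:
    """Return the (layer, head) pairs flagged as alignment heads."""
    flat = gzip.decompress(base64.b85decode(dump))
    if len(flat) != n_layers * n_heads:
        raise ValueError("mask size does not match the model dimensions")
    # Row-major layout: index = layer * n_heads + head.
    return [divmod(index, n_heads) for index, flag in enumerate(flat) if flag]
```

For a toy model with 2 layers and 3 heads, encoding a mask and decoding it again yields the list of flagged (layer, head) pairs.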

Step 2: Transcription with Cross-Attention Capture

Run the standard sliding-window transcription pipeline to obtain segment-level results. When word_timestamps is enabled, the transcription loop hands the segments of each 30-second window to the word timestamp extraction step as soon as the window has been decoded.

What happens:

  • Standard transcription executes (mel spectrogram, language detection, decoding)
  • After each window is decoded, the segments are passed to the word timestamp extraction step
  • The mel segment for the current window is preserved for alignment

Step 3: Cross-Attention Weight Extraction

For each decoded segment, re-run a forward pass through the model with hooks installed on the cross-attention layers. These hooks capture the query-key attention matrices that reveal which audio frames correspond to which text tokens. The fused Scaled-Dot-Product Attention (SDPA) path is temporarily disabled for this pass, because it does not expose the attention weights; the explicit computation does.

What happens:

  • Forward hooks are registered on all cross-attention layers in the decoder
  • A forward pass with the transcribed tokens (without timestamps) captures attention matrices
  • Token probabilities are computed from the logits for word confidence scores
  • Only the alignment heads (not all attention heads) are selected for further processing
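The last two points can be sketched in plain Python. The function names and the nested-list layout (captured[layer][head] holding one attention matrix) are assumptions made for illustration; the actual implementation operates on tensors.

```python
import math
from typing import List, Sequence, Tuple


def token_probabilities(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Softmax each row of logits and pick the probability of the chosen token."""
    probabilities = []
    for row, token_id in zip(logits, token_ids):
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in row]
        probabilities.append(exps[token_id] / sum(exps))
    return probabilities


def select_heads(captured: Sequence[Sequence[object]], heads: Sequence[Tuple[int, int]]) -> list:
    """Keep only the attention matrices of the listed (layer, head) pairs."""
    return [captured[layer][head] for layer, head in heads]
```

Here logits holds one row of vocabulary scores per text token, and token_ids holds the token that was actually emitted at each position.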

Step 4: Dynamic Time Warping Alignment

Normalize the extracted attention weights, apply median filtering to reduce noise, then compute a DTW alignment between text tokens and audio frames. This produces a monotonic mapping from tokens to time positions.

What happens:

  • Attention weights from alignment heads are stacked and trimmed to the actual audio frames
  • Softmax normalization, z-score standardization, and median filtering are applied
  • Weights are averaged across alignment heads to produce a single token-frame matrix
  • DTW finds the optimal monotonic alignment path through the cost matrix, taken here as the negated token-frame matrix so that high attention means low cost
  • GPU acceleration via Triton kernels when available, with CPU/numba fallback
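The averaging and alignment steps can be sketched in pure Python as follows. This is a simplified illustration, not the accelerated kernel: the function names are hypothetical, the softmax, standardization, and median filtering are assumed to have been applied already, and the cost matrix is taken to be the negated average.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mean_over_heads(weights: Sequence[Matrix]) -> Matrix:
    """Average several (tokens x frames) matrices element-wise."""
    n_heads, n_rows, n_cols = len(weights), len(weights[0]), len(weights[0][0])
    return [
        [sum(head[i][j] for head in weights) / n_heads for j in range(n_cols)]
        for i in range(n_rows)
    ]


def dtw_path(cost: Matrix) -> Tuple[List[int], List[int]]:
    """Return (token_indices, frame_indices) of the cheapest monotonic path."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]  # accumulated cost
    step = [[0] * (m + 1) for _ in range(n + 1)]  # move that led to each cell
    acc[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            diag, up, left = acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            if diag < up and diag < left:
                best, move = diag, 0  # advance token and frame together
            elif up < diag and up < left:
                best, move = up, 1  # advance token, stay on the same frame
            else:
                best, move = left, 2  # advance frame, stay on the same token
            acc[i][j] = cost[i - 1][j - 1] + best
            step[i][j] = move
    # Walk back from the bottom-right corner to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        move = step[i][j]
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path]
```

Passing the negated average of the head matrices to dtw_path yields two parallel index lists: the token index and the audio frame index at each step of the alignment.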

Step 5: Word Boundary Detection

Map the token-level DTW alignment to word-level boundaries using the tokenizer's word splitting. Compute start and end times for each word, along with a probability score derived from the model's token-level confidence.

What happens:

  • Tokens are grouped into words using the tokenizer's split_to_word_tokens method
  • Jump points in the DTW alignment path mark token transitions
  • Word boundaries are derived from cumulative token counts
  • Start/end times are computed from the alignment path's time indices
  • Word probabilities are averaged from constituent token probabilities
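A sketch of how an alignment path might be turned into word timings. The function names are hypothetical, the default of 50 frames per second (20 ms per attention frame) is an assumption, and the token sequence is assumed to end with one extra token (such as end-of-text) so that the last word has a closing jump.

```python
from typing import List, Sequence, Tuple


def word_times(
    text_indices: Sequence[int],
    time_indices: Sequence[int],
    word_token_counts: Sequence[int],
    frames_per_second: float = 50.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) in seconds for each word."""
    # A jump is the first path step at which a new token begins.
    jump_times = [
        frame / frames_per_second
        for k, (token, frame) in enumerate(zip(text_indices, time_indices))
        if k == 0 or token != text_indices[k - 1]
    ]
    # Cumulative token counts give the index of each word's first token.
    boundaries = [0]
    for count in word_token_counts:
        boundaries.append(boundaries[-1] + count)
    return [(jump_times[a], jump_times[b]) for a, b in zip(boundaries, boundaries[1:])]


def word_probabilities(
    token_probs: Sequence[float], word_token_counts: Sequence[int]
) -> List[float]:
    """Average the token probabilities that belong to each word."""
    result, start = [], 0
    for count in word_token_counts:
        chunk = token_probs[start:start + count]
        result.append(sum(chunk) / len(chunk))
        start += count
    return result
```

In this sketch a word ends where the next word's first token begins, so consecutive words share a boundary.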

Step 6: Punctuation Merging and Anomaly Handling

Merge punctuation marks with their adjacent words for natural grouping. Detect and handle hallucinated segments by scoring words for anomalies (too short, too long, or low probability). Skip silent periods before possible hallucinations to improve output quality.

Key considerations:

  • Prepended punctuations (quotes, brackets) merge with the following word
  • Appended punctuations (periods, commas) merge with the preceding word
  • Word anomaly scoring flags words with probability below 0.15 or unusual duration
  • Segments with high anomaly scores may trigger seek adjustments to skip hallucinations
  • Long words at sentence boundaries are truncated to twice the median word duration
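The merging and scoring rules above can be sketched as follows. Only the 0.15 probability threshold comes from the text; the punctuation sets, the duration thresholds (0.133 s and 2.0 s), the weighting factor, and the function names are assumptions chosen for illustration.

```python
from typing import Dict, List

PREPENDED = set("\"'“¿([{-")  # assumed set: merges with the following word
APPENDED = set("\"'.。,，!！?？:：”)]}、")  # assumed set: merges with the preceding word


def merge_punctuations(words: List[Dict]) -> List[Dict]:
    """Attach stand-alone punctuation to its neighbouring word."""
    merged: List[Dict] = []
    carry = None  # prepended punctuation waiting for the next word
    for word in words:
        current = dict(word)  # copy so the input is not mutated
        if carry is not None:
            current["word"] = carry["word"] + current["word"]
            current["start"] = carry["start"]
            carry = None
        text = word["word"]
        if text.startswith(" ") and text.strip() in PREPENDED:
            carry = current
        elif merged and text in APPENDED:
            merged[-1]["word"] += current["word"]
            merged[-1]["end"] = current["end"]
        else:
            merged.append(current)
    if carry is not None:
        merged.append(carry)
    return merged


def word_anomaly_score(
    word: Dict,
    min_probability: float = 0.15,
    min_duration: float = 0.133,  # assumed threshold in seconds
    max_duration: float = 2.0,  # assumed threshold in seconds
) -> float:
    """Score a word higher the more it looks like a hallucination."""
    duration = word["end"] - word["start"]
    score = 0.0
    if word.get("probability", 0.0) < min_probability:
        score += 1.0
    if duration < min_duration:
        score += (min_duration - duration) * 15
    if duration > max_duration:
        score += duration - max_duration
    return score
```

A segment-level decision could then sum these scores over the first few words of a segment and compare the total against a threshold before adjusting the seek position.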

Step 7: Word-Level Output Generation

Produce the final output with word-level timing embedded in each segment. When writing subtitle formats (VTT/SRT), word highlighting underlines each word while it is spoken, and the line width and line count of each subtitle are configurable.

Available options:

  • highlight_words: Underline each word at its spoken time in VTT/SRT
  • max_line_width: Maximum characters per subtitle line
  • max_line_count: Maximum lines per subtitle segment
  • max_words_per_line: Maximum words before line break
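To illustrate what highlight_words produces, the sketch below builds SRT cues by hand, one cue per word, with the spoken word wrapped in underline tags. It is a simplified stand-in for the library's subtitle writer: the function names are hypothetical and line-width handling is left out.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def highlighted_cues(words: List[Dict]) -> List[str]:
    """Emit one SRT cue per word, underlining the word being spoken."""
    cues = []
    for index, active in enumerate(words):
        parts = []
        for k, word in enumerate(words):
            text = word["word"]
            if k == index:
                stripped = text.lstrip()
                lead = text[: len(text) - len(stripped)]  # keep leading space outside the tag
                text = f"{lead}<u>{stripped}</u>"
            parts.append(text)
        line = "".join(parts).strip()
        start = format_srt_time(active["start"])
        end = format_srt_time(active["end"])
        cues.append(f"{index + 1}\n{start} --> {end}\n{line}\n")
    return cues
```

Joining the returned cues with blank lines gives a playable SRT file in which the underline moves from word to word.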
