Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Openai Whisper Dynamic Time Warping

From Leeroopedia
Revision as of 17:33, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Openai_Whisper_Dynamic_Time_Warping.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Overview

Dynamic Time Warping (DTW) is a dynamic programming algorithm for finding the optimal alignment between two sequences that may vary in time or speed. In the context of speech recognition, DTW is used to align text tokens with audio frames by finding the minimum-cost monotonic path through a cost matrix derived from (negative) cross-attention weights.

Domain

  • Speech Recognition
  • Dynamic Programming
  • Sequence Alignment

Theoretical Foundation

Given a cost matrix C of shape (N, M) where N is the number of text tokens and M is the number of audio frames, DTW finds a path P = [(i_1, j_1), ..., (i_K, j_K)] that minimizes the total cost:

minimize  sum( C[i_k, j_k] )  for k = 1..K

subject to three constraints:

Boundary Constraint

The path must start at the top-left corner and end at the bottom-right corner:

  • P starts at (0, 0)
  • P ends at (N-1, M-1)

Monotonicity Constraint

The path must move forward in both dimensions (never go backwards in time):

  • i_{k+1} >= i_k
  • j_{k+1} >= j_k

Step Size Constraint

Each step in the path must be one of three moves:

  • (1, 1) -- diagonal (advance both token and frame)
  • (1, 0) -- vertical (advance token, stay on frame)
  • (0, 1) -- horizontal (stay on token, advance frame)

Dynamic Programming Recurrence

The algorithm fills an accumulated cost matrix D using the recurrence:

D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1])

The base case is D[0, 0] = C[0, 0], with boundary rows and columns initialized by cumulative sums.

The optimal cost is D[N-1, M-1], and the optimal path is recovered by backtracing from (N-1, M-1) to (0, 0), choosing at each step the predecessor cell with minimum accumulated cost.

Application in Whisper

In Whisper's word-level timestamp extraction:

  1. The cross-attention weight matrix A (shape (T, F)) is computed, normalized, and filtered.
  2. The cost matrix is constructed as C = -A (negated because DTW minimizes cost, but we want to follow high-attention paths).
  3. DTW produces an alignment path mapping each text token index to its corresponding audio frame index.
  4. This path is used to assign start and end times to each word.

Properties

  • Optimal substructure: The optimal alignment to (i, j) depends only on optimal alignments to its predecessors.
  • Time complexity: O(N * M) for the forward pass.
  • Space complexity: O(N * M) for the cost and trace matrices.
  • Monotonicity guarantee: The alignment always preserves temporal ordering -- earlier tokens map to earlier-or-equal frames.

Comparison with Other Alignment Methods

Method Approach Advantage Limitation
DTW Dynamic programming on cost matrix Exact optimal alignment O(N*M) complexity
CTC Forced Alignment CTC loss-based alignment Integrates with CTC models Requires CTC-trained model
HMM-based Hidden Markov Model alignment Handles noise well Requires phoneme models
Greedy argmax Take argmax per token Very fast No monotonicity guarantee

Implementation

Implementation:Openai_Whisper_DTW

Metadata

2025-06-25 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment