Principle:Openai Whisper Dynamic Time Warping

Overview

Dynamic Time Warping (DTW) is a dynamic programming algorithm for finding the optimal alignment between two sequences that may vary in time or speed. In the context of speech recognition, DTW is used to align text tokens with audio frames by finding the minimum-cost monotonic path through a cost matrix derived from (negative) cross-attention weights.

Domain

Speech Recognition
Dynamic Programming
Sequence Alignment

Theoretical Foundation

Given a cost matrix C of shape (N, M) where N is the number of text tokens and M is the number of audio frames, DTW finds a path P = [(i_1, j_1), ..., (i_K, j_K)] that minimizes the total cost:

minimize  sum( C[i_k, j_k] )  for k = 1..K

subject to three constraints:

Boundary Constraint

The path must start at the top-left corner and end at the bottom-right corner:

P starts at (0, 0)
P ends at (N-1, M-1)

Monotonicity Constraint

The path must move forward in both dimensions (never go backwards in time):

i_{k+1} >= i_k
j_{k+1} >= j_k

Step Size Constraint

Each step in the path must be one of three moves:

(1, 1) -- diagonal (advance both token and frame)
(1, 0) -- vertical (advance token, stay on frame)
(0, 1) -- horizontal (stay on token, advance frame)

Dynamic Programming Recurrence

The algorithm fills an accumulated cost matrix D using the recurrence:

D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1])

The base case is D[0, 0] = C[0, 0], with boundary rows and columns initialized by cumulative sums.

The optimal cost is D[N-1, M-1], and the optimal path is recovered by backtracing from (N-1, M-1) to (0, 0), choosing at each step the predecessor cell with minimum accumulated cost.

Application in Whisper

In Whisper's word-level timestamp extraction:

The cross-attention weight matrix A (shape (T, F)) is computed, normalized, and filtered.
The cost matrix is constructed as C = -A (negated because DTW minimizes cost, but we want to follow high-attention paths).
DTW produces an alignment path mapping each text token index to its corresponding audio frame index.
This path is used to assign start and end times to each word.

Properties

Optimal substructure: The optimal alignment to (i, j) depends only on optimal alignments to its predecessors.
Time complexity: O(N * M) for the forward pass.
Space complexity: O(N * M) for the cost and trace matrices.
Monotonicity guarantee: The alignment always preserves temporal ordering -- earlier tokens map to earlier-or-equal frames.

Comparison with Other Alignment Methods

Method	Approach	Advantage	Limitation
DTW	Dynamic programming on cost matrix	Exact optimal alignment	O(N*M) complexity
CTC Forced Alignment	CTC loss-based alignment	Integrates with CTC models	Requires CTC-trained model
HMM-based	Hidden Markov Model alignment	Handles noise well	Requires phoneme models
Greedy argmax	Take argmax per token	Very fast	No monotonicity guarantee

Implementation

Implementation:Openai_Whisper_DTW

Metadata

2025-06-25 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment