Principle:Openai Whisper Dynamic Time Warping
Overview
Dynamic Time Warping (DTW) is a dynamic programming algorithm for finding the optimal alignment between two sequences that may vary in time or speed. In the context of speech recognition, DTW is used to align text tokens with audio frames by finding the minimum-cost monotonic path through a cost matrix derived from (negative) cross-attention weights.
Domain
- Speech Recognition
- Dynamic Programming
- Sequence Alignment
Theoretical Foundation
Given a cost matrix C of shape (N, M) where N is the number of text tokens and M is the number of audio frames, DTW finds a path P = [(i_1, j_1), ..., (i_K, j_K)] that minimizes the total cost:
minimize sum( C[i_k, j_k] ) for k = 1..K
subject to three constraints:
Boundary Constraint
The path must start at the top-left corner and end at the bottom-right corner:
- P starts at (0, 0)
- P ends at (N-1, M-1)
Monotonicity Constraint
The path must move forward in both dimensions (never go backwards in time):
- i_{k+1} >= i_k
- j_{k+1} >= j_k
Step Size Constraint
Each step in the path must be one of three moves:
- (1, 1) -- diagonal (advance both token and frame)
- (1, 0) -- vertical (advance token, stay on frame)
- (0, 1) -- horizontal (stay on token, advance frame)
Dynamic Programming Recurrence
The algorithm fills an accumulated cost matrix D using the recurrence:
D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1])
The base case is D[0, 0] = C[0, 0], with boundary rows and columns initialized by cumulative sums.
The optimal cost is D[N-1, M-1], and the optimal path is recovered by backtracing from (N-1, M-1) to (0, 0), choosing at each step the predecessor cell with minimum accumulated cost.
Application in Whisper
In Whisper's word-level timestamp extraction:
- The cross-attention weight matrix A (shape (T, F)) is computed, normalized, and filtered.
- The cost matrix is constructed as C = -A (negated because DTW minimizes cost, but we want to follow high-attention paths).
- DTW produces an alignment path mapping each text token index to its corresponding audio frame index.
- This path is used to assign start and end times to each word.
Properties
- Optimal substructure: The optimal alignment to (i, j) depends only on optimal alignments to its predecessors.
- Time complexity: O(N * M) for the forward pass.
- Space complexity: O(N * M) for the cost and trace matrices.
- Monotonicity guarantee: The alignment always preserves temporal ordering -- earlier tokens map to earlier-or-equal frames.
Comparison with Other Alignment Methods
| Method | Approach | Advantage | Limitation |
|---|---|---|---|
| DTW | Dynamic programming on cost matrix | Exact optimal alignment | O(N*M) complexity |
| CTC Forced Alignment | CTC loss-based alignment | Integrates with CTC models | Requires CTC-trained model |
| HMM-based | Hidden Markov Model alignment | Handles noise well | Requires phoneme models |
| Greedy argmax | Take argmax per token | Very fast | No monotonicity guarantee |
Implementation
Implementation:Openai_Whisper_DTW