Implementation:Openai Whisper DTW
Overview
dtw() is a dispatcher function that performs Dynamic Time Warping on a cost matrix, selecting between a Numba JIT-compiled CPU backend and a Triton-based CUDA backend depending on the tensor device. It is used in Whisper's word-level timestamp pipeline to find the optimal monotonic alignment between text tokens and audio frames.
Source
- File:
whisper/timing.py, lines 82-151 - Triton kernel:
whisper/triton_ops.py, lines 13-41 (dtw_kernel) - Repository: https://github.com/openai/whisper
- Import:
from whisper.timing import dtw
Signatures
Main Dispatcher
def dtw(x: torch.Tensor) -> np.ndarray:
CPU Backend (Numba JIT)
@numba.jit(nopython=True, parallel=True)
def dtw_cpu(x: np.ndarray) -> np.ndarray:
CUDA Backend (Triton)
def dtw_cuda(x, BLOCK_SIZE=1024) -> np.ndarray:
Parameters
| Parameter | Type | Description |
|---|---|---|
| x | torch.Tensor | Cost matrix of shape (N, M) where N = number of text tokens and M = number of audio frames. Typically the negated attention weights. |
Return Value
Returns an np.ndarray of shape (2, path_length) containing:
- Row 0: text token indices along the alignment path
- Row 1: audio frame indices along the alignment path
Behavior
Dispatcher (dtw)
- Checks if the input tensor is on a CUDA device.
- If CUDA: calls
dtw_cuda(x). - If CPU: calls
dtw_cpu(x.double().numpy()).
CPU Backend (dtw_cpu)
- Allocates accumulated cost matrix and trace matrix of shape (N, M).
- Fills the accumulated cost matrix using the DP recurrence:
D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1]). - Uses Numba's
nopython=Trueandparallel=Truefor JIT-compiled, parallelized execution. - Backtraces from (N-1, M-1) to (0, 0) to recover the optimal path.
CUDA Backend (dtw_cuda)
- Uses a skewed matrix layout to enable parallel computation of anti-diagonals on the GPU.
- Launches a Triton kernel (
dtw_kernel) for each anti-diagonal, where all cells on an anti-diagonal are independent and can be computed in parallel. - After the forward pass, performs backtrace on CPU to recover the optimal path.
- The
BLOCK_SIZEparameter controls Triton kernel parallelism (default 1024).
Backtrace
Both backends produce a trace matrix encoding which predecessor cell was chosen at each position. The backtrace follows this trace from (N-1, M-1) back to (0, 0), collecting the path indices.
Example Usage
from whisper.timing import dtw
import torch
# Negative attention matrix as cost
cost_matrix = -attention_weights # shape (num_tokens, num_frames)
text_indices, time_indices = dtw(cost_matrix)
# text_indices[k] maps to time_indices[k]
Performance Characteristics
| Backend | Technology | Best For |
|---|---|---|
| dtw_cpu | Numba JIT (nopython, parallel) | CPU-only environments, moderate-size matrices |
| dtw_cuda | Triton kernel with anti-diagonal parallelism | GPU environments, large matrices |
The anti-diagonal parallelization in the CUDA backend exploits the fact that all cells (i, j) where i + j = k (a constant) are independent given the previous anti-diagonal, allowing them to be computed simultaneously.
Links
Principle:Openai_Whisper_Dynamic_Time_Warping
Environment:Openai_Whisper_Numba Environment:Openai_Whisper_Triton