Implementation:Openai Whisper DTW

Overview

dtw() is a dispatcher function that performs Dynamic Time Warping on a cost matrix, selecting between a Numba JIT-compiled CPU backend and a Triton-based CUDA backend depending on the tensor device. It is used in Whisper's word-level timestamp pipeline to find the optimal monotonic alignment between text tokens and audio frames.

Source

File: whisper/timing.py, lines 82-151
Triton kernel: whisper/triton_ops.py, lines 13-41 (dtw_kernel)
Repository: https://github.com/openai/whisper
Import: from whisper.timing import dtw

Signatures

Main Dispatcher

def dtw(x: torch.Tensor) -> np.ndarray:

CPU Backend (Numba JIT)

@numba.jit(nopython=True, parallel=True)
def dtw_cpu(x: np.ndarray) -> np.ndarray:

CUDA Backend (Triton)

def dtw_cuda(x, BLOCK_SIZE=1024) -> np.ndarray:

Parameters

Parameter	Type	Description
x	torch.Tensor	Cost matrix of shape (N, M) where N = number of text tokens and M = number of audio frames. Typically the negated attention weights.

Return Value

Returns an np.ndarray of shape (2, path_length) containing:

Row 0: text token indices along the alignment path
Row 1: audio frame indices along the alignment path

Behavior

Dispatcher (dtw)

Checks if the input tensor is on a CUDA device.
If CUDA: calls dtw_cuda(x).
If CPU: calls dtw_cpu(x.double().numpy()).

CPU Backend (dtw_cpu)

Allocates accumulated cost matrix and trace matrix of shape (N, M).
Fills the accumulated cost matrix using the DP recurrence: D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1]).
Uses Numba's nopython=True and parallel=True for JIT-compiled, parallelized execution.
Backtraces from (N-1, M-1) to (0, 0) to recover the optimal path.

CUDA Backend (dtw_cuda)

Uses a skewed matrix layout to enable parallel computation of anti-diagonals on the GPU.
Launches a Triton kernel (dtw_kernel) for each anti-diagonal, where all cells on an anti-diagonal are independent and can be computed in parallel.
After the forward pass, performs backtrace on CPU to recover the optimal path.
The BLOCK_SIZE parameter controls Triton kernel parallelism (default 1024).

Backtrace

Both backends produce a trace matrix encoding which predecessor cell was chosen at each position. The backtrace follows this trace from (N-1, M-1) back to (0, 0), collecting the path indices.

Example Usage

from whisper.timing import dtw
import torch

# Negative attention matrix as cost
cost_matrix = -attention_weights  # shape (num_tokens, num_frames)
text_indices, time_indices = dtw(cost_matrix)
# text_indices[k] maps to time_indices[k]

Performance Characteristics

Backend	Technology	Best For
dtw_cpu	Numba JIT (nopython, parallel)	CPU-only environments, moderate-size matrices
dtw_cuda	Triton kernel with anti-diagonal parallelism	GPU environments, large matrices

The anti-diagonal parallelization in the CUDA backend exploits the fact that all cells (i, j) where i + j = k (a constant) are independent given the previous anti-diagonal, allowing them to be computed simultaneously.

Links

Principle:Openai_Whisper_Dynamic_Time_Warping

Environment:Openai_Whisper_Numba Environment:Openai_Whisper_Triton

Metadata

2025-06-25 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment