Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Openai Whisper DTW

From Leeroopedia

Overview

dtw() is a dispatcher function that performs Dynamic Time Warping on a cost matrix, selecting between a Numba JIT-compiled CPU backend and a Triton-based CUDA backend depending on the tensor device. It is used in Whisper's word-level timestamp pipeline to find the optimal monotonic alignment between text tokens and audio frames.

Source

  • File: whisper/timing.py, lines 82-151
  • Triton kernel: whisper/triton_ops.py, lines 13-41 (dtw_kernel)
  • Repository: https://github.com/openai/whisper
  • Import: from whisper.timing import dtw

Signatures

Main Dispatcher

def dtw(x: torch.Tensor) -> np.ndarray:

CPU Backend (Numba JIT)

@numba.jit(nopython=True, parallel=True)
def dtw_cpu(x: np.ndarray) -> np.ndarray:

CUDA Backend (Triton)

def dtw_cuda(x, BLOCK_SIZE=1024) -> np.ndarray:

Parameters

Parameter Type Description
x torch.Tensor Cost matrix of shape (N, M) where N = number of text tokens and M = number of audio frames. Typically the negated attention weights.

Return Value

Returns an np.ndarray of shape (2, path_length) containing:

  • Row 0: text token indices along the alignment path
  • Row 1: audio frame indices along the alignment path

Behavior

Dispatcher (dtw)

  1. Checks if the input tensor is on a CUDA device.
  2. If CUDA: calls dtw_cuda(x).
  3. If CPU: calls dtw_cpu(x.double().numpy()).

CPU Backend (dtw_cpu)

  1. Allocates accumulated cost matrix and trace matrix of shape (N, M).
  2. Fills the accumulated cost matrix using the DP recurrence: D[i, j] = C[i, j] + min(D[i-1, j-1], D[i-1, j], D[i, j-1]).
  3. Uses Numba's nopython=True and parallel=True for JIT-compiled, parallelized execution.
  4. Backtraces from (N-1, M-1) to (0, 0) to recover the optimal path.

CUDA Backend (dtw_cuda)

  1. Uses a skewed matrix layout to enable parallel computation of anti-diagonals on the GPU.
  2. Launches a Triton kernel (dtw_kernel) for each anti-diagonal, where all cells on an anti-diagonal are independent and can be computed in parallel.
  3. After the forward pass, performs backtrace on CPU to recover the optimal path.
  4. The BLOCK_SIZE parameter controls Triton kernel parallelism (default 1024).

Backtrace

Both backends produce a trace matrix encoding which predecessor cell was chosen at each position. The backtrace follows this trace from (N-1, M-1) back to (0, 0), collecting the path indices.

Example Usage

from whisper.timing import dtw
import torch

# Negative attention matrix as cost
cost_matrix = -attention_weights  # shape (num_tokens, num_frames)
text_indices, time_indices = dtw(cost_matrix)
# text_indices[k] maps to time_indices[k]

Performance Characteristics

Backend Technology Best For
dtw_cpu Numba JIT (nopython, parallel) CPU-only environments, moderate-size matrices
dtw_cuda Triton kernel with anti-diagonal parallelism GPU environments, large matrices

The anti-diagonal parallelization in the CUDA backend exploits the fact that all cells (i, j) where i + j = k (a constant) are independent given the previous anti-diagonal, allowing them to be computed simultaneously.

Links

Principle:Openai_Whisper_Dynamic_Time_Warping

Environment:Openai_Whisper_Numba Environment:Openai_Whisper_Triton

Metadata

2025-06-25 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment