Environment:Openai Whisper Triton

Knowledge Sources	OpenAI Whisper Triton
Domains	Infrastructure, GPU_Computing
Last Updated	2025-06-25 00:00 GMT

Overview

This environment provides the OpenAI Triton compiler, which builds the GPU kernels Whisper uses for DTW alignment and median filtering. Triton is an optional dependency: when it is unavailable, Whisper falls back to slower non-Triton implementations.

Description

Whisper includes custom Triton GPU kernels for two performance-critical operations in the word-level timestamp pipeline: Dynamic Time Warping (`dtw_kernel`) and median filtering (`median_filter_cuda`). Triton compiles Python-like kernel code to optimized GPU machine code at runtime. Triton is a conditional dependency: it is installed only on x86_64 Linux systems, and when its import fails, Whisper catches the resulting error and falls back to slower non-Triton implementations.
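
To make the DTW step concrete, here is a minimal pure-Python sketch of dynamic time warping over a cost matrix. It is not Whisper's `dtw_kernel` or its CPU fallback; the function name `dtw_path` and the list-of-lists input are assumptions chosen for illustration. Rows stand for text tokens and columns for audio frames.

```python
from typing import List, Sequence, Tuple


def dtw_path(cost: Sequence[Sequence[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Illustrative DTW (hypothetical helper, not Whisper's code).

    Returns the cheapest monotonic path from (0, 0) to (M-1, N-1)
    through an M x N cost matrix, together with its total cost.
    """
    m, n = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning the first i rows with the first j columns
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    # step[i][j]: 0 = diagonal, 1 = previous row, 2 = previous column
    step = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            k = candidates.index(min(candidates))
            acc[i][j] = cost[i - 1][j - 1] + candidates[k]
            step[i][j] = k

    # Walk back from the bottom-right corner to recover the path.
    i, j = m, n
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = step[i][j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path, acc[m][n]
```

Each cell depends on its upper, left, and upper-left neighbours, which is why the computation is a natural target for a custom GPU kernel.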

Usage

Use this environment when word-level timestamps are requested on a CUDA GPU. The Triton kernels accelerate the DTW alignment and median filtering steps. If Triton is not available or its kernels fail to compile, Whisper emits a warning and falls back to a Numba implementation (for DTW) and a PyTorch sort-based median (for filtering).
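
A quick way to guess which path a given machine will take is to check whether the `triton` package can be located at all. The sketch below is a hypothetical diagnostic, not part of Whisper; `module_available` and `likely_timestamp_backend` are names invented here. A positive result only means the package is importable, not that kernel compilation will succeed.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def likely_timestamp_backend(on_cuda: bool, has_triton: bool) -> str:
    """Hypothetical helper: name the path described in the text above.

    "triton" means the GPU kernels will at least be attempted;
    "fallback" means the slower non-Triton implementations are used.
    """
    if on_cuda and has_triton:
        return "triton"
    return "fallback"


if __name__ == "__main__":
    # on_cuda is hard-coded here; in practice it would come from the tensor's device.
    print(likely_timestamp_backend(on_cuda=True, has_triton=module_available("triton")))
```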

System Requirements

Category	Requirement	Notes
OS	Linux (x86_64 only)	Triton is conditionally installed only on x86_64 Linux
Hardware	NVIDIA CUDA GPU	Required for Triton kernel execution
CUDA Toolkit	Compatible with installed PyTorch	Triton compiles PTX at runtime

Dependencies

Python Packages

`triton` >= 2.0 (conditional: x86_64 Linux only)
`torch` with CUDA support

Credentials

No credentials required.

Quick Install

# Automatically installed on x86_64 Linux via:
pip install openai-whisper

# Or install manually:
# (quote the requirement so the shell does not treat ">" as a redirect)
pip install "triton>=2.0"

Code Evidence

Conditional dependency in `pyproject.toml:31`:

"triton>=2; (platform_machine=='x86_64' and sys_platform=='linux') or sys_platform=='linux2'",

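The marker above can be read as a plain boolean expression. The following sketch restates it in Python to show which platforms would pull in Triton; `triton_marker_applies` is a hypothetical name, and the function mirrors only the marker text quoted above, not pip's actual marker evaluator.

```python
import platform
import sys


def triton_marker_applies(platform_machine: str, sys_platform: str) -> bool:
    """Restate the quoted environment marker as a Python expression.

    In environment markers, platform_machine corresponds to
    platform.machine() and sys_platform to sys.platform.
    """
    return (
        platform_machine == "x86_64" and sys_platform == "linux"
    ) or sys_platform == "linux2"


if __name__ == "__main__":
    # Evaluate the marker for the interpreter running this script.
    print(triton_marker_applies(platform.machine(), sys.platform))
```
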
Import with error handling in `whisper/triton_ops.py:6-10`:

try:
    import triton
    import triton.language as tl
except ImportError:
    raise RuntimeError("triton import failed; try `pip install --pre triton`")

Graceful fallback for DTW from `whisper/timing.py:142-151`:

def dtw(x: torch.Tensor) -> np.ndarray:
    if x.is_cuda:
        try:
            return dtw_cuda(x)
        except (RuntimeError, subprocess.CalledProcessError):
            warnings.warn(
                "Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
                "falling back to a slower DTW implementation..."
            )
    return dtw_cpu(x.double().cpu().numpy())
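
The try/except shape above generalizes to a small pattern. The sketch below uses a hypothetical helper, `with_fallback`, that is not in Whisper; it only illustrates why `triton_ops.py` re-raises a failed import as `RuntimeError`: a plain `ImportError` would not match the `except` clause and would propagate instead of triggering the fallback.

```python
import subprocess
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(fast: Callable[[], T], slow: Callable[[], T], what: str) -> T:
    """Try the fast path; on the two caught error types, warn and use the slow path."""
    try:
        return fast()
    except (RuntimeError, subprocess.CalledProcessError):
        warnings.warn(f"fast {what} path failed; falling back to a slower implementation")
        return slow()


def broken_kernel() -> str:
    # Stands in for a GPU kernel launch that fails.
    raise RuntimeError("kernel launch failed")


if __name__ == "__main__":
    print(with_fallback(broken_kernel, lambda: "cpu result", "DTW"))
```

Any other exception type, such as `ValueError`, passes straight through this helper, which keeps genuine bugs visible.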

Graceful fallback for median filter from `whisper/timing.py:36-45`:

if x.is_cuda:
    try:
        from .triton_ops import median_filter_cuda
        result = median_filter_cuda(x, filter_width)
    except (RuntimeError, subprocess.CalledProcessError):
        warnings.warn(
            "Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
            "falling back to a slower median kernel implementation..."
        )
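
For intuition about a sort-based median, here is a 1-D pure-Python analogue: sort each sliding window and take its middle element. The name `median_filter_1d` and the reflect padding at the edges are assumptions of this sketch; Whisper's fallback operates on PyTorch tensors rather than lists.

```python
from typing import List, Sequence


def median_filter_1d(values: Sequence[float], filter_width: int) -> List[float]:
    """Sliding-window median via sorting (illustrative, not Whisper's code)."""
    if filter_width <= 0 or filter_width % 2 == 0:
        raise ValueError("filter_width must be a positive odd number")
    values = list(values)
    pad = filter_width // 2
    if len(values) <= pad:
        # Too short to reflect-pad; return the input unchanged.
        return values

    # Reflect padding that does not repeat the edge element,
    # e.g. [1, 2, 3, 4] with pad=2 becomes [3, 2, 1, 2, 3, 4, 3, 2].
    left = values[pad:0:-1]
    right = values[-2:-pad - 2:-1]
    padded = left + values + right

    # The median of each window is the middle element after sorting.
    return [
        sorted(padded[i:i + filter_width])[pad]
        for i in range(len(values))
    ]
```

With a width of 3, an isolated spike such as the `10` in `[0, 0, 10, 0, 0]` is removed entirely.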

Common Errors

Error Message	Cause	Solution
`RuntimeError: triton import failed; try pip install --pre triton`	Triton not installed when `triton_ops.py` is imported directly	`pip install "triton>=2.0"` (Linux x86_64 only)
`Failed to launch Triton kernels, likely due to missing CUDA toolkit`	CUDA toolkit not properly installed or incompatible	Install a CUDA toolkit that matches PyTorch, or ignore the warning (the slower fallback is used)
`RuntimeError` or `CalledProcessError` during kernel compilation	Triton compilation failure	Whisper falls back to the slower path automatically; ensure the CUDA toolkit matches PyTorch

Compatibility Notes

Linux x86_64 only: Triton is not installed on macOS, Windows, or ARM Linux. This is handled at the package level via conditional dependency in `pyproject.toml`.
Lazy import: Triton is only imported when GPU-accelerated DTW or median filtering is actually invoked (via `from .triton_ops import ...`). CPU-only workflows never trigger the import.
Graceful degradation: Both `dtw()` and `median_filter()` catch `RuntimeError` and `CalledProcessError`, emit a warning, and fall back to slower non-Triton implementations.
DTW BLOCK_SIZE constraint: The CUDA DTW kernel requires that the number of text tokens (M) be less than the `BLOCK_SIZE` (default 1024). This is asserted in `dtw_cuda()`.
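
Because the size constraint is enforced with an assertion, a violation raises `AssertionError`, which the `except (RuntimeError, subprocess.CalledProcessError)` clause quoted earlier does not catch. A caller that wants to avoid this could check the size up front. The sketch below is a hypothetical guard, not part of Whisper; `fits_cuda_dtw` is a name invented here, and the default of 1024 is taken from the note above.

```python
def fits_cuda_dtw(num_text_tokens: int, block_size: int = 1024) -> bool:
    """Return True if M (the number of text tokens) is strictly below block_size."""
    return num_text_tokens < block_size


if __name__ == "__main__":
    for m in (10, 1023, 1024):
        route = "CUDA DTW" if fits_cuda_dtw(m) else "needs another path"
        print(m, route)
```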
