Environment:Openai Whisper Triton
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2025-06-25 00:00 GMT |
Overview
OpenAI Triton compiler for GPU-accelerated kernels used in Whisper's DTW alignment and median filtering operations. Optional dependency; when it is unavailable, Whisper falls back to slower non-Triton implementations.
Description
Whisper includes custom Triton GPU kernels for two performance-critical operations in the word-level timestamp pipeline: Dynamic Time Warping (`dtw_kernel`) and median filtering (`median_filter_cuda`). Triton compiles Python-like kernel code to optimized GPU machine code at runtime. Triton is a conditional dependency: it is declared only for x86_64 Linux systems. If the import fails, `triton_ops.py` raises a `RuntimeError`, which the callers catch before falling back to a non-Triton implementation.
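For readers unfamiliar with Dynamic Time Warping, the following is a minimal pure-Python sketch of the idea: accumulate the cheapest cost of reaching each cell of a cost matrix, then backtrack to recover a monotonic alignment path. It is an illustration only, not Whisper's Triton or Numba code; the function name `dtw_path` and the list-of-lists input are assumptions made for this sketch.

```python
import math


def dtw_path(cost):
    """Return the cheapest monotonic path through `cost` as (row, col) pairs.

    Hypothetical helper for illustration; `cost` is a non-empty list of
    equally long rows of finite numbers.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    # acc[i][j]: best accumulated cost of aligning the first i rows with the first j columns.
    acc = [[math.inf] * (n_cols + 1) for _ in range(n_rows + 1)]
    acc[0][0] = 0.0
    # step[i][j]: chosen predecessor (0 = diagonal, 1 = previous row, 2 = previous column).
    step = [[-1] * (n_cols + 1) for _ in range(n_rows + 1)]

    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            candidates = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            best = min(range(3), key=candidates.__getitem__)
            acc[i][j] = cost[i - 1][j - 1] + candidates[best]
            step[i][j] = best

    # Walk back from the last cell to the first one.
    i, j = n_rows, n_cols
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        if step[i][j] == 0:
            i, j = i - 1, j - 1
        elif step[i][j] == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path
```

The double loop is what makes this step expensive on long inputs, which is the motivation for running it as a GPU kernel.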
Usage
Use this environment when word-level timestamps are requested on a CUDA GPU. The Triton kernels accelerate the DTW alignment and median filtering steps. If Triton is not available or fails to compile, Whisper falls back to Numba (for DTW) and PyTorch sort-based median (for filtering), with a warning.
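To make the sort-based median fallback concrete, here is a pure-Python sketch of a sliding median computed by sorting each window and taking the middle element. It is not Whisper's tensor code: the function name `median_filter_1d`, the reflect-style padding at the edges, and the early return for very short inputs are assumptions of this sketch.

```python
def median_filter_1d(values, filter_width):
    """Sliding median over a list, computed by sorting each window.

    Hypothetical helper for illustration; `filter_width` must be odd.
    """
    assert filter_width > 0 and filter_width % 2 == 1, "filter_width should be odd"
    values = list(values)
    pad = filter_width // 2
    if len(values) <= pad:
        # Too short to pad by reflection; return the input unchanged (sketch assumption).
        return values

    # Reflect padding: mirror the neighbours of each edge without repeating the edge sample.
    left = values[1 : pad + 1][::-1]
    right = values[-pad - 1 : -1][::-1]
    padded = left + values + right

    # The median of an odd-length window is the middle element after sorting.
    return [sorted(padded[i : i + filter_width])[pad] for i in range(len(values))]


print(median_filter_1d([1, 5, 2, 8, 3], 3))  # → [5, 2, 5, 3, 8]
```

Sorting every window is simple but does redundant work, which is why a dedicated kernel is faster when it is available.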
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (x86_64 only) | Triton is conditionally installed only on x86_64 Linux |
| Hardware | NVIDIA CUDA GPU | Required for Triton kernel execution |
| CUDA Toolkit | Compatible with installed PyTorch | Triton compiles PTX at runtime |
Dependencies
Python Packages
- `triton` >= 2.0 (conditional: x86_64 Linux only)
- `torch` with CUDA support
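Whether the `triton` condition applies to a given machine can be checked by hand. The sketch below mirrors the environment marker that `pyproject.toml` attaches to `triton`, using `platform.machine()` and `sys.platform`, which are the values the `platform_machine` and `sys_platform` markers are defined from. The function name `triton_marker_matches` is hypothetical, and the marker is transcribed manually rather than parsed from the file.

```python
import platform
import sys


def triton_marker_matches(machine: str, sys_platform: str) -> bool:
    """Hand-written mirror of the marker:
    (platform_machine=='x86_64' and sys_platform=='linux') or sys_platform=='linux2'
    """
    return (machine == "x86_64" and sys_platform == "linux") or sys_platform == "linux2"


# Evaluate the condition for the interpreter running this script.
print("triton expected here:", triton_marker_matches(platform.machine(), sys.platform))
```

On platforms where this returns `False`, installing `openai-whisper` would not be expected to pull in `triton`.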
Credentials
No credentials required.
Quick Install
# Automatically installed on x86_64 Linux via:
pip install openai-whisper
# Or install manually:
pip install "triton>=2.0"
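After installing, one way to confirm that the package is visible to the interpreter without importing it is `importlib.util.find_spec`. This is a minimal sketch; the helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    # find_spec only locates the module; it does not execute its code.
    return importlib.util.find_spec(name) is not None


print("triton importable:", module_available("triton"))
```

A `True` result only shows that the package is installed; kernel compilation can still fail at runtime if the GPU toolchain is not usable.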
Code Evidence
Conditional dependency in `pyproject.toml:31`:
"triton>=2; (platform_machine=='x86_64' and sys_platform=='linux') or sys_platform=='linux2'",
Import with error handling in `whisper/triton_ops.py:6-10`:
try:
    import triton
    import triton.language as tl
except ImportError:
    raise RuntimeError("triton import failed; try `pip install --pre triton`")
Graceful fallback for DTW from `whisper/timing.py:142-151`:
def dtw(x: torch.Tensor) -> np.ndarray:
    if x.is_cuda:
        try:
            return dtw_cuda(x)
        except (RuntimeError, subprocess.CalledProcessError):
            warnings.warn(
                "Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
                "falling back to a slower DTW implementation..."
            )
    return dtw_cpu(x.double().cpu().numpy())
Graceful fallback for median filter from `whisper/timing.py:36-45`:
if x.is_cuda:
    try:
        from .triton_ops import median_filter_cuda
        result = median_filter_cuda(x, filter_width)
    except (RuntimeError, subprocess.CalledProcessError):
        warnings.warn(
            "Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
            "falling back to a slower median kernel implementation..."
        )
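Both excerpts follow the same shape: try the fast path, catch `RuntimeError` and `subprocess.CalledProcessError`, warn, then continue with the slower path. The sketch below isolates that pattern with stand-in functions so it can be run without a GPU. `run_with_fallback`, `failing_fast_path`, and `slow_path` are hypothetical names, not part of Whisper.

```python
import subprocess
import warnings


def run_with_fallback(fast, slow, *args):
    """Call `fast`; on a launch or compile failure, warn and call `slow` instead."""
    try:
        return fast(*args)
    except (RuntimeError, subprocess.CalledProcessError):
        warnings.warn("fast path failed; falling back to a slower implementation")
    return slow(*args)


def failing_fast_path(x):
    # Stand-in for a GPU kernel that cannot be launched.
    raise RuntimeError("kernel launch failed")


def slow_path(x):
    # Stand-in for the slower implementation.
    return x + 1


print(run_with_fallback(failing_fast_path, slow_path, 1))  # → 2, after a warning
```

Only the two listed exception types trigger the fallback; any other exception raised by the fast path propagates to the caller.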
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: triton import failed; try pip install --pre triton` | Triton not installed when `triton_ops.py` is imported directly | `pip install "triton>=2.0"` (Linux x86_64 only) |
| `Failed to launch Triton kernels, likely due to missing CUDA toolkit` | CUDA toolkit not properly installed or incompatible | Install a matching CUDA toolkit, or ignore the warning (the slower fallback is used) |
| `RuntimeError` or `CalledProcessError` during kernel compilation | Triton compilation failure | Whisper catches the error and falls back to the slower implementation automatically; ensure the CUDA toolkit matches PyTorch |
Compatibility Notes
- Linux x86_64 only: Triton is not installed on macOS, Windows, or ARM Linux. This is handled at the package level via conditional dependency in `pyproject.toml`.
- Lazy import: Triton is only imported when GPU-accelerated DTW or median filtering is actually invoked (via `from .triton_ops import ...`). CPU-only workflows never trigger the import.
- Graceful degradation: Both `dtw()` and `median_filter()` catch `RuntimeError` and `CalledProcessError`, emit a warning, and fall back to the slower non-Triton implementations.
- DTW BLOCK_SIZE constraint: The CUDA DTW kernel requires that the number of text tokens (M) be less than the `BLOCK_SIZE` (default 1024). This is asserted in `dtw_cuda()`.
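One consequence of the last two notes, judging only from the excerpts quoted on this page: a failed `assert` raises `AssertionError`, which is not among the exception types the fallback catches, so an input with M at or above `BLOCK_SIZE` would surface as an error rather than a warning. The sketch below reproduces this with stand-ins; `fake_dtw_cuda`, `fake_dtw`, and `outcome` are hypothetical and only mimic the size check and the `except` clause.

```python
import subprocess
import warnings

BLOCK_SIZE = 1024  # default value quoted in the note above


def fake_dtw_cuda(num_rows: int) -> str:
    # Stand-in for the GPU path: only the size check is reproduced.
    assert num_rows < BLOCK_SIZE, "M should be smaller than BLOCK_SIZE"
    return "gpu"


def fake_dtw(num_rows: int) -> str:
    # Same except clause as the quoted fallback code.
    try:
        return fake_dtw_cuda(num_rows)
    except (RuntimeError, subprocess.CalledProcessError):
        warnings.warn("falling back to a slower DTW implementation")
    return "cpu"


def outcome(num_rows: int) -> str:
    """Report which path ran, or that the assertion escaped the fallback."""
    try:
        return fake_dtw(num_rows)
    except AssertionError:
        return "assertion propagated"


print(outcome(10))    # → gpu
print(outcome(1024))  # → assertion propagated
```

Note that `assert` statements are skipped when Python runs with `-O`, so the check in this sketch disappears in that mode.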