Heuristic:Rapidsai Cuml CUDA Kernel Caching
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Cache compiled CUDA kernels by parameter signature to avoid repeated JIT compilation overhead. Use named kernels for memoization; unnamed kernels bypass the cache.
Description
cuML uses Numba's CUDA JIT compiler for custom kernels (pairwise kernel functions, custom metrics). JIT compilation is expensive and can dominate runtime on first invocation. The codebase implements two complementary caching strategies: (1) a dictionary cache for pairwise kernel functions keyed by (func, kwds_tuple, X.dtype, Y.dtype), and (2) an LRU cache with maxsize=5000 on the cuda_kernel_factory() function for general CUDA kernels. The LRU cache stores up to 5000 unique kernel combinations, covering the variety of dtypes, block sizes, and function signatures encountered in a typical session.
Usage
Apply this heuristic when custom kernel operations (SVM, pairwise metrics, custom distance functions) show high first-call latency. The first call compiles the kernel; subsequent calls with the same parameters reuse the cached compiled version. For caching to take effect, pass a fixed kernel_name to cuda_kernel_factory(); passing None generates a UUID-based name that bypasses memoization.
The Insight (Rule of Thumb)
- Action: Always use named kernels (not kernel_name=None) when calling cuda_kernel_factory() to enable LRU caching.
- Value: LRU cache stores up to 5000 unique kernel signatures. The pairwise kernel cache has no size limit (dictionary).
- Trade-off: Cached kernels consume memory. The 5000-entry LRU limit prevents unbounded memory growth while covering typical usage patterns.
- First-call penalty: Expect higher latency on the first call with a new kernel configuration. Subsequent calls with the same configuration skip compilation and pay only the cache lookup.
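The trade-off between a bounded cache and recompilation can be seen with a deliberately tiny limit. This is a minimal sketch using Python's functools.lru_cache directly; build_kernel is a hypothetical stand-in, and the maxsize of 2 is chosen only to make eviction visible.

```python
import functools

compile_calls = 0


@functools.lru_cache(maxsize=2)  # tiny limit so eviction is easy to observe
def build_kernel(name, dtype):
    """Stand-in for a cached kernel factory; counts simulated compilations."""
    global compile_calls
    compile_calls += 1
    return f"<kernel {name}:{dtype}>"


build_kernel("k", "float32")   # miss: compiles
build_kernel("k", "float64")   # miss: compiles
build_kernel("k", "float32")   # hit: reused
build_kernel("k", "int32")     # miss: evicts the least recently used entry (float64)
build_kernel("k", "float64")   # miss again, because float64 was evicted

info = build_kernel.cache_info()
print(info)
print("compilations:", compile_calls)
```

cache_info() reports hits, misses, and the current size, which makes it a convenient way to check whether a workload is cycling through more configurations than the cache can hold.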
Reasoning
CUDA JIT compilation via Numba involves parsing Python functions, generating PTX (CUDA assembly), and loading the compiled module onto the GPU. This process can take tens to hundreds of milliseconds. For iterative algorithms that call the same kernel thousands of times (e.g., SVM training, iterative metric computation), caching avoids compiling the same kernel on every iteration. The 5000-entry LRU limit was chosen empirically to cover the practical variety of kernel configurations without unbounded memory growth.
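The amortization argument can be checked with a CPU-only simulation. In this sketch, get_kernel and timed_call are hypothetical names, and the 0.02 s compile cost is an assumed figure for illustration, not a measurement of Numba or cuML.

```python
import functools
import time

SIMULATED_COMPILE_SECONDS = 0.02  # assumed cost, for illustration only
compile_calls = 0


@functools.lru_cache(maxsize=5000)
def get_kernel(name, dtype):
    """Stand-in for a JIT-compiled kernel: slow to build, cheap to reuse."""
    global compile_calls
    compile_calls += 1
    time.sleep(SIMULATED_COMPILE_SECONDS)
    return lambda x: x * 2


def timed_call(name, dtype):
    """Return how long it takes to obtain the kernel for this configuration."""
    start = time.perf_counter()
    get_kernel(name, dtype)
    return time.perf_counter() - start


first_call = timed_call("scale", "float32")
repeat_calls = [timed_call("scale", "float32") for _ in range(1000)]

print(f"first call: {first_call * 1e3:.1f} ms")
print(f"slowest repeat: {max(repeat_calls) * 1e3:.3f} ms")
print(f"compilations: {compile_calls}")
```

Under the same assumed cost, recompiling on each of the 1000 iterations would add about 20 s in total, whereas the cached version pays the cost once.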
Code Evidence
Pairwise kernel cache from python/cuml/cuml/metrics/pairwise_kernels.py:139-176:
_kernel_cache = {}

def custom_kernel(X, Y, func, **kwds):
    kwds_tuple = _kwds_to_tuple_args(func, **kwds)
    # ...
    key = (func, kwds_tuple, X.dtype, Y.dtype)
    if key in _kernel_cache:
        compiled_kernel = _kernel_cache[key]
    else:
        compiled_kernel = cuda.jit(evaluate_pairwise_kernels)
        _kernel_cache[key] = compiled_kernel
LRU kernel factory cache from python/cuml/cuml/common/kernel_utils.py:40-104:
@functools.lru_cache(maxsize=5000)
def cuda_kernel_factory(func, dtypes, kernel_name=None):
    # kernel_name=None prevents memoization (uses UUID)
    # fixed kernel names enable caching
    ...
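One practical consequence of decorating a factory with functools.lru_cache is that every argument must be hashable. The sketch below uses a hypothetical kernel_factory with the same decorator to show that a tuple of dtypes can be cached while a list raises TypeError; it returns a label rather than a real kernel.

```python
import functools


@functools.lru_cache(maxsize=5000)
def kernel_factory(source, dtypes, kernel_name=None):
    """Hypothetical cached factory; returns a label instead of a real kernel."""
    return f"<{kernel_name}:{','.join(dtypes)}>"


# A tuple is hashable, so the call can be cached.
kernel_factory("src", ("float32", "int32"), kernel_name="my_kernel")

# A list is not hashable, so lru_cache raises before the function body runs.
try:
    kernel_factory("src", ["float32", "int32"], kernel_name="my_kernel")
    list_error = None
except TypeError as exc:
    list_error = exc

print(type(list_error).__name__, kernel_factory.cache_info().currsize)
```

lru_cache may also treat positional and keyword forms of the same argument as distinct calls, so passing kernel_name the same way at every call site avoids duplicate entries.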
Related Pages
No pages currently reference this heuristic via forward links.