Heuristic:Rapidsai Cuml CUDA Kernel Caching

Knowledge Sources	cuML JIT compilation overhead patterns
Domains	Optimization, GPU_Computing
Last Updated	2026-02-08 00:00 GMT

Overview

Cache compiled CUDA kernels by parameter signature to avoid repeated JIT compilation overhead. Use named kernels for memoization; unnamed kernels bypass the cache.

Description

cuML uses Numba's CUDA JIT compiler for custom kernels (pairwise kernel functions, custom metrics). JIT compilation is expensive and can dominate runtime on first invocation. The codebase implements two complementary caching strategies: (1) a dictionary cache for pairwise kernel functions keyed by (func, kwds_tuple, X.dtype, Y.dtype), and (2) an LRU cache with maxsize=5000 on the cuda_kernel_factory() function for general CUDA kernels. The LRU cache stores up to 5000 unique kernel combinations, covering the variety of dtypes, block sizes, and function signatures encountered in a typical session.

Usage

Apply this heuristic when experiencing slow first-call latency with custom kernel operations (SVM, pairwise metrics, custom distance functions). The first call compiles the kernel; subsequent calls with the same parameters reuse the cached compiled version. To ensure caching works, provide a fixed kernel_name to cuda_kernel_factory(); passing None generates a UUID-based name that bypasses memoization.
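One way to see why a generated name works against reuse: any cache keyed on the kernel's final name (or on source text that embeds it) never sees the same key twice when that name comes from a UUID. The sketch below is not cuML's implementation. resolve_kernel_name and the name "map_labels" are hypothetical, and it assumes the final name is built from the supplied name (or a fresh UUID) plus the dtype strings.

```python
import uuid


def resolve_kernel_name(kernel_name, dtype_strs):
    """Hypothetical helper: build a final kernel name from a base and dtypes."""
    # Assumption: a missing name is replaced by a freshly generated UUID.
    base = uuid.uuid4().hex if kernel_name is None else kernel_name
    return f"{base}_{'_'.join(dtype_strs)}"


fixed_a = resolve_kernel_name("map_labels", ["float", "int"])
fixed_b = resolve_kernel_name("map_labels", ["float", "int"])
anon_a = resolve_kernel_name(None, ["float", "int"])
anon_b = resolve_kernel_name(None, ["float", "int"])

print(fixed_a == fixed_b)  # True: a fixed name yields a stable key
print(anon_a == anon_b)    # False: each UUID-based name is unique
```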

The Insight (Rule of Thumb)

Action: Always use named kernels (not kernel_name=None) when calling cuda_kernel_factory() to enable LRU caching.
Value: The LRU cache stores up to 5000 unique kernel signatures. The pairwise kernel cache is a plain dictionary with no size limit.
Trade-off: Cached kernels consume memory. The 5000-entry LRU limit prevents unbounded memory growth while covering typical usage patterns.
First-call penalty: Expect higher latency on the first call with a new kernel configuration. Subsequent calls with the same configuration skip compilation and pay only for the cache lookup and the kernel launch.
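The bounded-cache trade-off can be observed with functools.lru_cache directly. The sketch below uses a hypothetical build_kernel function and a deliberately tiny maxsize=2 (the page cites 5000) so that eviction is visible. Note that lru_cache requires hashable arguments, so dtypes are passed as a tuple rather than a list.

```python
import functools

compile_count = 0


@functools.lru_cache(maxsize=2)  # tiny bound for illustration only
def build_kernel(source, dtypes, kernel_name):
    """Hypothetical factory: pretend to compile and count each compilation."""
    global compile_count
    compile_count += 1
    return f"compiled<{kernel_name}:{','.join(dtypes)}>"


build_kernel("src", ("float32",), "k")  # miss
build_kernel("src", ("float32",), "k")  # hit
build_kernel("src", ("float64",), "k")  # miss
build_kernel("src", ("int32",), "k")    # miss, evicts the float32 entry
build_kernel("src", ("float32",), "k")  # miss again: evicted, so recompiled

info = build_kernel.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 4 2
```

With a limit as large as 5000, eviction of this kind should be rare in practice, but the same mechanism applies once the limit is exceeded.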

Reasoning

CUDA JIT compilation via Numba involves analyzing the Python function, inferring types, generating PTX (NVIDIA's intermediate, assembly-like instruction set), and loading the compiled module onto the GPU. This process can take tens to hundreds of milliseconds. For iterative algorithms that call the same kernel thousands of times (e.g., SVM training, iterative metric computation), caching means the kernel is compiled once rather than on every iteration. The 5000-entry LRU limit was chosen empirically to cover the practical variety of kernel configurations without unbounded memory growth.
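A back-of-the-envelope calculation makes the amortization concrete. The figures are illustrative assumptions taken from the range stated above (100 ms per compilation, 1000 calls), not measurements, and compile_overhead_seconds is a hypothetical helper.

```python
def compile_overhead_seconds(compile_ms, calls, cached):
    """Total time spent compiling across `calls` invocations of one kernel."""
    # With a cache the kernel compiles once; without one it compiles every call.
    compilations = 1 if cached else calls
    return compile_ms * compilations / 1000


without_cache = compile_overhead_seconds(100, 1000, cached=False)
with_cache = compile_overhead_seconds(100, 1000, cached=True)
print(without_cache, with_cache)  # 100.0 0.1
```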

Code Evidence

Pairwise kernel cache from python/cuml/cuml/metrics/pairwise_kernels.py:139-176:

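from numba import cuda

# Module-level cache: one compiled kernel per (func, kwargs, dtypes) signature.
_kernel_cache = {}

def custom_kernel(X, Y, func, **kwds):
    # Convert kwargs to a tuple so they can form part of a dictionary key.
    kwds_tuple = _kwds_to_tuple_args(func, **kwds)
    # ...
    key = (func, kwds_tuple, X.dtype, Y.dtype)
    if key in _kernel_cache:
        # Cache hit: reuse the previously compiled kernel.
        compiled_kernel = _kernel_cache[key]
    else:
        # Cache miss: JIT-compile once and store the result for later calls.
        compiled_kernel = cuda.jit(evaluate_pairwise_kernels)
        _kernel_cache[key] = compiled_kernel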

LRU kernel factory cache from python/cuml/cuml/common/kernel_utils.py:40-104:

import functools

@functools.lru_cache(maxsize=5000)
def cuda_kernel_factory(func, dtypes, kernel_name=None):
    # kernel_name=None prevents memoization (a UUID-based name is generated)
    # A fixed kernel_name enables caching
    ...  # body omitted in this excerpt

Related Pages

No pages currently reference this heuristic via forward links.
