Heuristic:Rapidsai Cuml GPU Cache Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Align tree model data to cache-line boundaries (128 bytes GPU, 64 bytes CPU) and tune FIL chunk size for optimal inference throughput.
Description
The Forest Inference Library (FIL) in cuML provides multiple tuning knobs for inference performance. Tree data can be padded to align on cache-line boundaries, which ensures that memory reads begin at optimal addresses. On GPU, the typical cache line is 128 bytes; on CPU, it is 64 bytes. Additionally, FIL processes rows in chunks, and the optimal chunk size depends on the tree layout (depth-first vs breadth-first), model complexity, and hardware. The optimize() method auto-tunes both layout and chunk size by benchmarking candidate configurations.
Usage
Apply this heuristic when deploying Random Forest or boosted tree models for production inference using FIL. After loading a model, call ForestInference.optimize() to automatically find the best layout and chunk size. For manual tuning, set align_bytes=128 on GPU or align_bytes=64 on CPU.
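A minimal sketch of the auto-tuning step, assuming `model` is an already loaded FIL model. The helper name `tune_for_inference` is hypothetical. The `batch_size` and `timeout` keyword arguments are taken from the `optimize()` signature quoted under Code Evidence. The model is passed in rather than loaded here, so the helper works with any object that exposes `optimize()` and `predict()`.

```python
def tune_for_inference(model, batch, *, timeout=0.2):
    """Auto-tune a loaded FIL model, then run one prediction.

    Assumption: `model` exposes optimize() and predict() in the way
    ForestInference does. This helper is illustrative, not cuML API.
    """
    # Tune for the batch size expected in production. optimize() searches
    # over layout and chunk size, targeting `timeout` seconds for the search.
    model.optimize(batch_size=len(batch), timeout=timeout)
    # Later predictions reuse the configuration selected above.
    return model.predict(batch)
```

Tuning once at startup and reusing the model for all later requests keeps the tuning cost out of the steady-state path.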
The Insight (Rule of Thumb)
- Action: Call ForestInference.optimize() after loading a model to auto-tune inference parameters.
- Value: Set align_bytes=128 for GPU inference, align_bytes=64 for CPU inference. GPU chunk sizes should be powers of 2 in the range 1-32; CPU chunk sizes can go up to 512.
- Trade-off: Alignment padding increases model memory footprint slightly but improves memory access throughput. The optimize method adds startup latency (default 0.2s timeout) but yields faster steady-state inference.
- Layout Choice: 'depth_first' is the default and generally good. 'breadth_first' may be better for very shallow trees. The optimizer tests both.
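For manual tuning, the chunk-size rule above can be turned into a list of candidates to try. The sketch below is a hypothetical helper, not part of cuML; it assumes power-of-2 candidates on both devices, with the upper limits (32 on GPU, 512 on CPU) taken from the rule of thumb.

```python
def candidate_chunk_sizes(device, max_chunk_size=None):
    """Return the power-of-2 chunk sizes worth trying for a device.

    Hypothetical helper: the limits of 32 (GPU) and 512 (CPU) follow the
    rule of thumb in this page and are not read from cuML itself.
    """
    limits = {"gpu": 32, "cpu": 512}
    if device not in limits:
        raise ValueError(f"device must be 'gpu' or 'cpu', got {device!r}")
    limit = limits[device]
    # An optional user-supplied cap narrows the search further.
    if max_chunk_size is not None:
        limit = min(limit, max_chunk_size)
    sizes = []
    size = 1
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


print(candidate_chunk_sizes("gpu"))  # [1, 2, 4, 8, 16, 32]
```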
Reasoning
Modern GPUs and CPUs transfer memory in units of cache lines. When a read of node data straddles a cache-line boundary, two memory transactions are required instead of one. Padding each tree so that its in-memory size is a multiple of the cache-line size ensures that, given an aligned start for the first tree, every tree begins on a cache-line boundary. The chunk size determines how a batch of rows is subdivided for parallel processing: too small a chunk fails to amortize per-chunk overhead, while too large a chunk may exceed available shared memory or reduce occupancy. The optimize() method benchmarks configurations empirically because the best setting depends on the specific model structure, batch size, and hardware.
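The alignment arithmetic can be checked with a small sketch. The helper names and the tree sizes below are hypothetical; the code models only byte offsets, not FIL's actual storage format.

```python
def padded_size(size_bytes, align_bytes):
    """Round size_bytes up to the next multiple of align_bytes (0 disables padding)."""
    if align_bytes == 0:
        return size_bytes
    return -(-size_bytes // align_bytes) * align_bytes  # ceiling division


def cache_lines_touched(offset, length, line_bytes):
    """Count the cache lines covered by a read of `length` > 0 bytes at `offset`."""
    first = offset // line_bytes
    last = (offset + length - 1) // line_bytes
    return last - first + 1


def tree_offsets(tree_sizes, align_bytes):
    """Start offset of each tree when padded trees are stored back to back."""
    offsets, cursor = [], 0
    for size in tree_sizes:
        offsets.append(cursor)
        cursor += padded_size(size, align_bytes)
    return offsets


sizes = [200, 72, 328]  # hypothetical tree sizes in bytes
print(tree_offsets(sizes, 0))    # [0, 200, 272]: later trees start mid-line
print(tree_offsets(sizes, 128))  # [0, 256, 384]: every tree starts on a line
print(cache_lines_touched(120, 16, 128))  # 2: the read straddles a boundary
print(cache_lines_touched(128, 16, 128))  # 1: the read starts on a boundary
```

In this example the padded forest occupies 768 bytes instead of 600, which is the memory cost paid for aligned tree starts.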
Code Evidence
Alignment parameter from python/cuml/cuml/ensemble/randomforestclassifier.py:378-386:
align_bytes : int
    If specified, trees will be padded such that their in-memory size
    is a multiple of this value. This can improve performance by
    guaranteeing that memory reads from trees begin on a cache line
    boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
FIL optimize method from python/cuml/cuml/fil/fil.pyx:1216-1259:
def optimize(self, *, data=None, batch_size=1024, unique_batches=10,
             timeout=0.2, predict_method='predict', max_chunk_size=None, seed=0):
    """Find the optimal layout and chunk size for this model.
    The optimal value for layout and chunk size depends on the model,
    batch size, and available hardware. After finding the optimal layout,
    the model will be reloaded if necessary."""
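The general idea of tuning by benchmarking can be sketched as follows. This is not a reproduction of cuML's implementation: `pick_fastest`, its even split of the time budget across candidates, and the injectable `timer` are all assumptions made for illustration.

```python
import itertools
import time


def pick_fastest(run, layouts, chunk_sizes, timeout=0.2, timer=time.perf_counter):
    """Time run(layout, chunk_size) per candidate and return the fastest pair.

    Hypothetical sketch of an empirical tuner; `run` is any callable that
    performs one inference pass with the given configuration.
    """
    candidates = list(itertools.product(layouts, chunk_sizes))
    if not candidates:
        raise ValueError("at least one layout and one chunk size are required")
    # Assumption: the time budget is split evenly across all candidates.
    budget = timeout / len(candidates)
    best, best_time = None, float("inf")
    for layout, chunk_size in candidates:
        fastest = float("inf")
        deadline = timer() + budget
        # Run at least once, then repeat until this candidate's budget is spent,
        # keeping the fastest single run to reduce the effect of noise.
        while True:
            start = timer()
            run(layout, chunk_size)
            fastest = min(fastest, timer() - start)
            if timer() >= deadline:
                break
        if fastest < best_time:
            best, best_time = (layout, chunk_size), fastest
    return best
```

Keeping the minimum rather than the mean is a deliberate choice here: timing noise only ever makes a run slower, so the fastest observation is the most stable estimate.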
GPU introspection for shared memory from cpp/include/cuml/fil/detail/gpu_introspection.hpp:25-30:
inline auto max_shared_mem_per_block(int device = 0)
{
  auto result = int{};
  RAFT_CUDA_TRY(cudaDeviceGetAttribute(
    &result, cudaDevAttrMaxSharedMemoryPerBlockOptin, device));
  return result;
}
Related Pages
No pages currently reference this heuristic via forward links.