Heuristic:Rapidsai Cuml GPU Cache Alignment

From Leeroopedia

Knowledge Sources
Domains: Optimization, GPU_Computing
Last Updated: 2026-02-08 00:00 GMT

Overview

Align tree model data to cache-line boundaries (128 bytes GPU, 64 bytes CPU) and tune FIL chunk size for optimal inference throughput.

Description

The Forest Inference Library (FIL) in cuML exposes several tuning knobs for inference performance. Tree data can be padded so that each tree's in-memory size is a multiple of the cache-line size, which lets memory reads from every tree begin on a cache-line boundary. On GPU, the typical cache line is 128 bytes; on CPU, it is 64 bytes. FIL also processes rows in chunks, and the best chunk size depends on the tree layout (depth-first vs. breadth-first), model complexity, batch size, and hardware. The optimize() method auto-tunes both layout and chunk size by benchmarking candidate configurations.
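
The padding arithmetic described above can be sketched in a few lines of Python. This is only an illustration of rounding sizes up to a multiple of align_bytes, not FIL's actual implementation; the helper names padded_size and tree_offsets and the example tree sizes are hypothetical.

```python
def padded_size(size_bytes, align_bytes):
    """Round size_bytes up to the next multiple of align_bytes.

    An align_bytes of 0 (or None) means "no padding", matching the
    "0 or 128" / "0 or 64" typical values quoted in Code Evidence.
    """
    if not align_bytes:
        return size_bytes
    # Ceiling division written with integer arithmetic only.
    return -(-size_bytes // align_bytes) * align_bytes


def tree_offsets(tree_sizes, align_bytes):
    """Byte offset at which each tree starts when trees are stored back to back.

    Assumes the buffer itself starts on an aligned address (offset 0).
    """
    offsets, cursor = [], 0
    for size in tree_sizes:
        offsets.append(cursor)
        cursor += padded_size(size, align_bytes)
    return offsets


# Hypothetical tree sizes in bytes.
sizes = [200, 100, 300]
unpadded = tree_offsets(sizes, 0)    # trees start wherever the previous one ended
padded = tree_offsets(sizes, 128)    # every start offset is a multiple of 128
```

With padding, every start offset is a multiple of 128, at the cost of a somewhat larger buffer.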

Usage

Apply this heuristic when deploying Random Forest or boosted tree models for production inference using FIL. After loading a model, call ForestInference.optimize() to automatically find the best layout and chunk size. For manual tuning, set align_bytes=128 on GPU or align_bytes=64 on CPU.
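
A minimal sketch of this workflow is shown below. It assumes that ForestInference can be imported from cuml.fil and that its load() method accepts align_bytes as a keyword; check both against the installed cuML version. The keyword names passed to optimize() follow the signature quoted in Code Evidence. The helper names and the model path are hypothetical, and selecting the execution device is not shown.

```python
def default_align_bytes(device_type):
    """Alignment suggested by this heuristic: 128 bytes on GPU, 64 on CPU."""
    return {"gpu": 128, "cpu": 64}[device_type]


def load_tuned_model(model_path, device_type="gpu"):
    """Load a tree model with cache-line alignment, then auto-tune it.

    Hypothetical helper; requires cuML to be installed when called.
    """
    # Imported lazily so the helper above is usable without cuML.
    from cuml.fil import ForestInference

    # Assumption: load() takes align_bytes as a keyword argument.
    model = ForestInference.load(
        model_path,
        align_bytes=default_align_bytes(device_type),
    )
    # batch_size and timeout are the defaults from the quoted signature.
    model.optimize(batch_size=1024, timeout=0.2)
    return model


# Example call (the path is a placeholder, not a real file):
# model = load_tuned_model("model.ubj", device_type="gpu")
```

Passing a batch_size close to the batch size expected in production is likely to give the most relevant tuning result, since the quoted docstring notes that the optimum depends on batch size.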

The Insight (Rule of Thumb)

  • Action: Call ForestInference.optimize() after loading a model to auto-tune inference parameters.
  • Value: Set align_bytes=128 for GPU inference, align_bytes=64 for CPU inference. GPU chunk sizes should be powers of 2 in range 1-32; CPU chunk sizes can go up to 512.
  • Trade-off: Alignment padding increases model memory footprint slightly but improves memory access throughput. The optimize method adds startup latency (default 0.2s timeout) but yields faster steady-state inference.
  • Layout Choice: 'depth_first' is default and generally good. 'breadth_first' may be better for very shallow trees. The optimizer tests both.
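
The chunk-size ranges in the list above can be enumerated as a small search space. The function below is a hypothetical helper written for illustration, not part of cuML; it lists the powers of 2 up to 32 for GPU and up to 512 for CPU, optionally capped by a max_chunk_size in the spirit of the optimize() parameter of the same name.

```python
def candidate_chunk_sizes(device_type, max_chunk_size=None):
    """Powers of 2 to try as chunk sizes for the given device type."""
    limits = {"gpu": 32, "cpu": 512}
    if device_type not in limits:
        raise ValueError(f"unknown device type: {device_type!r}")
    limit = limits[device_type]
    if max_chunk_size is not None:
        limit = min(limit, max_chunk_size)
    sizes, size = [], 1
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


gpu_sizes = candidate_chunk_sizes("gpu")
cpu_sizes = candidate_chunk_sizes("cpu")
```

Combined with the two layouts, this gives at most 12 configurations on GPU and 20 on CPU, which is small enough to benchmark exhaustively.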

Reasoning

Modern GPUs and CPUs use cache lines as the fundamental unit of memory transfer. When tree traversal reads node data that straddles a cache-line boundary, two memory transactions are required instead of one. Padding each tree to a multiple of the cache-line size means that, as long as the underlying buffer is itself aligned, every tree starts on a cache-line boundary. The chunk size determines how many rows are processed together in each step of the inference algorithm: too small a chunk leaves fixed per-step overhead unamortized, while too large a chunk may exhaust shared memory or reduce occupancy. The optimize() method benchmarks configurations empirically because the optimum depends on the specific model structure and hardware.
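
The cost of a misaligned read can be made concrete by counting how many cache lines a read touches. The sketch below is a simplified model (one transaction per cache line touched) and the function name is hypothetical; real hardware may split transfers differently.

```python
def cache_lines_touched(offset, size, line_bytes):
    """Number of cache lines covered by a read of `size` bytes at `offset`."""
    first_line = offset // line_bytes
    last_line = (offset + size - 1) // line_bytes
    return last_line - first_line + 1


# A 16-byte node read on a GPU with 128-byte cache lines:
aligned_read = cache_lines_touched(128, 16, 128)     # starts on a boundary
straddling_read = cache_lines_touched(120, 16, 128)  # crosses byte 128
```

In this model the aligned read stays within a single line, while the read starting at byte 120 spans two lines and therefore needs two transactions.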

Code Evidence

Alignment parameter from python/cuml/cuml/ensemble/randomforestclassifier.py:378-386:

align_bytes : int
    If specified, trees will be padded such that their in-memory size
    is a multiple of this value. This can improve performance by
    guaranteeing that memory reads from trees begin on a cache line
    boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.

FIL optimize method from python/cuml/cuml/fil/fil.pyx:1216-1259:

def optimize(self, *, data=None, batch_size=1024, unique_batches=10,
             timeout=0.2, predict_method='predict', max_chunk_size=None, seed=0):
    """Find the optimal layout and chunk size for this model.
    The optimal value for layout and chunk size depends on the model,
    batch size, and available hardware. After finding the optimal layout,
    the model will be reloaded if necessary."""
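
The general idea of benchmarking candidates under a time budget can be sketched as follows. This is not the FIL implementation: pick_fastest, its even split of the timeout across candidates, and the injectable clock are all assumptions made for illustration.

```python
import time


def pick_fastest(candidates, run_once, timeout=0.2, clock=time.perf_counter):
    """Return the candidate with the lowest average run_once() time.

    The timeout (seconds) is split evenly across candidates, and each
    candidate is run at least once even if a single run exceeds its share.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("need at least one candidate")
    budget = timeout / len(candidates)
    best, best_avg = None, float("inf")
    for candidate in candidates:
        runs, elapsed = 0, 0.0
        while runs == 0 or elapsed < budget:
            start = clock()
            run_once(candidate)
            elapsed += clock() - start
            runs += 1
        average = elapsed / runs
        if average < best_avg:
            best, best_avg = candidate, average
    return best


# Deterministic demo with a fake clock: each (layout, chunk_size) pair has a
# made-up cost in seconds, and "running" it just advances the fake clock.
fake_now = [0.0]
fake_cost = {
    ("depth_first", 8): 0.03,
    ("breadth_first", 8): 0.01,
    ("depth_first", 16): 0.02,
}


def fake_run(config):
    fake_now[0] += fake_cost[config]


best_config = pick_fastest(fake_cost, fake_run, timeout=0.3,
                           clock=lambda: fake_now[0])
```

In real use, run_once would call the model's prediction method on a representative batch with the given layout and chunk size, and the default clock would measure wall time.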

GPU introspection for shared memory from cpp/include/cuml/fil/detail/gpu_introspection.hpp:25-30:

inline auto max_shared_mem_per_block(int device = 0)
{
  auto result = int{};
  RAFT_CUDA_TRY(cudaDeviceGetAttribute(
    &result, cudaDevAttrMaxSharedMemoryPerBlockOptin, device));
  return result;
}

Related Pages

No pages currently reference this heuristic via forward links.
