Heuristic: RAPIDS cuML Batch Size vs. Memory Trade-off
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Use batch-subdivision parameters (`max_mbytes_per_batch`, `max_batch_size`, `batch_size = 5 * n_features`) to trade runtime for reduced GPU memory usage when encountering OOM errors.
Description
Several cuML algorithms have O(N^2) or otherwise large memory footprints that can exhaust GPU VRAM. The codebase provides explicit batch-sizing parameters that subdivide computation into smaller chunks. DBSCAN uses `max_mbytes_per_batch` to cap pairwise distance computation memory. Random Forest uses `max_batch_size=4096` to limit nodes processed per GPU batch. IncrementalPCA defaults to `batch_size = 5 * n_features` as an empirically determined balance between accuracy and memory. FIL (Forest Inference Library) uses `chunk_size` to control rows per inference batch (GPU: powers of 2 from 1 to 32; CPU: up to 512).
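For DBSCAN specifically, the memory cap can be chosen from first principles: an N × N float32 distance matrix needs N² × 4 bytes, and capping the batch at M MiB limits how many rows of that matrix are materialized at once. A back-of-the-envelope sketch (the helper names `pairwise_distance_mib` and `rows_per_batch` are ours, not part of cuML):

```python
def pairwise_distance_mib(n_samples: int, dtype_bytes: int = 4) -> float:
    """MiB needed for a dense n_samples x n_samples float32 distance matrix."""
    return n_samples * n_samples * dtype_bytes / (1024 ** 2)

def rows_per_batch(max_mbytes_per_batch: int, n_samples: int,
                   dtype_bytes: int = 4) -> int:
    """Rows whose distance block (rows x n_samples) fits under the MiB cap."""
    return max(1, (max_mbytes_per_batch * 1024 ** 2) // (n_samples * dtype_bytes))

# 100k samples would need ~38 GiB for the full matrix -- beyond most GPUs --
# but a 4096 MiB cap still processes over 10k rows per batch.
full_mib = pairwise_distance_mib(100_000)   # ~38146.97 MiB
rows = rows_per_batch(4096, 100_000)        # 10737 rows per batch
```

This also makes the trade-off concrete: halving the cap roughly halves the rows per batch, doubling the number of distance-kernel launches.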
Usage
Apply this heuristic when you encounter CUDA out-of-memory errors during DBSCAN clustering, Random Forest training, incremental PCA, or FIL inference. Start by reducing the batch-size parameter for the affected algorithm, then profile memory usage to find the best runtime/memory trade-off.
The Insight (Rule of Thumb)
- Action (DBSCAN): Set `max_mbytes_per_batch` based on your GPU memory. For a 16 GB GPU, try 4096 or 8192 MiB.
- Action (Random Forest): Reduce `max_batch_size` from its default of 4096 if OOM occurs during tree building.
- Action (IncrementalPCA): The default `batch_size = 5 * n_features` provides a good accuracy/memory balance. Reduce it for very high-dimensional data.
- Action (FIL): Use `ForestInference.optimize()` to auto-tune `chunk_size` and `layout` for your model and hardware.
- Trade-off: Smaller batches reduce peak memory but increase runtime due to more kernel launches and reduced parallelism.
- Note: `max_mbytes_per_batch` does NOT cap total memory usage; it only limits the pairwise distance computation portion.
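When the OOM threshold is unknown in advance, a simple halving strategy finds a workable batch size in a few retries. A minimal sketch of that loop (the `fit_with_backoff` helper is ours, and Python's `MemoryError` stands in for the CUDA OOM exception a real cuML call would raise):

```python
def fit_with_backoff(fit_fn, initial_batch_size: int, min_batch_size: int = 1):
    """Retry fit_fn with a halved batch size each time it raises MemoryError."""
    batch_size = initial_batch_size
    while True:
        try:
            return fit_fn(batch_size), batch_size
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # even the smallest batch does not fit
            batch_size //= 2

# Stand-in for a cuML fit call: pretend batches above 1000 rows exhaust VRAM.
def fake_fit(batch_size):
    if batch_size > 1000:
        raise MemoryError
    return "fitted"

result, used = fit_with_backoff(fake_fit, 4096)  # settles at batch_size=512
```

Halving converges quickly (log2 of the starting size attempts at worst), at the cost of a few wasted trial runs.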
Reasoning
GPU memory is a hard constraint: exceeding VRAM causes an immediate crash (CUDA OOM). Unlike CPU-based systems that can swap to disk, GPU algorithms must fit their working set entirely in VRAM. Batch subdivision allows trading compute time (more kernel launches, less parallelism) for reduced peak memory. The specific default values (4096 nodes for RF, 5x features for IPCA) are empirically tuned across representative workloads.
Code Evidence
DBSCAN batch memory control, from `python/cuml/cuml/dask/cluster/dbscan.py:38-47`:

```
max_mbytes_per_batch : (optional) int64
    Calculate batch size using no more than this number of megabytes for
    the pairwise distance computation. This enables the trade-off between
    runtime and memory usage for making the N^2 pairwise distance
    computations more tractable for large numbers of samples.
    If you are experiencing out of memory errors when running DBSCAN, you
    can set this value based on the memory size of your device.
    Note: this option does not set the maximum total memory used in the
    DBSCAN computation.
```
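The batching the docstring describes can be mimicked in plain Python: compute the distance matrix one block of rows at a time, so only a `rows_per_batch × N` slab is live during each step, while the final result is unchanged. A toy sketch (helper names are ours, not cuML's; the toy collects all rows for comparison, whereas the real algorithm consumes and frees each block):

```python
import math

def pairwise_full(points):
    # Full N x N distance matrix: all N*N entries materialized at once.
    return [[math.dist(a, b) for b in points] for a in points]

def pairwise_batched(points, rows_per_batch):
    # Same matrix, built rows_per_batch rows at a time.
    out = []
    for start in range(0, len(points), rows_per_batch):
        for a in points[start:start + rows_per_batch]:
            out.append([math.dist(a, b) for b in points])
    return out

pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
assert pairwise_batched(pts, 1) == pairwise_full(pts)  # identical results
```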
IncrementalPCA default batch size, from `python/cuml/cuml/decomposition/incremental_pca.py:236-239`:

```python
if self.batch_size_ is None:
    self.batch_size_ = 5 * n_features
```
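The consequence of this default is easy to compute: with `batch_size = 5 * n_features`, the number of partial-fit passes scales as `n_samples / (5 * n_features)`. A small sketch (the helper names are ours, not cuML's):

```python
import math

def default_batch_size(n_features: int) -> int:
    # Mirrors cuML's IncrementalPCA default: batch_size = 5 * n_features.
    return 5 * n_features

def n_batches(n_samples: int, batch_size: int) -> int:
    # Number of partial-fit passes needed to cover the dataset.
    return math.ceil(n_samples / batch_size)

bs = default_batch_size(300)        # 1500 rows per batch
passes = n_batches(1_000_000, bs)   # 667 passes over 1M samples
```

For very wide data the default grows proportionally (e.g. 50k features implies 250k-row batches), which is when reducing it manually becomes necessary.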
Random Forest batch limit, from `python/cuml/cuml/ensemble/randomforestclassifier.py:116-117`:

```
max_batch_size : int (default = 4096)
    Maximum number of nodes that can be processed in a given batch.
```
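The effect of this cap is easiest to see by splitting a tree level's node frontier into batches: wide levels are processed in several passes instead of one large allocation. A toy sketch (the `node_batches` helper is ours, not cuML's):

```python
def node_batches(frontier_nodes: int, max_batch_size: int = 4096):
    """Split a level's node frontier into GPU-sized batches."""
    return [min(max_batch_size, frontier_nodes - start)
            for start in range(0, frontier_nodes, max_batch_size)]

node_batches(10_000)        # [4096, 4096, 1808] -- three passes
node_batches(10_000, 1024)  # ten passes: less memory, more kernel launches
```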
FIL chunk size optimization, from `python/cuml/cuml/fil/fil.pyx:1216-1259`:

```python
def optimize(self, *, data=None, batch_size=1024, unique_batches=10,
             timeout=0.2, predict_method='predict', max_chunk_size=None, seed=0):
    """Find the optimal layout and chunk size for this model.

    The optimal value for layout and chunk size depends on the model,
    batch size, and available hardware."""
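In essence, `optimize()` times predictions over a grid of candidate settings and keeps the fastest. A simplified stand-in (helper names are ours; we assume the candidates are the power-of-two chunk sizes described above, and `time_fn` stands in for timing real predict calls):

```python
def chunk_candidates(max_chunk_size: int):
    """Power-of-two chunk sizes up to the cap (GPU cap 32, CPU cap 512)."""
    out, c = [], 1
    while c <= max_chunk_size:
        out.append(c)
        c *= 2
    return out

def pick_fastest(candidates, time_fn):
    """Return the candidate with the lowest measured runtime."""
    return min(candidates, key=time_fn)

gpu_cands = chunk_candidates(32)   # [1, 2, 4, 8, 16, 32]
cpu_cands = chunk_candidates(512)  # ten candidates, 1 through 512
best = pick_fastest(gpu_cands, lambda c: abs(c - 8))  # toy cost: 8 is "fastest"
```

The real method additionally tunes the tree `layout` and averages over several unique batches; the principle, exhaustive timing over a small discrete search space, is the same.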