Heuristic: RAPIDS cuML Batch Size vs. Memory Trade-off
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Use batch-subdivision parameters (`max_mbytes_per_batch`, `max_batch_size`, `batch_size = 5 * n_features`) to trade runtime for reduced GPU memory usage when encountering OOM errors.
Description
Several cuML algorithms have O(N^2) or otherwise large memory footprints that can exhaust GPU VRAM. The codebase provides explicit batch-sizing parameters that subdivide computation into smaller chunks. DBSCAN uses `max_mbytes_per_batch` to cap pairwise distance computation memory. Random Forest uses `max_batch_size=4096` to limit nodes processed per GPU batch. IncrementalPCA defaults to `batch_size = 5 * n_features` as an empirically determined balance between accuracy and memory. FIL (Forest Inference Library) uses `chunk_size` to control rows per inference batch (GPU: powers of 2 from 1 to 32; CPU: up to 512).
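For DBSCAN specifically, the memory cap can be chosen from first principles: an N × N float32 distance matrix needs N² × 4 bytes, and capping the batch at M MiB limits how many rows of that matrix are materialized at once. A back-of-the-envelope sketch (the helper names `pairwise_distance_mib` and `rows_per_batch` are ours, not part of cuML):

```python
def pairwise_distance_mib(n_samples: int, dtype_bytes: int = 4) -> float:
    """MiB needed for a dense n_samples x n_samples float32 distance matrix."""
    return n_samples * n_samples * dtype_bytes / (1024 ** 2)

def rows_per_batch(max_mbytes_per_batch: int, n_samples: int,
                   dtype_bytes: int = 4) -> int:
    """Rows whose distance block (rows x n_samples) fits under the MiB cap."""
    return max(1, (max_mbytes_per_batch * 1024 ** 2) // (n_samples * dtype_bytes))

# 100k samples would need ~38 GiB for the full matrix -- beyond most GPUs --
# but a 4096 MiB cap still processes over 10k rows per batch.
full_mib = pairwise_distance_mib(100_000)   # ~38146.97 MiB
rows = rows_per_batch(4096, 100_000)        # 10737 rows per batch
```

This also makes the trade-off concrete: halving the cap roughly halves the rows per batch, doubling the number of distance-kernel launches.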
Usage
Apply this heuristic when you encounter CUDA out-of-memory errors during DBSCAN clustering, Random Forest training, incremental PCA, or FIL inference. Start by reducing the batch-size parameter for the affected algorithm, then profile memory usage to find the best runtime/memory trade-off.
The Insight (Rule of Thumb)
- Action (DBSCAN): Set `max_mbytes_per_batch` based on your GPU memory. For a 16 GB GPU, try 4096 or 8192 MiB.
- Action (Random Forest): Reduce `max_batch_size` from its default of 4096 if OOM occurs during tree building.
- Action (IncrementalPCA): The default `batch_size = 5 * n_features` provides a good accuracy/memory balance. Reduce it for very high-dimensional data.
- Action (FIL): Use `ForestInference.optimize()` to auto-tune `chunk_size` and `layout` for your model and hardware.
- Trade-off: Smaller batches reduce peak memory but increase runtime due to more kernel launches and reduced parallelism.
- Note: `max_mbytes_per_batch` does NOT cap total memory usage; it only limits the pairwise distance computation portion.
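When the OOM threshold is unknown in advance, a simple halving strategy finds a workable batch size in a few retries. A minimal sketch of that loop (the `fit_with_backoff` helper is ours, and Python's `MemoryError` stands in for the CUDA OOM exception a real cuML call would raise):

```python
def fit_with_backoff(fit_fn, initial_batch_size: int, min_batch_size: int = 1):
    """Retry fit_fn with a halved batch size each time it raises MemoryError."""
    batch_size = initial_batch_size
    while True:
        try:
            return fit_fn(batch_size), batch_size
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # even the smallest batch does not fit
            batch_size //= 2

# Stand-in for a cuML fit call: pretend batches above 1000 rows exhaust VRAM.
def fake_fit(batch_size):
    if batch_size > 1000:
        raise MemoryError
    return "fitted"

result, used = fit_with_backoff(fake_fit, 4096)  # settles at batch_size=512
```

Halving converges quickly (log2 of the starting size attempts at worst), at the cost of a few wasted trial runs.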
Reasoning
GPU memory is a hard constraint: exceeding VRAM causes an immediate crash (CUDA OOM). Unlike CPU-based systems that can swap to disk, GPU algorithms must fit their working set entirely in VRAM. Batch subdivision allows trading compute time (more kernel launches, less parallelism) for reduced peak memory. The specific default values (4096 nodes for RF, 5x features for IPCA) are empirically tuned across representative workloads.
Code Evidence
DBSCAN batch memory control, from `python/cuml/cuml/dask/cluster/dbscan.py:38-47`:

```
max_mbytes_per_batch : (optional) int64
    Calculate batch size using no more than this number of megabytes for
    the pairwise distance computation. This enables the trade-off between
    runtime and memory usage for making the N^2 pairwise distance
    computations more tractable for large numbers of samples.
    If you are experiencing out of memory errors when running DBSCAN, you
    can set this value based on the memory size of your device.
    Note: this option does not set the maximum total memory used in the
    DBSCAN computation.
```
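The batching the docstring describes can be mimicked in plain Python: compute the distance matrix one block of rows at a time, so only a `rows_per_batch × N` slab is live during each step, while the final result is unchanged. A toy sketch (helper names are ours, not cuML's; the toy collects all rows for comparison, whereas the real algorithm consumes and frees each block):

```python
import math

def pairwise_full(points):
    # Full N x N distance matrix: all N*N entries materialized at once.
    return [[math.dist(a, b) for b in points] for a in points]

def pairwise_batched(points, rows_per_batch):
    # Same matrix, built rows_per_batch rows at a time.
    out = []
    for start in range(0, len(points), rows_per_batch):
        for a in points[start:start + rows_per_batch]:
            out.append([math.dist(a, b) for b in points])
    return out

pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
assert pairwise_batched(pts, 1) == pairwise_full(pts)  # identical results
```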
IncrementalPCA default batch size, from `python/cuml/cuml/decomposition/incremental_pca.py:236-239`:

```python
if self.batch_size_ is None:
    self.batch_size_ = 5 * n_features
```
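The consequence of this default is easy to compute: with `batch_size = 5 * n_features`, the number of partial-fit passes scales as `n_samples / (5 * n_features)`. A small sketch (the helper names are ours, not cuML's):

```python
import math

def default_batch_size(n_features: int) -> int:
    # Mirrors cuML's IncrementalPCA default: batch_size = 5 * n_features.
    return 5 * n_features

def n_batches(n_samples: int, batch_size: int) -> int:
    # Number of partial-fit passes needed to cover the dataset.
    return math.ceil(n_samples / batch_size)

bs = default_batch_size(300)        # 1500 rows per batch
passes = n_batches(1_000_000, bs)   # 667 passes over 1M samples
```

For very wide data the default grows proportionally (e.g. 50k features implies 250k-row batches), which is when reducing it manually becomes necessary.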
Random Forest batch limit, from `python/cuml/cuml/ensemble/randomforestclassifier.py:116-117`:

```
max_batch_size : int (default = 4096)
    Maximum number of nodes that can be processed in a given batch.
```
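The effect of this cap is easiest to see by splitting a tree level's node frontier into batches: wide levels are processed in several passes instead of one large allocation. A toy sketch (the `node_batches` helper is ours, not cuML's):

```python
def node_batches(frontier_nodes: int, max_batch_size: int = 4096):
    """Split a level's node frontier into GPU-sized batches."""
    return [min(max_batch_size, frontier_nodes - start)
            for start in range(0, frontier_nodes, max_batch_size)]

node_batches(10_000)        # [4096, 4096, 1808] -- three passes
node_batches(10_000, 1024)  # ten passes: less memory, more kernel launches
```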
FIL chunk size optimization, from `python/cuml/cuml/fil/fil.pyx:1216-1259`:

```python
def optimize(self, *, data=None, batch_size=1024, unique_batches=10,
             timeout=0.2, predict_method='predict', max_chunk_size=None, seed=0):
    """Find the optimal layout and chunk size for this model.

    The optimal value for layout and chunk size depends on the model,
    batch size, and available hardware."""
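In essence, `optimize()` times predictions over a grid of candidate settings and keeps the fastest. A simplified stand-in (helper names are ours; we assume the candidates are the power-of-two chunk sizes described above, and `time_fn` stands in for timing real predict calls):

```python
def chunk_candidates(max_chunk_size: int):
    """Power-of-two chunk sizes up to the cap (GPU cap 32, CPU cap 512)."""
    out, c = [], 1
    while c <= max_chunk_size:
        out.append(c)
        c *= 2
    return out

def pick_fastest(candidates, time_fn):
    """Return the candidate with the lowest measured runtime."""
    return min(candidates, key=time_fn)

gpu_cands = chunk_candidates(32)   # [1, 2, 4, 8, 16, 32]
cpu_cands = chunk_candidates(512)  # ten candidates, 1 through 512
best = pick_fastest(gpu_cands, lambda c: abs(c - 8))  # toy cost: 8 is "fastest"
```

The real method additionally tunes the tree `layout` and averages over several unique batches; the principle, exhaustive timing over a small discrete search space, is the same.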