Heuristic:Rapidsai Cuml Quantile Split Differences

Knowledge Sources	cuML, cuML Random Forest Docs
Domains	Machine_Learning, GPU_Computing
Last Updated	2026-02-08 00:00 GMT

Overview

cuML Random Forest chooses splits with a quantile-based algorithm rather than an exact count, so its results differ from scikit-learn's. Tune n_bins to control the accuracy/performance trade-off.

Description

The cuML Random Forest implementation uses a different split algorithm from scikit-learn. Instead of evaluating every possible split point (the exact method), cuML quantizes feature values into histogram bins and selects the best bin boundary as the split point. The default is 128 bins. This design enables a high degree of GPU parallelism, but it means split decisions are approximate. For highly skewed data distributions, where important split points cluster in narrow ranges, increasing n_bins provides finer granularity. The max_depth default also differs: cuML defaults to 16, while scikit-learn defaults to unlimited depth.
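To make the granularity point concrete, the sketch below compares the candidate thresholds of an exact search with those of a quantile-binned search on a skewed feature. It uses NumPy only and is not cuML's internal code; the helper names `quantile_thresholds` and `exact_thresholds`, the lognormal test data, and the choice of equal-frequency quantiles are assumptions made for illustration.

```python
import numpy as np

def quantile_thresholds(values, n_bins):
    """Candidate thresholds taken from the interior edges of n_bins quantile bins."""
    # Interior quantile levels: 1/n_bins, 2/n_bins, ..., (n_bins - 1)/n_bins
    levels = np.arange(1, n_bins) / n_bins
    # Duplicate edges collapse when many samples share the same value
    return np.unique(np.quantile(values, levels))

def exact_thresholds(values):
    """Candidate thresholds of an exact search: midpoints between sorted unique values."""
    uniq = np.unique(values)
    return (uniq[:-1] + uniq[1:]) / 2.0

rng = np.random.default_rng(0)
# Heavily right-skewed feature (illustrative data): most mass near zero, long tail
feature = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)

exact = exact_thresholds(feature)
coarse = quantile_thresholds(feature, n_bins=128)
fine = quantile_thresholds(feature, n_bins=512)

# The binned searches consider far fewer thresholds than the exact search
print(len(exact), len(coarse), len(fine))
```

With binning, the number of candidates per feature is capped by n_bins regardless of dataset size, which is why a larger n_bins helps when the useful thresholds sit close together.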

Usage

Apply this heuristic when migrating Random Forest models from scikit-learn to cuML or when comparing results between GPU and CPU implementations. Expect numerical differences in predictions. If accuracy is lower than expected, try increasing n_bins (e.g., 256 or 512) at the cost of higher memory usage and slightly slower training.

The Insight (Rule of Thumb)

Action: Be aware that cuML RF uses quantile-based splits, not exact counts. Do not expect bit-for-bit identical results with sklearn.
Value: Default n_bins=128. Increase to 256 or 512 for highly-skewed data. Decrease to 64 for faster training on well-distributed data.
Trade-off: More bins = better split approximation but more memory and slower histogram computation. Fewer bins = faster but coarser splits.
Additional Difference: cuML max_depth defaults to 16 (sklearn defaults to unlimited). Set explicitly for cross-framework consistency.
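One way to apply the points above is to set the shared hyperparameters explicitly on both sides instead of relying on defaults. The following is a minimal sketch: `matched_rf_params` and `build_models` are hypothetical helpers, not part of either library, and `build_models` assumes both scikit-learn and cuML are installed (it is defined but not called here). Matching parameters removes the max_depth discrepancy only; predictions will still differ because the split algorithms differ.

```python
def matched_rf_params(n_estimators=100, max_depth=16, n_bins=128, random_state=0):
    """Return (sklearn_kwargs, cuml_kwargs) with the shared settings made explicit."""
    shared = {
        "n_estimators": n_estimators,
        "max_depth": max_depth,  # set explicitly so both frameworks use the same cap
        "random_state": random_state,
    }
    sklearn_kwargs = dict(shared)
    # n_bins is a cuML-only parameter; scikit-learn's exact search has no equivalent
    cuml_kwargs = dict(shared, n_bins=n_bins)
    return sklearn_kwargs, cuml_kwargs

def build_models(**overrides):
    """Construct both estimators; requires scikit-learn and cuML (with a GPU)."""
    # Imports live inside the function so the helper above works without a GPU
    from sklearn.ensemble import RandomForestClassifier as SkRF
    from cuml.ensemble import RandomForestClassifier as CuRF

    sk_kwargs, cu_kwargs = matched_rf_params(**overrides)
    return SkRF(**sk_kwargs), CuRF(**cu_kwargs)

# Example: raise n_bins for skewed data while keeping depth identical on both sides
sk_kwargs, cu_kwargs = matched_rf_params(n_bins=256)
print(sk_kwargs)
print(cu_kwargs)
```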

Reasoning

GPU architectures excel at data-parallel operations. The quantile-based histogram approach converts the split-finding problem into a parallel reduction over bins, which maps efficiently to GPU hardware. The exact approach used by scikit-learn sorts feature values at each node, which is much harder to parallelize at the same granularity. The 128-bin default balances split quality (on the order of 128 candidate thresholds per feature) against memory (histogram storage scales with n_bins * n_features * n_classes). For most datasets, 128 bins capture the important structure of the feature distribution.
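The reduction over bins can be sketched on the CPU with NumPy. This is an illustration of the general histogram technique under stated assumptions, not cuML's kernel: the function names `best_binned_split` and `histogram_cells`, the use of Gini impurity, and the toy data are all choices made for this example.

```python
import numpy as np

def best_binned_split(x, y, n_bins, n_classes):
    """Pick the bin boundary that minimises weighted Gini impurity for one feature."""
    # Interior quantile edges act as the candidate thresholds
    edges = np.unique(np.quantile(x, np.arange(1, n_bins) / n_bins))
    # Bin index per sample: samples with x <= edges[i] fall in bins 0..i
    bin_idx = np.searchsorted(edges, x, side="left")
    # Per-bin class histogram, shape (len(edges) + 1, n_classes)
    hist = np.zeros((len(edges) + 1, n_classes))
    np.add.at(hist, (bin_idx, y), 1)
    # Prefix sums give the left-child class counts for every boundary at once
    left = np.cumsum(hist, axis=0)[:-1]
    right = hist.sum(axis=0) - left

    def gini(counts):
        n = counts.sum(axis=1)
        p = counts / np.maximum(n, 1)[:, None]  # guard against empty children
        return 1.0 - (p ** 2).sum(axis=1), n

    g_left, n_left = gini(left)
    g_right, n_right = gini(right)
    weighted = (n_left * g_left + n_right * g_right) / len(x)
    best = int(np.argmin(weighted))
    return edges[best], weighted[best]

def histogram_cells(n_bins, n_features, n_classes):
    """Number of histogram counters per node, following the scaling stated above."""
    return n_bins * n_features * n_classes

# Toy data (illustrative): class 0 occupies [0, 1], class 1 occupies [2, 3]
x = np.concatenate([np.linspace(0.0, 1.0, 500), np.linspace(2.0, 3.0, 500)])
y = np.repeat([0, 1], 500)

threshold, impurity = best_binned_split(x, y, n_bins=128, n_classes=2)
cells = histogram_cells(n_bins=128, n_features=100, n_classes=10)
print(threshold, impurity, cells)
```

Every step after the histogram is an elementwise operation or a prefix sum over bins, which is the kind of work that parallelizes well; the per-node sort of the exact method is replaced by a single counting pass.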

Code Evidence

Algorithm difference documentation from python/cuml/cuml/ensemble/randomforestclassifier.py:25-29:

.. note:: Note that the underlying algorithm for tree node splits differs
  from that used in scikit-learn. By default, the cuML Random Forest uses a
  quantile-based algorithm to determine splits, rather than an exact
  count. You can tune the size of the quantiles with the `n_bins`
  parameter.

n_bins parameter documentation from python/cuml/cuml/ensemble/randomforestclassifier.py:93-96:

n_bins : int (default = 128)
    Maximum number of bins used by the split algorithm per feature.
    For large problems, particularly those with highly-skewed input data,
    increasing the number of bins may improve accuracy.

max_depth default difference from python/cuml/cuml/ensemble/randomforestclassifier.py:72-77:

max_depth : int (default = 16)
    Maximum tree depth. Must be greater than 0.
    Unlimited depth (i.e, until leaves are pure)
    is not supported.
    .. note:: This default differs from scikit-learn's
      random forest, which defaults to unlimited depth.

