Heuristic:Rapidsai Cuml Dask Data Partitioning

Knowledge Sources	cuML Distributed RF OOM patterns
Domains	Distributed_Computing, Optimization
Last Updated	2026-02-08 00:00 GMT

Overview

Give each Dask worker exactly one data partition when running distributed cuML algorithms, so that cuML does not have to concatenate several partitions on one worker and spike its memory.

Description

When using cuML's Dask-based distributed algorithms (e.g., distributed Random Forest, distributed KMeans), data must be partitioned across the GPU workers. If a worker receives multiple partitions, cuML concatenates them into a single array, which causes unexpected memory spikes. The recommended configuration is exactly one partition per worker. In addition, call persist_across_workers() so that the data is materialized on the workers that will use it, rather than relying on Dask's lazy evaluation, which may move data only at compute time.

Usage

Apply this heuristic when running any distributed cuML algorithm via cuml.dask. If you encounter worker OOM errors or unexpectedly high memory usage, check that your data has one partition per GPU worker. The same memory reasoning applies to the broadcast_data option: when the model is larger than the data used for inference, consider broadcast_data=True; when the data is larger than the model, keep the default (False).

The Insight (Rule of Thumb)

Action: Partition your Dask DataFrame or array to have exactly one partition per GPU worker. Use persist_across_workers() after partitioning.
Value: n_partitions = n_workers (the number of GPUs, assuming one Dask worker per GPU).
Trade-off: Repartitioning adds a shuffle step, but prevents memory explosion from concatenation. Using broadcast_data=True replicates the full dataset to all workers (high memory) but avoids communication during predict.
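Anti-pattern: Leaving 100 partitions across 4 workers gives each worker about 25 partitions to concatenate, so each worker temporarily holds both the original partitions and the concatenated copy, roughly twice its share of the data.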
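
The action above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not cuML's own code: needs_repartition and one_partition_per_worker are hypothetical names, the input is assumed to be a Dask DataFrame (for example dask_cudf) attached to a running dask.distributed Client with one worker per GPU, and persist_across_workers is assumed to be importable from cuml.dask.common.utils. Matching the partition count to the worker count does not by itself force Dask to place one partition on each worker, so checking placement afterwards (for example with client.has_what()) is prudent.

```python
def needs_repartition(npartitions: int, n_workers: int) -> bool:
    """Return True when the partition count differs from the worker count."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    return npartitions != n_workers


def one_partition_per_worker(client, ddf):
    """Repartition a Dask DataFrame to one partition per worker, then persist it.

    Hypothetical helper: `client` is a dask.distributed.Client and `ddf` is a
    Dask DataFrame such as a dask_cudf DataFrame.
    """
    # Deferred import so this module still loads on machines without cuML.
    # Assumption: persist_across_workers lives in cuml.dask.common.utils.
    from cuml.dask.common.utils import persist_across_workers

    # Client.nthreads() maps each worker address to its thread count,
    # so its length is the number of connected workers.
    n_workers = len(client.nthreads())

    if needs_repartition(ddf.npartitions, n_workers):
        # This is the shuffle step named in the trade-off above.
        ddf = ddf.repartition(npartitions=n_workers)

    # persist_across_workers takes a list of collections and returns a list.
    (ddf,) = persist_across_workers(client, [ddf])
    return ddf
```

A Dask array would need rechunk() instead of repartition(); the sketch covers only the DataFrame case.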

Reasoning

cuML's distributed algorithms are designed around a one-partition-per-worker model. When a worker has multiple partitions assigned to it, the framework concatenates them into a single contiguous array before passing it to the GPU algorithm. This concatenation temporarily doubles memory usage (original partitions plus the concatenated result) and may exceed GPU VRAM. The Dask scheduler is not aware of GPU memory constraints, so it may assign many partitions to a single worker. Note that the source comment quoted under Code Evidence only requires at least one partition on each worker; exactly one is the configuration that avoids the concatenation step altogether.
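
The arithmetic behind this reasoning can be worked through with a rough estimate. The sketch below is illustrative only: partitions_per_worker and peak_bytes_on_worker are hypothetical helpers, the 100 MB partition size is an invented figure, and the model simply assumes the originals and the concatenated copy coexist briefly, as described above. It does not measure real GPU allocations.

```python
import math


def partitions_per_worker(n_partitions: int, n_workers: int) -> int:
    """Upper bound on partitions per worker if they are spread evenly."""
    return math.ceil(n_partitions / n_workers)


def peak_bytes_on_worker(partition_bytes: list) -> int:
    """Estimate peak bytes held while a worker concatenates its partitions.

    Assumed model: the original partitions and the concatenated result are
    both alive for a moment, so the peak is twice the worker's data size.
    A single partition needs no concatenation and stays at 1x.
    """
    total = sum(partition_bytes)
    if len(partition_bytes) <= 1:
        return total
    return 2 * total


MB = 1024 ** 2

# 100 partitions over 4 workers: 25 partitions of 100 MB on each worker.
count = partitions_per_worker(100, 4)
many_small = [100 * MB] * count
print(count)                                      # 25
print(peak_bytes_on_worker(many_small) // MB)     # 5000

# The same 2500 MB held as one partition per worker.
print(peak_bytes_on_worker([2500 * MB]) // MB)    # 2500
```

Under this model the worker holds the same 2500 MB of data in both cases, but the multi-partition layout needs about 5000 MB of headroom during concatenation.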

Code Evidence

Data distribution requirement from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:157-162:

# IMPORTANT: X is expected to be partitioned with at least one partition
# on each Dask worker being used by the forest (self.workers).

Persist pattern from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:161-162:

X_dask_cudf, y_dask_cudf = persist_across_workers(
    dask_client, [X_dask_cudf, y_dask_cudf])

Broadcast data option from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:268-274:

broadcast_data : bool (default = False)
    When set to True, the whole dataset is broadcasted to train the
    workers. When False (default), each worker is trained on its
    partition. May be advantageous when the model is larger than
    the data used for inference.
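
The broadcast rule of thumb from the Usage section could be applied as follows. This is a sketch, not part of cuML: should_broadcast_data and predict_with_rule are hypothetical helpers, the byte sizes are values the caller is assumed to have estimated, and model stands for a fitted cuml.dask Random Forest estimator whose predict() is assumed to accept a broadcast_data keyword.

```python
def should_broadcast_data(model_bytes: int, inference_data_bytes: int) -> bool:
    """Apply the rule of thumb: broadcast the data only when the model is larger."""
    return model_bytes > inference_data_bytes


def predict_with_rule(model, X, model_bytes: int, inference_data_bytes: int):
    """Call model.predict with broadcast_data chosen by the size comparison.

    Assumption: `model.predict` accepts a `broadcast_data` keyword, as the
    docstring quoted above suggests for the distributed Random Forest.
    """
    flag = should_broadcast_data(model_bytes, inference_data_bytes)
    return model.predict(X, broadcast_data=flag)
```

The sizes are deliberately passed in by the caller, because how to measure a distributed model's footprint depends on the cuML version and is outside the scope of this sketch.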

Dask shuffle workaround from python/cuml/cuml/dask/__init__.py:43-44:

# Avoid "p2p" shuffling in dask for now
dask.config.set({"dataframe.shuffle.method": "tasks"})

Related Pages

No pages currently reference this heuristic via forward links.
