Heuristic:Rapidsai Cuml Dask Data Partitioning
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Ensure exactly one data partition per Dask worker for distributed cuML algorithms to prevent memory explosion from partition concatenation.
Description
When using cuML's Dask-based distributed algorithms (e.g., distributed Random Forest, distributed KMeans), data must be properly partitioned across GPU workers. If a worker receives multiple partitions, cuML concatenates them into a single array, causing unexpected memory spikes. The optimal configuration is exactly one partition per worker. Additionally, use persist_across_workers() to ensure data is properly co-located with the workers that need it, rather than relying on Dask's lazy evaluation which may transfer data at compute time.
Usage
Apply this heuristic when running any distributed cuML algorithm via cuml.dask. If you encounter worker OOM errors or unexpectedly high memory usage, check that your data is partitioned with one partition per GPU worker. Also applies to broadcast decisions: for models larger than data, set broadcast_data=True; for data larger than model, use the default.
The Insight (Rule of Thumb)
- Action: Partition your Dask DataFrame or array to have exactly one partition per GPU worker. Use
persist_across_workers()after partitioning. - Value:
n_partitions = n_workers(number of GPUs). - Trade-off: Repartitioning adds a shuffle step, but prevents memory explosion from concatenation. Using
broadcast_data=Truereplicates the full dataset to all workers (high memory) but avoids communication during predict. - Anti-pattern: Having 100 partitions across 4 workers means 25 partitions per worker are concatenated, using 25x memory on each.
Reasoning
cuML's distributed algorithms are designed around a one-partition-per-worker model. When a worker has multiple partitions assigned to it, the framework concatenates them into a single contiguous array before passing to the GPU algorithm. This concatenation doubles memory usage temporarily (original partitions + concatenated result) and may exceed GPU VRAM. The Dask scheduler does not know about GPU memory constraints, so it may naively assign many partitions to a single worker.
Code Evidence
Data distribution requirement from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:157-162:
# IMPORTANT: X is expected to be partitioned with at least one partition
# on each Dask worker being used by the forest (self.workers).
Persist pattern from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:161-162:
X_dask_cudf, y_dask_cudf = persist_across_workers(
dask_client, [X_dask_cudf, y_dask_cudf])
Broadcast data option from python/cuml/cuml/dask/ensemble/randomforestclassifier.py:268-274:
broadcast_data : bool (default = False)
When set to True, the whole dataset is broadcasted to train the
workers. When False (default), each worker is trained on its
partition. May be advantageous when the model is larger than
the data used for inference.
Dask shuffle workaround from python/cuml/cuml/dask/__init__.py:43-44:
# Avoid "p2p" shuffling in dask for now
dask.config.set({"dataframe.shuffle.method": "tasks"})
Related Pages
No pages currently reference this heuristic via forward links.