Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Environment:Rapidsai Cuml Dask Distributed

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Distributed_Computing
Last Updated 2026-02-08 00:00 GMT

Overview

Multi-GPU distributed computing environment using Dask, dask-cuda, dask-cudf, and RAFT distributed communication for scaling cuML across multiple GPUs and nodes.

Description

This environment extends the base cuML CUDA GPU environment with distributed computing capabilities. It uses Dask for task scheduling across multiple GPU workers, dask-cuda for GPU-aware cluster management, dask-cudf for distributed GPU DataFrames, and raft-dask for NCCL/UCX-based inter-GPU communication. The environment supports both single-node multi-GPU and multi-node configurations.

Usage

Use this environment when running cuML distributed algorithms via cuml.dask. This is required for multi-GPU KMeans, NearestNeighbors, PCA, TruncatedSVD, LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, DBSCAN, and distributed Random Forest training. Single-GPU cuML does not require this environment.

System Requirements

Category Requirement Notes
Hardware Multiple NVIDIA GPUs One Dask worker spawned per GPU
Networking NVLink (recommended) For high-bandwidth GPU-to-GPU transfer on single node
Networking InfiniBand (optional) For multi-node GPU-to-GPU transfer
Protocol UCX (recommended) GPU-direct RDMA for optimal performance

Dependencies

Additional Packages (beyond base cuML)

  • dask-cudf == 26.4.*
  • raft-dask == 26.4.*
  • rapids-dask-dependency == 26.4.*
  • dask-cuda == 26.4.* (for LocalCUDACluster)
  • dask-ml >= 2024 (optional, for testing)

External Libraries

  • UCX library (optional but recommended for GPU-direct networking)
  • NCCL (included with CUDA toolkit, used by RAFT Comms)

Credentials

  • CUDA_VISIBLE_DEVICES: Controls which GPUs are assigned to Dask workers.
  • DASK_DISTRIBUTED__COMM__UCX__TCP: UCX over TCP configuration.
  • UCX_TLS: UCX transport layer selection.

Quick Install

# Install cuML with Dask extras (CUDA 12.x)
pip install cuml-cu12[dask]

# Install with conda
conda install -c rapidsai -c conda-forge -c nvidia cuml dask-cuda dask-cudf raft-dask

Code Evidence

Dask dependency validation from python/cuml/cuml/dask/__init__.py:5-25:

try:
    import dask
    import dask.distributed as _
    import dask_cudf as _
    import raft_dask as _
    del _
except ModuleNotFoundError as exc:
    from cupy.cuda import get_local_runtime_version
    ver = f"cu{str(get_local_runtime_version())[:2]}"
    raise ModuleNotFoundError(
        f"{exc!s}\n\n"
        "Not all requirements for using `cuml.dask` are installed.\n\n"
        "# Install with pip:\n"
        f"  pip install cuml-{ver}[dask]"
    ) from exc

Dask shuffle workaround from python/cuml/cuml/dask/__init__.py:43-44:

# Avoid "p2p" shuffling in dask for now
dask.config.set({"dataframe.shuffle.method": "tasks"})

Dask version compatibility from python/cuml/cuml/dask/_compat.py:7-15:

import packaging.version
@functools.lru_cache
def DASK_2025_4_0():
    return packaging.version.parse(dask.__version__) >= packaging.version.parse("2025.4.0")

Common Errors

Error Message Cause Solution
ModuleNotFoundError: No module named 'dask_cudf' Dask extras not installed pip install cuml-cu12[dask]
ModuleNotFoundError: No module named 'raft_dask' raft-dask not installed pip install raft-dask==26.4.*
UCX connection errors UCX library not installed or misconfigured Fall back to TCP protocol: LocalCUDACluster(protocol='tcp')
Worker GPU memory errors Data not evenly distributed Ensure one partition per worker; use persist_across_workers()

Compatibility Notes

  • P2P Shuffle: cuML forces Dask to use task-based shuffling instead of P2P (dask.config.set({"dataframe.shuffle.method": "tasks"})). P2P shuffling has known issues with GPU data.
  • UCX vs TCP: UCX protocol provides significantly better performance through GPU-direct RDMA but requires additional system library installation. TCP works out of the box.
  • NVLink: Enable with LocalCUDACluster(enable_nvlink=True) for single-node multi-GPU. Provides highest bandwidth between GPUs on the same node.
  • Dask Version: The _compat.py module tracks Dask API changes across versions, currently checking for Dask >= 2025.4.0.

Related Pages

No implementation pages currently reference this environment.

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment