Environment:NVIDIA NeMo Curator RAPIDS GPU Stack

Knowledge Sources
Domains: Infrastructure, GPU_Computing, Deduplication
Last Updated: 2026-02-14 16:45 GMT

Overview

NVIDIA RAPIDS GPU-accelerated stack (cuDF, cuML, CuPy, pylibcugraph) on CUDA 12, used to run GPU deduplication and embedding pipelines.

Description

This environment provides the GPU-accelerated RAPIDS libraries required by NeMo Curator's deduplication stages. The MinHash, LSH, Connected Components, KMeans clustering, and Pairwise similarity stages all import `cudf`, `cupy`, `pylibcugraph`, and `rmm` directly, with no fallback. These are hard requirements for GPU-based deduplication: if RAPIDS is not installed, the stages fail at import time with `ModuleNotFoundError` (a subclass of `ImportError`). The stack is pinned to the CUDA 12 / RAPIDS 25.10 release line.
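A preflight check along the following lines can surface a missing module before a pipeline starts. This is a minimal sketch and not part of NeMo Curator: the helper name `missing_rapids_modules` is hypothetical, and the module list is simply the four imports named in the description.

```python
import importlib.util

# Modules the deduplication stages import directly, per the description.
REQUIRED_MODULES = ("cudf", "cupy", "pylibcugraph", "rmm")


def missing_rapids_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be located.

    Hypothetical helper. find_spec only locates a top-level module; it does
    not import it, so this check does not initialize CUDA or touch the GPU.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_rapids_modules()
    if missing:
        print("Missing RAPIDS modules: " + ", ".join(missing))
    else:
        print("All required RAPIDS modules can be located.")
```

Locating a module does not guarantee that it imports cleanly (for example, when the NVIDIA driver is absent), so this only catches the "not installed" case.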

Usage

Use this environment for any GPU-accelerated deduplication workflow: exact deduplication, fuzzy (MinHash/LSH) deduplication, and semantic deduplication. It is also required for GPU-based text embedding generation that uses cuDF DataFrames.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux | Required by both NeMo Curator and RAPIDS |
| Hardware | NVIDIA GPU with CUDA 12 support | Ampere (A100) or newer recommended |
| VRAM | 16 GB+ recommended | Connected components and pairwise stages are memory-intensive |
| CUDA | CUDA 12.x toolkit | Required by the `cudf-cu12` and `cuml-cu12` packages |
| Driver | NVIDIA driver >= 525 | Required for CUDA 12 compatibility |
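The driver and VRAM requirements can be checked from Python by parsing `nvidia-smi` output. The sketch below is illustrative only: the function names are hypothetical, and the VRAM threshold is an assumption (cards marketed as 16 GB typically report somewhat less than 16384 MiB, so the cutoff is set a little lower).

```python
import subprocess

MIN_DRIVER_MAJOR = 525        # driver requirement listed above
RECOMMENDED_VRAM_MIB = 15000  # assumed cutoff for a nominal 16 GB card


def parse_gpu_line(line):
    """Parse one 'driver_version, memory.total' CSV line from nvidia-smi."""
    driver, memory_mib = [field.strip() for field in line.split(",")]
    return int(driver.split(".")[0]), int(memory_mib)


def check_gpu(line):
    """Return a list of problems for one GPU (empty if none were found)."""
    driver_major, memory_mib = parse_gpu_line(line)
    problems = []
    if driver_major < MIN_DRIVER_MAJOR:
        problems.append(f"driver {driver_major} is older than {MIN_DRIVER_MAJOR}")
    if memory_mib < RECOMMENDED_VRAM_MIB:
        problems.append(f"{memory_mib} MiB VRAM is below the recommendation")
    return problems


def query_gpus():
    """Ask nvidia-smi for driver version and total memory, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for index, line in enumerate(query_gpus()):
            print(f"GPU {index}:", check_gpu(line) or "ok")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi unavailable: {exc}")
```

Keeping the parsing separate from the `nvidia-smi` call lets the threshold logic be exercised on machines without a GPU.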

Dependencies

System Packages

  • CUDA 12.x toolkit
  • NVIDIA driver >= 525

Python Packages

  • `cudf-cu12` == 25.10.*
  • `cuml-cu12` == 25.10.*
  • `scikit-learn` < 1.8.0 (cuml 25.10 incompatible with sklearn 1.8.0)
  • `pylibcugraph-cu12` == 25.10.*
  • `pylibraft-cu12` == 25.10.*
  • `raft-dask-cu12` == 25.10.*
  • `rapidsmpf-cu12` == 25.10.*
  • `gpustat` (optional, for GPU monitoring)
  • `nvidia-ml-py` (optional, for pynvml GPU detection)
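The pins listed above can be audited against the installed distributions with the standard library's `importlib.metadata`. This is a sketch under stated assumptions: the helper names are hypothetical, the version comparison is a simplified prefix match rather than full PEP 440 handling, and the package list is copied from this page.

```python
from importlib import metadata

# RAPIDS distributions pinned to the same release line on this page.
RAPIDS_PACKAGES = (
    "cudf-cu12", "cuml-cu12", "pylibcugraph-cu12",
    "pylibraft-cu12", "raft-dask-cu12", "rapidsmpf-cu12",
)
RAPIDS_RELEASE = "25.10"


def on_release_line(version, release=RAPIDS_RELEASE):
    """True if version matches '<release>.*', e.g. '25.10.0' for '25.10'."""
    return version.split(".")[:2] == release.split(".")


def sklearn_allowed(version):
    """True if version is below 1.8.0 (assumes numeric major and minor)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 8)


def audit_environment():
    """Return {distribution: problem} for every pin that is not satisfied."""
    problems = {}
    for name in RAPIDS_PACKAGES:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not on_release_line(version):
            problems[name] = f"found {version}, expected {RAPIDS_RELEASE}.*"
    try:
        sk_version = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        problems["scikit-learn"] = "not installed"
    else:
        if not sklearn_allowed(sk_version):
            problems["scikit-learn"] = f"found {sk_version}, expected <1.8.0"
    return problems


if __name__ == "__main__":
    for package, problem in audit_environment().items():
        print(f"{package}: {problem}")
```

An empty result means every pinned distribution is installed and on the 25.10 line; because the check reads package metadata only, it runs without a GPU.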

Credentials

No additional credentials required beyond the base environment.

Quick Install

# Install NeMo Curator with RAPIDS GPU deduplication support
pip install "nemo-curator[deduplication_cuda12]"

# Or for full text curation with GPU
pip install "nemo-curator[text_cuda12]"

Code Evidence

Direct cuDF import (no fallback) from `nemo_curator/stages/deduplication/fuzzy/minhash.py:18-20`:

import cudf
import numpy as np
import rmm

Direct pylibcugraph import from `nemo_curator/stages/deduplication/fuzzy/connected_components.py:18-22`:

import cudf
from loguru import logger
from pylibcugraph import GraphProperties, MGGraph, ResourceHandle
from pylibcugraph import weakly_connected_components as pylibcugraph_wcc
from pylibcugraph.comms.comms_wrapper import init_subcomms as c_init_subcomms

Direct cupy import from `nemo_curator/stages/deduplication/semantic/pairwise.py:20-23`:

import cudf
import cupy

scikit-learn version constraint from `pyproject.toml:79`:

"scikit-learn<1.8.0",  # cuml 25.10.0 is incompatible with scikit-learn 1.8.0

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ModuleNotFoundError: No module named 'cudf'` | RAPIDS cuDF not installed | `pip install cudf-cu12==25.10.*` |
| `ModuleNotFoundError: No module named 'pylibcugraph'` | Graph library not installed | `pip install pylibcugraph-cu12==25.10.*` |
| `ImportError: libcuda.so` | NVIDIA driver not found | Install NVIDIA driver >= 525 |
| `CUDA out of memory` during connected components | Insufficient GPU VRAM | Reduce input blocksize or use a GPU with more VRAM |
| `sklearn` version conflict | scikit-learn 1.8+ installed | `pip install "scikit-learn<1.8.0"` |

Compatibility Notes

  • RAPIDS version pinning: All RAPIDS packages must be from the same 25.10 release. Mixing versions causes ABI incompatibilities.
  • CUDA 11: Not supported. NeMo Curator requires CUDA 12 packages.
  • AMD GPUs: Not supported. RAPIDS libraries are NVIDIA-only.
  • CPU fallback: None. The deduplication stages cannot run in CPU-only environments.
