# Heuristic: AllenAI Open Instruct NCCL CUMEM Disable
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Always set NCCL_CUMEM_ENABLE=0 to prevent memory fragmentation issues when using vLLM with distributed training.
## Description
NCCL's cuMem-based buffer allocation (the CUMEM feature, controlled by `NCCL_CUMEM_ENABLE`) causes memory allocation problems when used alongside vLLM's own memory management. This manifests as subtle memory fragmentation and performance degradation during distributed training. The fix is to unconditionally disable the feature by setting the environment variable at module import time, before any NCCL operation runs, because NCCL reads the variable when it initializes.
## Usage
Apply this heuristic whenever training code imports NCCL-dependent modules (utils.py, grpo_fast.py, finetune.py, dpo_tune_cache.py). The setting is already applied automatically at import time in all four main training entry points.
## The Insight (Rule of Thumb)
- Action: Set `os.environ["NCCL_CUMEM_ENABLE"] = "0"` at the very top of every training entry point, before any other imports.
- Value: Must be "0" (disabled).
- Trade-off: None observed. Disabling CUMEM has no measurable negative impact on training throughput.
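The rule above can be sketched as a minimal entry-point module. This is an illustrative layout, not the actual Open Instruct source; the key point is only the ordering: the environment variable is assigned before any NCCL-dependent import.

```python
# Minimal sketch of a training entry point (hypothetical module, not
# the real Open Instruct file): NCCL_CUMEM_ENABLE must be set before
# any NCCL-dependent library is imported, because NCCL reads the
# variable once at initialization.
import os

# Disable NCCL's cuMem-based allocation before torch/vLLM load.
os.environ["NCCL_CUMEM_ENABLE"] = "0"  # NOQA

# NCCL-dependent imports would go here, AFTER the assignment, e.g.:
# import torch  # noqa: E402


def main() -> None:
    # Any NCCL initialization happening from this point on will see
    # the variable already set to "0".
    print("NCCL_CUMEM_ENABLE =", os.environ["NCCL_CUMEM_ENABLE"])


if __name__ == "__main__":
    main()
```

Setting the variable in Python only works if it happens before the first NCCL import; once a library has initialized NCCL, changing `os.environ` has no effect on it.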
## Reasoning
The NCCL CUMEM feature (cuMem-based buffer allocation) conflicts with vLLM's custom memory management, causing fragmentation that leads to allocation failures or degraded performance during distributed training. The vLLM project explicitly documents this issue and recommends disabling the feature. Since Open Instruct uses vLLM for GRPO generation, this setting is critical.
## Code Evidence
From `open_instruct/utils.py:17-19`:

```python
# We need to set NCCL_CUMEM_ENABLE=0 for performance reasons; see:
# https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656
os.environ["NCCL_CUMEM_ENABLE"] = "0"  # NOQA
```
From `open_instruct/grpo_fast.py:37`:

```python
os.environ["NCCL_CUMEM_ENABLE"] = "0"  # NOQA
```
From `configs/beaker_configs/ray_node_setup.sh:5`:

```sh
export NCCL_CUMEM_ENABLE=0
```
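For shell-level setup like the Ray node script, a quick sanity check is to confirm the exported variable actually reaches child processes, since that is where NCCL will read it. A minimal sketch (assumes a `python3` binary on PATH):

```shell
# Export the setting, then verify a Python child process inherits it,
# mirroring how a launched training worker would see the variable.
export NCCL_CUMEM_ENABLE=0
python3 -c 'import os; assert os.environ["NCCL_CUMEM_ENABLE"] == "0"; print("ok")'
```

If the check prints `ok`, any worker launched from this shell will start with CUMEM disabled.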