Heuristic:Allenai Open instruct NCCL CUMEM Disable

Knowledge Sources
Domains Infrastructure, Distributed_Training
Last Updated 2026-02-07 00:00 GMT

Overview

Always set NCCL_CUMEM_ENABLE=0 to prevent memory fragmentation issues when using vLLM with distributed training.

Description

NCCL's cuMem allocation path, controlled by NCCL_CUMEM_ENABLE, makes NCCL allocate its buffers through the CUDA cuMem* virtual memory APIs. Used alongside vLLM's own memory management, this causes memory allocation problems that show up as subtle memory fragmentation and performance degradation. The fix is to disable the feature unconditionally by setting the environment variable at module import time, before any NCCL operations begin.

Usage

Apply this heuristic whenever training code imports NCCL-dependent modules (utils.py, grpo_fast.py, finetune.py, dpo_tune_cache.py). All four of these modules already apply the setting automatically at import time, so code that imports any of them before touching NCCL picks it up without further action.

The Insight (Rule of Thumb)

  • Action: Set `os.environ["NCCL_CUMEM_ENABLE"] = "0"` at the top of every training entry point, directly after `import os` and before any import that can load NCCL.
  • Value: Must be the string "0" (disabled).
  • Trade-off: None observed. Disabling CUMEM has no measurable negative impact on training throughput.
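
The action above can be sketched as follows. This is a minimal illustration, not code from Open Instruct: the helper `nccl_users_already_imported` and its default candidate list (`torch`, `vllm`) are hypothetical, and are included only to show one way of detecting that the variable was set too late.

```python
import os
import sys

# Set the variable before importing anything that may load NCCL.
os.environ["NCCL_CUMEM_ENABLE"] = "0"


# Hypothetical helper (not part of Open Instruct): report which
# NCCL-using modules were already imported when the check runs.
def nccl_users_already_imported(candidates=("torch", "vllm")):
    """Return the candidate module names currently present in sys.modules."""
    return [name for name in candidates if name in sys.modules]


if __name__ == "__main__":
    early = nccl_users_already_imported()
    if early:
        # The setting may be too late if NCCL was already initialized.
        print(f"warning: imported before NCCL_CUMEM_ENABLE was set: {early}")
    print("NCCL_CUMEM_ENABLE =", os.environ["NCCL_CUMEM_ENABLE"])
```

The check is advisory only: importing a module does not necessarily mean NCCL has already been initialized, so a non-empty result is a prompt to review import order rather than proof of a problem.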

Reasoning

The NCCL CUMEM feature (allocation through the CUDA cuMem* APIs) conflicts with vLLM's custom memory management, causing fragmentation that leads to allocation failures or degraded performance during distributed training. The vLLM issue thread linked from the comment in `open_instruct/utils.py` discusses this problem and recommends disabling the feature. Since Open Instruct uses vLLM for GRPO generation, this setting is critical.

Code Evidence

From `open_instruct/utils.py:17-19`:

# We need to set NCCL_CUMEM_ENABLE=0 for performance reasons; see:
# https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656
os.environ["NCCL_CUMEM_ENABLE"] = "0"  # NOQA

From `open_instruct/grpo_fast.py:37`:

os.environ["NCCL_CUMEM_ENABLE"] = "0"  # NOQA

From `configs/beaker_configs/ray_node_setup.sh:5`:

export NCCL_CUMEM_ENABLE=0
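
The evidence above sets the variable in two places: inside the Python process and in the shell that prepares each Ray node. A plausible reason, stated here as an assumption rather than something the source confirms, is that a value set in one process only reaches processes that inherit its environment. The sketch below checks that inheritance for a locally spawned child interpreter; `child_sees_cumem_disabled` is a hypothetical helper, not part of Open Instruct.

```python
import os
import subprocess
import sys

# Set in the parent before any child process is spawned.
os.environ["NCCL_CUMEM_ENABLE"] = "0"


# Hypothetical helper: confirm a child interpreter inherits the setting.
def child_sees_cumem_disabled() -> bool:
    """Spawn a child Python process and check the value it observes."""
    probe = "import os; print(os.environ.get('NCCL_CUMEM_ENABLE', 'unset'))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "0"


if __name__ == "__main__":
    print("child inherits setting:", child_sees_cumem_disabled())
```

This only demonstrates local inheritance. Processes started on other machines do not see the driver's environment unless it is passed to them explicitly, which is consistent with exporting the variable in the node setup script as well.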
