
Heuristic:Hpcaitech ColossalAI CUDA Device Max Connections Tip

From Leeroopedia



Knowledge Sources

  • Domains: Distributed_Training, Optimization
  • Last Updated: 2026-02-09 03:00 GMT

Overview

Setting `CUDA_DEVICE_MAX_CONNECTIONS=1` ensures correct ordering of communication and computation kernels during distributed training.

Description

When overlapping communication with computation in distributed training, CUDA streams can execute kernels out of order relative to their CPU-side launch order. By setting `CUDA_DEVICE_MAX_CONNECTIONS=1`, the GPU is forced to use a single connection, ensuring that communication kernels (NCCL collectives) are launched before compute kernels, matching CPU-side ordering. This is critical for correctness in pipeline-parallel and tensor-parallel training.

Usage

ColossalAI sets this automatically during initialization. It is needed whenever distributed training overlaps communication with computation, i.e., pipeline parallelism, or tensor parallelism with asynchronous communication.

The Insight (Rule of Thumb)

  • Action: Set `os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"` before initializing distributed training.
  • Value: `"1"` (single CUDA connection per device).
  • Trade-off: May slightly reduce throughput from reduced CUDA stream parallelism, but ensures correctness of communication-computation overlap.
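The action above can be sketched as follows; this is a minimal illustration, not ColossalAI's own code, and the `colossalai.launch_from_torch` call mentioned in the comment is an assumption about the usual launch path. The key constraint is that the variable must be set before the process creates its CUDA context:

```python
import os

# CUDA reads this variable when the device context is created, so it must be
# set before the first CUDA call in the process (e.g., before touching
# torch.cuda or initializing a process group with the NCCL backend).
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# ... from here on, initialize distributed training as usual, e.g.:
#   import colossalai
#   colossalai.launch_from_torch()
```

Setting the variable any later (after the CUDA context exists) has no effect, which is why ColossalAI performs this at import/initialization time.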

Reasoning

Without this setting, NVIDIA GPUs can reorder kernel launches across multiple CUDA connections, causing communication operations to be delayed behind compute operations. This leads to deadlocks or incorrect results in pipeline/tensor parallel training where communication must happen in a specific order relative to computation.

Code Evidence

From `colossalai/initialize.py:6-10`:

# set CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure that when overlapping communication and computation,
# the order of of kernel launches on GPUs are the same as on the CPU so that comm is launched first.
# see https://github.com/NVIDIA/Megatron-LM/issues/533
# https://forums.developer.nvidia.com/t/how-many-streams-maximum-number-of-streams/6571/16
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
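When the environment is configured outside Python (for example by a job scheduler or wrapper script), a small guard can fail fast if the variable is missing or overridden. The helper below is a hypothetical sketch, not part of ColossalAI:

```python
import os

def check_cuda_max_connections() -> None:
    # Hypothetical guard (not part of ColossalAI): raise before training
    # starts if the single-connection setting is not in effect.
    value = os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS")
    if value != "1":
        raise RuntimeError(
            f"CUDA_DEVICE_MAX_CONNECTIONS={value!r}; set it to '1' before "
            "the first CUDA call to keep comm/compute kernel order correct."
        )

os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
check_cuda_max_connections()  # passes silently when the value is "1"
```

Calling such a check at the top of a training script surfaces misconfiguration immediately, rather than as a hang or silent corruption deep into a pipeline-parallel run.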
