
Heuristic:Allenai Open instruct Pre Init Torch Distributed




Knowledge Sources
Domains: Distributed_Training, Debugging
Last Updated: 2026-02-07 00:00 GMT

Overview

Pre-initialize torch.distributed WITHOUT device_id before DeepSpeed to prevent NCCL hangs with multiple process groups.

Description

DeepSpeed 0.17.3+ sets `device_id` in `init_process_group`, which can cause NCCL hangs when multiple process groups coexist (e.g., the training process group and a separate weight sync process group for vLLM). If torch.distributed is initialized first without `device_id`, DeepSpeed detects the existing process group and wraps it instead of re-initializing, avoiding the hang.

Usage

Apply this heuristic whenever DeepSpeed is used with multiple NCCL process groups, as in GRPO training (a training group plus a vLLM weight sync group). It is not needed for SFT or DPO training, which use a single process group.
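
For orientation, a minimal sketch of the two-group situation this applies to is shown below. It is illustrative only: `weight_sync_group` and the choice of ranks are hypothetical stand-ins, not the actual open_instruct weight-sync path to vLLM.

import torch.distributed as dist

# Default (training) process group, created without device_id.
dist.init_process_group(backend="nccl")

# A second NCCL group, standing in for the vLLM weight-sync group.
# The ranks here are a hypothetical example; open_instruct's real sync group
# is set up differently, but the hang condition is the same: two NCCL process
# groups coexisting in one process.
weight_sync_group = dist.new_group(
    ranks=list(range(dist.get_world_size())), backend="nccl"
)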

The Insight (Rule of Thumb)

  • Action: Call `torch.distributed.init_process_group(backend="nccl")` BEFORE `deepspeed.init_distributed()`.
  • Value: N/A (ordering constraint, not a value).
  • Trade-off: None. The pre-initialization is harmless and prevents a critical hang.

Reasoning

When DeepSpeed initializes torch.distributed with `device_id`, it creates a process group that is pinned to a specific GPU. If another process group is later created (e.g., for vLLM weight sync over NCCL), the two groups can deadlock during collective operations because of this device pinning. Pre-initializing without `device_id` creates a more flexible process group that DeepSpeed can wrap.
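
To make the pinning concrete, a device-bound initialization looks roughly like the sketch below. This is a hedged approximation of what DeepSpeed 0.17.3+ does when it creates the group itself, not a quote of DeepSpeed's source.

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])

# Device-pinned initialization: the default group is bound to one CUDA device.
# Roughly what DeepSpeed 0.17.3+ does internally; it is this binding that can
# deadlock once a second NCCL group (e.g., for vLLM weight sync) is created.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)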

Code Evidence

From `open_instruct/grpo_fast.py:212-218`:

# Pre-initialize torch.distributed WITHOUT device_id to avoid NCCL hangs.
# DeepSpeed 0.17.3 and up sets device_id in init_process_group which can cause hangs
# when multiple process groups exist (e.g., for weight sync to vLLM).
# By initializing first, DeepSpeed will detect it and wrap it instead of re-initializing.
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(backend="nccl", timeout=timedelta(minutes=args.backend_timeout))
deepspeed.init_distributed(timeout=timedelta(minutes=args.backend_timeout))
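
For context, a self-contained version of the same pattern is sketched below, with the imports the excerpt assumes. `BACKEND_TIMEOUT_MINUTES` is a hypothetical stand-in for `args.backend_timeout`.

from datetime import timedelta

import deepspeed
import torch

BACKEND_TIMEOUT_MINUTES = 120  # hypothetical stand-in for args.backend_timeout

if not torch.distributed.is_initialized():
    # Create the default group first, without device_id, so DeepSpeed wraps it.
    torch.distributed.init_process_group(
        backend="nccl", timeout=timedelta(minutes=BACKEND_TIMEOUT_MINUTES)
    )
deepspeed.init_distributed(timeout=timedelta(minutes=BACKEND_TIMEOUT_MINUTES))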
