Heuristic: allenai/open-instruct Pre-Initialize Torch Distributed
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Debugging |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Pre-initialize torch.distributed WITHOUT device_id before DeepSpeed to prevent NCCL hangs with multiple process groups.
Description
DeepSpeed 0.17.3+ sets `device_id` in `init_process_group`, which causes NCCL hangs when multiple process groups coexist (e.g., the training process group and a separate weight sync process group for vLLM). By initializing torch.distributed first without device_id, DeepSpeed detects the existing initialization and wraps it instead of re-initializing, avoiding the hang.
Usage
Apply this heuristic whenever using DeepSpeed with multiple NCCL process groups, as in GRPO training (training group + vLLM weight sync group). It is not needed for SFT or DPO training, which use a single process group.
The Insight (Rule of Thumb)
- Action: Call `torch.distributed.init_process_group(backend="nccl")` BEFORE `deepspeed.init_distributed()`.
- Value: N/A (ordering constraint, not a value).
- Trade-off: None. The pre-initialization is harmless and prevents a critical hang.
Reasoning
When DeepSpeed initializes distributed with `device_id`, it creates a process group that is pinned to a specific GPU. If another process group is later created (e.g., for vLLM weight sync via NCCL), the two groups can deadlock during collective operations because of the device_id constraint. Pre-initializing without device_id creates a more flexible process group that DeepSpeed can wrap.
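The wrap-detection behavior can be sketched in a single process. This is a minimal sketch, not the repository's code: it swaps `backend="gloo"` for `"nccl"` so it runs without GPUs, and the `MASTER_PORT` value and single-rank rendezvous settings are illustrative assumptions.

```python
import os
from datetime import timedelta
import torch.distributed as dist

# Single-process demo rendezvous (hypothetical values; real runs get these
# from the launcher). The real code uses backend="nccl" on multi-GPU jobs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Step 1: pre-initialize WITHOUT device_id, so the default process group is
# not pinned to a specific accelerator.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", timeout=timedelta(minutes=5))

# Step 2: a later initializer (DeepSpeed's init_distributed performs this
# same check) sees the existing default group and reuses it instead of
# calling init_process_group again with device_id set.
assert dist.is_initialized()

dist.destroy_process_group()
```

Because the guard in step 1 matches the check DeepSpeed itself performs, the ordering constraint is the entire fix: no configuration flag is required, only that the unpinned group exists first.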
Code Evidence
From `open_instruct/grpo_fast.py:212-218`:
```python
# Pre-initialize torch.distributed WITHOUT device_id to avoid NCCL hangs.
# DeepSpeed 0.17.3 and up sets device_id in init_process_group which can cause hangs
# when multiple process groups exist (e.g., for weight sync to vLLM).
# By initializing first, DeepSpeed will detect it and wrap it instead of re-initializing.
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(backend="nccl", timeout=timedelta(minutes=args.backend_timeout))
deepspeed.init_distributed(timeout=timedelta(minutes=args.backend_timeout))
```